The `Cleaner` class cleans text using spaCy.

**Args:**

- `model`: The `spaCy` model to use.
- `remove_numbers`: Remove numbers from the text.
- `remove_punctuation`: Remove punctuation from the text.
- `remove_pos`: A list of POS tags to remove. For example, if you want to remove all nouns, you can pass in `["NOUN"]`.
- `remove_stopwords`: Remove stopwords from the text.
- `remove_email`: Remove email addresses from the text.
- `remove_url`: Remove URLs from the text.
- `lemmatize`: Lemmatize tokens instead of keeping their original text.

**Raises:**

- `SpacyCleanerMisconfigurationError`: When attempting to lemmatize when a `lemmatizer` is not in the model pipeline.

**Example:**

```python
nlp = spacy.load("en_core_web_sm")
cleaner = Cleaner(
    model=nlp,
    lemmatize=True,
    remove_numbers=True,
)
raw_texts = cleaner.clean(raw_texts)
```

The class itself:

```python
import warnings
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union, no_type_check

from spacy.language import Language, _AnyContext
from spacy.tokens import Doc, Token
from spacy.util import SimpleFrozenList

# `BaseCleaner` and `SpacyCleanerMisconfigurationError` are defined elsewhere
# in the spacy-cleaner package.


class Cleaner(BaseCleaner):
    """Cleans text using spaCy!"""

    def __init__(
        self,
        model: Language,
        remove_numbers: bool = False,
        remove_punctuation: bool = True,
        remove_pos: Optional[List[str]] = None,
        remove_stopwords: bool = True,
        remove_email: bool = True,
        remove_url: bool = True,
        lemmatize: bool = False,
    ) -> None:
        super().__init__(model)
        if remove_pos is not None and "tagger" not in model.pipe_names:
            warnings.warn(
                "A `tagger` is not in your model pipeline. "
                "POS tags will not be removed."
            )
        if lemmatize and "lemmatizer" not in model.pipe_names:
            raise SpacyCleanerMisconfigurationError(
                "A `lemmatizer` is not in your model pipeline."
            )
        self.remove_numbers = remove_numbers
        self.remove_punctuation = remove_punctuation
        self.remove_pos = remove_pos
        self.remove_stopwords = remove_stopwords
        self.remove_email = remove_email
        self.remove_url = remove_url
        self.lemmatize = lemmatize

    # noinspection PyTypeChecker,PyDefaultArgument,PydanticTypeChecker
    @no_type_check
    def clean(  # noqa: F811
        self,
        texts: Union[
            Iterable[Union[str, Doc]],
            Iterable[Tuple[Union[str, Doc], _AnyContext]],
        ],
        *,
        as_tuples: bool = False,
        batch_size: Optional[int] = None,
        disable: Iterable[str] = SimpleFrozenList(),
        component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
        n_process: int = 1,
    ) -> List[str]:
        """Clean a stream of texts.

        Args:
            texts: A sequence of texts or docs to process.
            as_tuples: If set to True, inputs should be a sequence of
                (text, context) tuples. Output will then be a sequence of
                (doc, context) tuples.
            batch_size: The number of texts to buffer.
            disable: The pipeline components to disable.
            component_cfg: An optional dictionary with extra keyword
                arguments for specific components.
            n_process: Number of processors to process texts. If `-1`,
                set `multiprocessing.cpu_count()`.

        Returns:
            A list of cleaned strings in the order of the original text.
        """
        # Delegate batching and multiprocessing to spaCy's `Language.pipe`,
        # cleaning each resulting document in order.
        return [
            self._clean_doc(doc)
            for doc in self.model.pipe(
                texts,
                as_tuples=as_tuples,
                batch_size=batch_size,
                disable=disable,
                component_cfg=component_cfg,
                n_process=n_process,
            )
        ]

    def _clean_doc(self, doc: Doc) -> str:
        """Cleans a `spaCy` document.

        If the token is allowed, then append the token to the list of
        tokens.

        Returns:
            A string of the cleaned document.
        """
        tokens = []
        for tok in doc:
            if not self._allowed_token(tok):
                continue
            if self.lemmatize:
                tokens.append(tok.lemma_)
            else:
                tokens.append(tok.text)
        return " ".join(tokens).lower()

    def _allowed_token(self, tok: Token) -> bool:
        """Checks if a token is allowed.

        If the token does not meet the conditions then it is allowed.

        Returns:
            True if the token is allowed and False if it is not allowed.
        """
        if self.remove_stopwords and tok.is_stop:
            return False
        elif self.remove_punctuation and tok.is_punct:
            return False
        elif self.remove_numbers and tok.like_num:
            return False
        elif self.remove_email and tok.like_email:
            return False
        elif self.remove_url and tok.like_url:
            return False
        elif self.remove_pos is not None and tok.pos_ in self.remove_pos:
            return False
        elif tok.text.strip() == "":
            return False
        else:
            return True
```
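The token-filtering logic above boils down to: drop a token if any enabled removal rule matches it, otherwise keep it, then join and lowercase the survivors. A minimal, dependency-free sketch of that idea follows; `FakeToken`, `allowed`, and `clean_tokens` are hypothetical names for illustration, not part of `spacy-cleaner` or spaCy:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeToken:
    """Minimal stand-in for a spaCy `Token`, exposing only the flags filtered on."""
    text: str
    is_stop: bool = False
    is_punct: bool = False
    like_num: bool = False


def allowed(tok: FakeToken, remove_stopwords: bool = True,
            remove_punctuation: bool = True, remove_numbers: bool = False) -> bool:
    """Return True unless the token matches an enabled removal rule."""
    if remove_stopwords and tok.is_stop:
        return False
    if remove_punctuation and tok.is_punct:
        return False
    if remove_numbers and tok.like_num:
        return False
    return tok.text.strip() != ""  # whitespace-only tokens are never allowed


def clean_tokens(tokens: Iterable[FakeToken], **rules: bool) -> str:
    # Keep allowed tokens, join them, and lowercase the result.
    return " ".join(t.text for t in tokens if allowed(t, **rules)).lower()


toks = [
    FakeToken("The", is_stop=True),
    FakeToken("Cat"),
    FakeToken(",", is_punct=True),
    FakeToken("sat"),
    FakeToken("42", like_num=True),
]
print(clean_tokens(toks))                       # cat sat 42
print(clean_tokens(toks, remove_numbers=True))  # cat sat
```

Because each rule is only consulted when its flag is enabled, disabled options cost nothing and new rules can be appended without touching the existing ones.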