CN114003682A - Text classification method, device, equipment and storage medium - Google Patents

Text classification method, device, equipment and storage medium

Info

Publication number
CN114003682A
CN114003682A (application CN202111281042.3A)
Authority
CN
China
Prior art keywords
classified
text
text data
category
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111281042.3A
Other languages
Chinese (zh)
Inventor
何免
龚骋伟
符国辉
何保健
杨晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongdun Technology Co ltd
Original Assignee
Tongdun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongdun Technology Co ltd filed Critical Tongdun Technology Co ltd
Priority to CN202111281042.3A
Publication of CN114003682A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/3332 Query translation
    • G06F 16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F 16/35 Clustering; Classification
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a text classification method, device, equipment and storage medium. The method comprises the following steps: converting text data to be classified into a text vector to be classified; inputting the text vector to be classified into a pre-established Faiss index model for similarity retrieval, so as to output the N text vectors in the training set with the highest similarity to the text vector to be classified, and taking the text data corresponding to the output text vectors as a candidate set, where the text data in the training set carry category labels that characterize the categories to which they belong; obtaining, with the TF-IDF technique, the feature words of the category corresponding to each category label in the training set; and determining the category of the text data to be classified according to the candidate categories and the feature words, corresponding to those candidate categories, contained in the text data to be classified. The method thereby improves both the speed and the accuracy of text classification.

Description

Text classification method, device, equipment and storage medium
Technical Field
The application relates to text classification technology, and provides a text classification method, a text classification device, text classification equipment and a storage medium.
Background
With the advent of the mobile internet era, the production and propagation of text information have changed profoundly. To meet users' diversified demands against this background of information explosion, text information needs to be organized effectively, and information retrieval and data mining have gradually drawn attention as ways of organizing text information effectively.
Text classification technology is the basis of data mining and information retrieval; its main task is to assign texts to categories according to their content. Existing approaches mainly include text classification based on support vector machines and text classification based on convolutional neural networks. Support-vector-machine classification constructs text features from word frequencies and ignores the semantic information in the text; it works well on small-scale data sets, but the improvement is not obvious on large-scale data sets. Convolutional-neural-network classification extracts text features with convolution kernels of different lengths; although semantic information is considered, it is limited by the kernel length, so semantic information is easily lost for long sentences, and part of the semantic features are lost during max pooling, leading to inaccurate classification. How to classify texts accurately and rapidly on large-scale data sets is therefore an urgent problem to be solved.
Disclosure of Invention
The application aims to provide a text classification method, device, equipment and storage medium that convert the text classification problem into a vector similarity retrieval problem, improving the speed of text classification, and that select the most appropriate category for the data to be classified from a candidate set by combining the feature words under each category, improving the accuracy of classification.
The application provides a text classification method, which comprises the following steps: converting text data to be classified into a text vector to be classified; inputting the text vector to be classified into a pre-established Faiss index model for similarity retrieval, so as to output the N text vectors in the training set with the highest similarity to the text vector to be classified, and taking the text data corresponding to the output text vectors as a candidate set, N being a positive integer greater than or equal to 2, where the text data in the training set carry category labels that characterize the categories to which the text data belong; obtaining, according to the TF-IDF technique, the feature words of the category corresponding to each category label in the training set; and taking the categories corresponding to the text data in the candidate set as candidate categories, and determining the category of the text data to be classified according to the candidate categories and the feature words, corresponding to the candidate categories, contained in the text data to be classified.
Further, before converting the text data to be classified into the text vector to be classified, the method further comprises: determining the stop-word lexicon corresponding to the service scene to which the text data to be classified belongs; and preprocessing the text data to be classified according to the stop-word lexicon, so as to filter out the stop words in the text data to be classified.
Further, converting the text data to be classified into the text vector to be classified includes: inputting the preprocessed text data to be classified into the pre-trained BERT model to obtain the text vector to be classified.
Further, before inputting the text vector to be classified into the pre-established Faiss index model for similarity retrieval, the method includes: establishing the Faiss index model based on the text vectors corresponding to the text data in the training set and an internal clustering index.
Further, after establishing the Faiss index model based on the training-set vectors and the internal clustering index, the method further includes: transmitting the Faiss index model to each executor on a distributed computing cluster through a broadcast mechanism. Inputting the text vector to be classified into the Faiss index model for similarity retrieval then comprises: performing, through a computing engine, similarity retrieval on the text vector to be classified with the Faiss index model in each executor.
Further, obtaining the feature words of the category corresponding to each category label in the training set according to the TF-IDF technique includes: performing word segmentation on the text data of each category in the training set to obtain the word segmentation result of each category; calculating the TF-IDF value of each word in the text data of each category according to the word segmentation results and the TF-IDF technique; and selecting the K words with the highest TF-IDF values in each category as the feature words of that category, K being a positive integer.
Further, determining the category of the text data to be classified according to the candidate categories and the feature words, corresponding to the candidate categories, contained in the text data to be classified includes: performing deduplication on the candidate categories to obtain the target candidate categories; performing word segmentation on the text data to be classified, searching the resulting segmentation for the feature words corresponding to each target candidate category, and taking the feature words found as target feature words; calculating the target TF-IDF value of the text to be classified under each target candidate category according to the TF-IDF values of the target feature words and the number of times each target feature word occurs in the text data to be classified; and taking the target candidate category with the highest target TF-IDF value as the category of the text data to be classified.
The present application further provides a text classification device, including: a vector conversion module, used for converting the text data to be classified into a text vector to be classified; a data indexing module, used for inputting the text vector to be classified into a pre-established Faiss index model for similarity retrieval, so as to output the N text vectors in the training set with the highest similarity to the text vector to be classified, and taking the text data corresponding to the output text vectors as a candidate set, N being a positive integer greater than or equal to 2, where the text data in the training set carry category labels that characterize the categories to which the text data belong; a feature acquisition module, used for obtaining, according to the TF-IDF technique, the feature words of the category corresponding to each category label in the training set; and a text classification module, used for taking the categories corresponding to the text data in the candidate set as candidate categories, and determining the category of the text data to be classified according to the candidate categories and the feature words, corresponding to the candidate categories, contained in the text data to be classified.
The present application further proposes an electronic device, which includes: a memory storing computer readable instructions; a processor reading computer readable instructions stored by the memory to perform the method as described above.
The present application also proposes a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor of a computer, cause the computer to perform the method as described above.
Compared with the prior art, the application has the following beneficial effects:
In the technical scheme provided by the application, text data to be classified is converted into a text vector to be classified; the text vector to be classified is input into a pre-established Faiss index model for similarity retrieval, so as to output the N text vectors in the training set with the highest similarity to it (N being a positive integer greater than or equal to 2), and the text data corresponding to the output text vectors are taken as a candidate set; the text data in the training set carry category labels characterizing the categories to which they belong; the feature words of the category corresponding to each category label in the training set are obtained with the TF-IDF technique; and the category of the text data to be classified is determined according to the candidate categories corresponding to the text data in the candidate set and the feature words, corresponding to those candidate categories, contained in the text data to be classified. Performing similarity retrieval on the vectorized text data through a Faiss index model converts the text classification problem into a vector similarity retrieval problem, which improves the speed of text classification, since the Faiss index is highly efficient at searching massive data. Meanwhile, the feature words of each category in the training set are obtained through the TF-IDF technique, and the category of the text data to be classified is determined, from among the candidate categories of the similarity-based candidate set, by the feature words of those categories contained in the text data to be classified; this avoids the inaccuracy of determining the category from text similarity alone.
Drawings
FIG. 1 is a flow diagram illustrating a method of text classification in an exemplary embodiment of the present application;
FIG. 2 is a flow chart in an exemplary embodiment before step S110 in the embodiment shown in FIG. 1;
FIG. 3 is a flow chart of step S130 in the embodiment shown in FIG. 1 in an exemplary embodiment;
FIG. 4 is a flow chart in an exemplary embodiment of step S140 in the embodiment shown in FIG. 1;
FIG. 5 is a flow diagram illustrating another method of text classification in an exemplary embodiment of the present application;
FIG. 6 is a flow diagram of the selection of categories for data to be classified in the embodiment shown in FIG. 5 in an exemplary embodiment;
FIG. 7 is a schematic diagram of a text classification apparatus shown in an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of another text classification apparatus shown in an exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It should also be noted that reference to "a plurality" in this application means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
Referring to fig. 1, fig. 1 is a flowchart illustrating a text classification method according to an exemplary embodiment, where the text classification method includes steps S110 to S140, and the following is described in detail:
and S110, converting the text data to be classified into text vectors to be classified.
In this embodiment, the text data to be classified is text data that needs to be classified and whose category has not yet been determined; it is converted into a vector representation so as to better support the subsequent text similarity calculation and text classification.
It should be noted that before the text data to be classified is converted into the text vector to be classified, the text data to be classified may also be preprocessed, and stop words in the text data to be classified are filtered, so as to reduce the negative influence of the stop words on model classification. Specifically, as shown in fig. 2, fig. 2 is a flowchart in an exemplary embodiment before step S110 in the embodiment shown in fig. 1, and the text classification method further includes:
s210, determining a deactivation word bank corresponding to a service scene to which the text data to be classified belongs;
in this embodiment, the service scene to which the text data to be classified belongs is a service scene to which the text data to be classified is applied, such as a news scene, a game scene, and the like; in an example of this embodiment, a service scenario to which text data to be classified belongs may be determined according to a format of the text data to be classified, for example, the text data belonging to a news scenario has a news format, that is, consists of a title, a heading, a body, and a tail; the text data belonging to the contract scene has a title, a main body, a text, a signature and the like; in another example of this embodiment, a service scene to which the text data to be classified belongs may also be determined according to high-frequency scene words appearing in the text data to be classified, for example, if 5 times of scene words "game" appear in the text data to be classified, the service scene to which the text data to be classified belongs is taken as a game scene.
In this embodiment, a stop-word lexicon is set up in advance for each service scene; the lexicons of different service scenes may be the same or different. The stop-word lexicon contains the words to be filtered out, for example words common in the scene's domain, such as "news" in news-type texts or "contract" in contract-type texts. The stop-word lexicon also contains auxiliary words, adverbs, prepositions, conjunctions and the like, which generally have no definite meaning of their own and only play a grammatical role within a complete sentence.
S220, preprocessing the text data to be classified according to the stop-word lexicon, so as to filter out the stop words in the text data to be classified.
The text data to be classified is searched for stop words appearing in the stop-word lexicon, and any such stop words found are deleted from it. The purpose of doing so is to prevent these words from interfering with the subsequent feature extraction, thereby improving the accuracy of classification.
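As a minimal illustration of this preprocessing step, the following Python sketch filters scene-specific stop words before vectorization; the lexicon contents, scene names and the frequency-based scene-detection rule are assumptions made for the example, not details taken from the embodiment.

```python
# A sketch of stop-word filtering per service scene (steps S210-S220).
# Lexicon contents and the scene-detection heuristic are illustrative.
STOPWORD_LEXICONS = {
    "news": {"news", "report", "headline"},
    "contract": {"contract", "party", "hereby"},
}

def detect_scene(text: str) -> str:
    # Assumption: choose the scene whose lexicon words occur most often.
    counts = {scene: sum(text.count(w) for w in words)
              for scene, words in STOPWORD_LEXICONS.items()}
    return max(counts, key=counts.get)

def filter_stopwords(text: str) -> str:
    scene = detect_scene(text)
    # Whitespace tokenization is a placeholder; Chinese text would first
    # need the word segmentation described later in this embodiment.
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOPWORD_LEXICONS[scene])

print(filter_stopwords("news report on a contract signing"))
```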
In this embodiment, step S110 of converting the text data to be classified into the text vector to be classified specifically includes: inputting the preprocessed text data to be classified into a BERT (Bidirectional Encoder Representations from Transformers) model to obtain the text vector to be classified. BERT is a pre-trained model whose core is a multi-layer self-attention mechanism that continually computes the correlations between words and fuses the text information into a vector representation. Through multiple layers of self-attention, not only can the shallow semantics of the text be extracted, but the deep semantics can also be mined. More importantly, BERT is pre-trained on a large amount of data with its parameters optimized, i.e., the BERT model has mastered some "common sense" of language. Extracting semantic features with pre-trained BERT to vectorize the text therefore better supports text similarity calculation and facilitates the subsequent text classification.
In one example of this embodiment, text data is converted into text vectors through bert-as-service, an open-source BERT service from Tencent AI Lab that lets users call BERT models as a service without worrying about the implementation details. bert-as-service is divided into a client and a server; users can call the service from Python code or access it over HTTP. Text data can be converted into vector form through bert-as-service, and every text is encoded as a 768-dimensional vector, which facilitates the subsequent construction of the Faiss index.
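As an illustration, a minimal client-side sketch using the bert-serving-client package follows; it assumes a bert-as-service server has already been started separately (for example with bert-serving-start pointing at a downloaded BERT model), and the input texts are placeholders.

```python
# A sketch of the vectorization step via bert-as-service. encode() returns
# one 768-dimensional vector per input text (server assumed to be running).
from bert_serving.client import BertClient

bc = BertClient()            # connects to localhost:5555/5556 by default
texts = ["待分类的文本数据", "训练集中的一条文本"]
vectors = bc.encode(texts)   # numpy array of shape (2, 768)
print(vectors.shape)
```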
S120, inputting the text vector to be classified into a pre-established Faiss index model for similarity retrieval, so as to output the N text vectors in the training set with the highest similarity to the text vector to be classified, and taking the text data corresponding to the output text vectors as a candidate set, where N is a positive integer greater than or equal to 2.
In this embodiment, the Faiss index model is established based on the text vectors corresponding to the text data in the training set. The text data in the training set carry category labels characterizing the categories to which they belong; that is, the training-set text data are already-classified texts, the category label indicates which category each belongs to, and these data are used for model training. Specifically, the text data in the training set are converted into training-set text vectors — likewise by inputting them into the BERT model — and the training-set text vectors are input into Faiss to build an index, yielding the Faiss index model. Faiss (Facebook AI Similarity Search) is an open-source library that provides efficient and reliable similarity retrieval over massive data in high-dimensional spaces. Faiss improves on the brute-force search algorithm in two respects: it reduces the memory footprint and accelerates retrieval. At the heart of Faiss is the concept of an index, which encapsulates a set of vectors for efficient vector retrieval.
Faiss provides multiple ways to construct an index: the most basic brute-force search over Euclidean distances, IndexFlatL2; memory-efficient indexes over compressed data, such as Principal Component Analysis (PCA) and Product Quantization (PQ); and the IVF (Inverted File) index, which uses internal clustering to speed up retrieval. In this embodiment, the index is built in the IVF internal-clustering mode, i.e., the Faiss index model is established with an internal clustering index; the IVF index accelerates retrieval while preserving classification accuracy.
The training-set text vectors are input into Faiss for training, during which the Faiss model clusters them by vector similarity. After training, the Faiss index model is serialized out of memory and stored as a physical file for subsequent use.
In this embodiment, the text vector to be classified is input into the Faiss index model for similarity retrieval, and the Faiss index model finds the training-set text vectors with the highest similarity to it. Because each text vector corresponds to one piece of text data in the training set, once the most similar text vectors are found, their corresponding text data can be obtained. Since the index is an IVF index, the interior of the Faiss index model is clustered by vector similarity, and retrieval for an input vector first compares it only against the centroid of each cluster, which accelerates the search. The index then outputs the N text vectors most similar to the text vector to be classified, and the corresponding top-N most similar pieces of training-set text data are taken as the candidate set. The single most similar piece of text data may not actually belong to the same category as the text data to be classified; when the similarities are close, the expectation is rather that the text data to be classified shares many feature words with its true category. Outputting the top-N most similar training-set texts as a candidate set therefore improves the accuracy of the subsequent classification and avoids the unreasonableness of simply taking the category of the single most similar text as the category of the text data to be classified.
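A minimal sketch of this IVF-based construction and retrieval with the Faiss Python API follows; the 768 dimensions match the BERT vectors above, while the stand-in corpus, the number of clusters nlist, N and the file name are illustrative assumptions.

```python
# A sketch of step S120: build an IVF index over training-set vectors,
# serialize it to a file, and retrieve the N most similar training texts.
import numpy as np
import faiss

d = 768                                   # BERT vector dimension
train_vectors = np.random.rand(10000, d).astype("float32")  # stand-in data

nlist = 100                               # number of internal clusters
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer (L2 distance)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(train_vectors)                # cluster the training vectors
index.add(train_vectors)                  # row id i maps back to text i

faiss.write_index(index, "faiss_ivf.index")   # store as a physical file

query = np.random.rand(1, d).astype("float32")  # vectorized text to classify
N = 10
distances, ids = index.search(query, N)   # ids of the N most similar texts
candidate_ids = ids[0]                    # look up texts/labels by these ids
```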
In this embodiment, BERT has a strong feature extraction capability for the semantics of text, and Faiss is highly efficient at retrieval over massive data; combining the two improves both the speed and the accuracy of text classification.
It should be noted that, after the Faiss index model is built with the internal clustering index, the text classification method provided in this embodiment further includes: transmitting the Faiss index model to each executor on the distributed computing cluster through a broadcast mechanism. In this case, inputting the text vector to be classified into the Faiss index model for similarity retrieval includes: performing, through a computing engine, similarity retrieval on the text vector to be classified with the Faiss index model in each executor. Broadcasting the Faiss index model to every executor on the distributed computing cluster deploys it in a distributed fashion; the computing engine then uses the local Faiss index model in each executor for similarity retrieval, achieving distributed use of the model and increasing the speed of text retrieval.
Illustratively, a broadcast mechanism is used to transmit the trained model once to each executor on a Hadoop cluster (Hadoop is a software framework for the distributed processing of large amounts of data) coordinated by ZooKeeper (a distributed, open-source coordination service for distributed applications, an implementation of Google's Chubby and an important component of Hadoop and HBase), and Spark (a fast, general-purpose computing engine designed for large-scale data processing) then carries out the retrieval and classification in each executor, accelerating the computation.
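The following PySpark sketch illustrates this distribution pattern under stated assumptions: Faiss is installed on every executor, the index file from the previous step exists, and the input vectors are stand-ins.

```python
# A sketch of broadcasting a serialized Faiss index so that each Spark
# executor deserializes it once and performs the retrieval locally.
import faiss
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("faiss-classify").getOrCreate()
sc = spark.sparkContext

index = faiss.read_index("faiss_ivf.index")
index_bytes = faiss.serialize_index(index)   # numpy uint8 array
bc_index = sc.broadcast(index_bytes)         # shipped once per executor

def search_partition(rows):
    local = faiss.deserialize_index(bc_index.value)  # rebuilt once per partition
    for vec in rows:
        q = np.asarray(vec, dtype="float32").reshape(1, -1)
        _, ids = local.search(q, 10)
        yield ids[0].tolist()

vectors_rdd = sc.parallelize(np.random.rand(100, 768).tolist())
candidates = vectors_rdd.mapPartitions(search_partition).collect()
print(candidates[:2])
```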
And S130, acquiring the feature words of the categories corresponding to the labels of the categories in the training set according to the TF-IDF technology.
It can be understood that, because the text data of the training set is attached with the class label, and the class label is used for characterizing the class to which the text data belongs, the training set includes various classification classes, and each classification class has a feature word representing the class, that is, the feature word is used for characterizing the class characteristics.
In an example of the present application, as shown in fig. 3, fig. 3 is a flowchart in an exemplary embodiment of step S130 in the embodiment shown in fig. 1, and a process of acquiring feature words of a category corresponding to each category label in a training set includes:
s131, performing word segmentation processing on the text data of each category in the training set to obtain word segmentation results of the text data of each category;
Word segmentation means cutting a sequence of Chinese characters into independent words. The text data of each category in the training set may be segmented with a dictionary-based segmentation algorithm and/or a statistics-based machine learning algorithm. A dictionary-based segmentation algorithm matches the string to be segmented against the entries of a sufficiently large, pre-built dictionary according to some strategy; if an entry is found, the match succeeds and the word is recognized. Dictionary-based algorithms include the forward maximum matching method, the reverse maximum matching method, the bidirectional matching method, and so on. A statistics-based machine learning algorithm treats segmentation as a sequence labelling problem in which each character is labelled according to its position within a word; such models include the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME) and the Conditional Random Field model (CRF).
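The embodiment does not name a specific tokenizer; as one common open-source example, the following sketch uses jieba, which combines a prefix dictionary with an HMM for unknown words, i.e., it mixes the dictionary-based and statistics-based approaches described above.

```python
# A sketch of the word-segmentation step using the jieba tokenizer
# (an illustrative choice; the embodiment names no particular tool).
import jieba

text = "这是一条待分类的文本数据"
tokens = jieba.lcut(text)   # e.g. ['这是', '一条', '待', '分类', '的', '文本', '数据']
print(tokens)
```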
For example, suppose the training set has categories A–C, where category A comprises text data a1 and a2, category B comprises text data b1 and b2, and category C comprises text data c1–c3. Segmenting a1, a2, b1, b2 and c1–c3 respectively yields the segmentation results a1′, a2′, b1′, b2′ and c1′–c3′.
S132, calculating TF-IDF values of all words in the text data of each category according to the word segmentation result and the TF-IDF technology.
TF-IDF (term frequency–inverse document frequency) is a common weighting technique used in information retrieval and data mining to evaluate the importance of a word to a document in a document set or corpus. TF is the term frequency and IDF is the inverse document frequency. The main idea of TF-IDF is: if a word or phrase appears frequently in one article (high TF) but rarely in other articles (high IDF), it is considered to have good discriminative power and to be suitable for classification — exactly the kind of key feature word required. On this principle, several feature words can be found in each text and their TF-IDF weights calculated. TF-IDF values lie in the range [0, 1], and the TF-IDF values of all the words in the same text sum to 1, where

$$\mathrm{TF\text{-}IDF}(t,d)=\mathrm{TF}(t,d)\times\mathrm{IDF}(t),\qquad \mathrm{TF}(t,d)=\frac{n_{t,d}}{\sum_{k} n_{k,d}},\qquad \mathrm{IDF}(t)=\log\frac{|D|}{1+\left|\{d'\in D: t\in d'\}\right|},$$

with $n_{t,d}$ the number of occurrences of word $t$ in document $d$ and $D$ the document set (the patent's formula image is not reproduced here; these are the standard definitions that the surrounding text describes).
for example, in a news report reporting agricultural cultivation in China, the most common words (the 'of', the 'of' and the 'of' are ') are calculated and determined by a TF-IDF technology to obtain the minimum TF-IDF value, the more common words (the' of 'and' the 'of' and the 'of' are ') are calculated and determined to obtain the smaller TF-IDF value, the less common words (the' of 'and the' of 'are') are calculated and determined to obtain the words with strong depicting ability, and the TF-IDF value is required to be the highest.
S133, selecting K words with the highest TF-IDF value in each category as feature words of the category, wherein K is a positive integer.
For each category, the TF-IDF values of all the words in the text data under that category are counted, and the K words with the highest TF-IDF values are selected as the category's feature words, where K may be 1, 2, 3 or any positive integer. Continuing the example above, assume K = 2, the TF-IDF value of word 1 in text data a1 is 0.48 and that of word 2 in a1 is 0.52, while the TF-IDF value of word 3 in text data a2 is 0.21 and that of word 4 in a2 is 0.79. The 2 words with the highest TF-IDF values in category A are then word 2 and word 4, which are taken as the feature words of category A; the feature words of each other category are obtained in the same way. It should be understood that the same word may have different TF-IDF values in different categories.
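A compact sketch of steps S131–S133 follows; it uses scikit-learn's TfidfVectorizer and, as an illustrative simplification, treats each category's texts as one corpus row — one of several reasonable ways to realize the per-category computation described above.

```python
# A sketch of per-category TF-IDF feature-word selection (steps S131-S133).
from sklearn.feature_extraction.text import TfidfVectorizer

K = 2
category_texts = {                # already-segmented texts, space-joined
    "A": ["word1 word2 word2", "word3 word4 word4 word4"],
    "B": ["word5 word6 word5", "word2 word6"],
}

corpus = [" ".join(texts) for texts in category_texts.values()]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)      # one row per category
vocab = vectorizer.get_feature_names_out()

feature_words = {}
for row, category in zip(tfidf.toarray(), category_texts):
    top = row.argsort()[::-1][:K]             # indices of the K highest values
    feature_words[category] = {vocab[i]: float(row[i]) for i in top}
print(feature_words)
```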
And S140, taking the categories corresponding to the text data in the candidate set as candidate categories, and determining the category of the text data to be classified according to the candidate categories and the feature words, corresponding to the candidate categories, contained in the text data to be classified.
In this embodiment, because the text data in the candidate set are text data from the training set, and every piece of training-set text data corresponds to a category, the category corresponding to each piece of text data in the candidate set can be determined; these categories are taken as the candidate categories. Since the feature words of each category in the training set were obtained in step S130, the feature words corresponding to the candidate categories and their TF-IDF values can also be determined.
In an example of the present application, as shown in fig. 4, fig. 4 shows a flowchart in an exemplary embodiment of step S140 in the embodiment shown in fig. 1, and the process of determining the category of the text data to be classified includes:
and S141, carrying out de-duplication processing on the candidate categories to obtain each target candidate category.
In this embodiment, the candidate categories are subjected to deduplication processing, that is, duplicate categories are removed to obtain each target candidate category, and feature words of each target candidate category and TF-IDF values corresponding to the feature words are determined.
In an exemplary embodiment, because each piece of text data in the candidate set corresponds to the category it belongs to, duplicate categories are removed to obtain the target candidate categories. For example, suppose the candidate set contains, in descending order of similarity to the text data to be classified, text data 1, text data 2, text data 3 and text data 4, belonging respectively to category 1, category 2, category 1 and category 3. The candidate categories are then category 1, category 2, category 1 and category 3; deduplicating them, i.e., removing the repeated category 1, yields the target candidate categories category 1, category 2 and category 3.
In another exemplary embodiment, the candidate categories may also be deduplicated according to identical or similar feature words: if the proportion of identical and/or similar feature words reaches a preset percentage of a candidate category's total feature words, the two candidate categories are treated as duplicates. Continuing the example, suppose the feature words of category 1 are "h1, h2, h3, h4", those of category 2 are "h1, h2, h4, h5", and those of category 3 are "h1, h3, h7, h8". If categories 1 and 2 have 3 identical feature words and the remaining h3 and h5 are feature words with similar meanings, categories 1 and 2 are duplicate candidate categories. Even if h3 and h5 are not similar in meaning, the identical feature words shared by categories 1 and 2 account for 75% of each category's feature words, which equals the preset percentage of 75%, indicating that they are the same category; in that case category 1 or category 2 is removed.
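Both deduplication strategies can be sketched as follows; the 75% threshold mirrors the example above, while the "similar meaning" test is reduced to exact overlap for brevity (semantic similarity detection is treated as out of scope here).

```python
# A sketch of candidate-category deduplication (step S141): drop exact
# repeats, then merge categories whose shared feature words reach a
# preset fraction of their feature-word lists.
def dedupe(candidate_categories, feature_words, threshold=0.75):
    targets = []
    for cat in dict.fromkeys(candidate_categories):   # drop exact repeats
        duplicate = False
        for kept in targets:
            a, b = set(feature_words[cat]), set(feature_words[kept])
            if len(a & b) / max(len(a), len(b)) >= threshold:
                duplicate = True          # treat as the same category
                break
        if not duplicate:
            targets.append(cat)
    return targets

fw = {"category 1": {"h1", "h2", "h3", "h4"},
      "category 2": {"h1", "h2", "h4", "h5"},   # shares 3/4 with category 1
      "category 3": {"h1", "h3", "h7", "h8"}}
print(dedupe(["category 1", "category 2", "category 1", "category 3"], fw))
# -> ['category 1', 'category 3']
```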
And S142, performing word segmentation on the text data to be classified, searching the resulting segmentation for the feature words corresponding to each target candidate category, and taking the feature words found as the target feature words.
The text data to be classified is segmented to obtain its word segmentation result, and for each target candidate category, the segmentation result is searched for the category's feature words. For example, suppose the segmentation result of the text data to be classified contains the words "w1, w2, w3, w4", the feature words of target candidate category 1 are w1 and w2, and the feature words of target candidate category 2 are w2 and w3. The text data to be classified then contains feature words of both target candidate categories 1 and 2, and w1, w2 and w3 are taken as the target feature words.
S143, calculating the target TF-IDF value of the text data to be classified under each target candidate category according to the TF-IDF values of the target feature words and the number of times each target feature word occurs in the text data to be classified.
The number of occurrences of each target feature word in the text data to be classified is obtained; continuing the example, suppose w1, w2 and w3 occur 3 times, 5 times and 2 times respectively. As noted before, the feature words corresponding to the candidate categories and their TF-IDF values have already been determined, so the TF-IDF values of the target feature words are also available: say the feature words of target candidate category 1 are w1 and w2 with TF-IDF values k1 and k2, and those of target candidate category 2 are w2 and w3 with TF-IDF values k3 and k4. The target TF-IDF value of the text data to be classified under each target candidate category is then determined from the TF-IDF values of the target feature words and their occurrence counts: under target candidate category 1 it is k1×3 + k2×5, and under target candidate category 2 it is k3×5 + k4×2.
And S144, taking the target candidate category corresponding to the highest target TF-IDF value as the category of the text data to be classified.
Continuing the example, if the target TF-IDF value k1×3 + k2×5 of the text data to be classified under target candidate category 1 is greater than the target TF-IDF value k3×5 + k4×2 under target candidate category 2, target candidate category 1 is taken as the category of the text data to be classified.
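Steps S142–S144 amount to a weighted feature-word count, sketched below; the feature-word weights and the segmented input are placeholders echoing the example above.

```python
# A sketch of scoring (steps S142-S144): score each target candidate
# category as sum(TF-IDF of feature word x its count in the text) and
# take the highest-scoring category.
from collections import Counter

feature_words = {                         # per-category {word: TF-IDF} maps
    "category 1": {"w1": 0.4, "w2": 0.3},
    "category 2": {"w2": 0.5, "w3": 0.2},
}

def classify(segmented_text):
    counts = Counter(segmented_text)
    scores = {
        cat: sum(weight * counts[word] for word, weight in words.items())
        for cat, words in feature_words.items()
    }
    return max(scores, key=scores.get)

# w2 occurs three times and w3 twice: 0.5*3 + 0.2*2 = 1.9 beats
# 0.4*1 + 0.3*3 = 1.3, so "category 2" is returned.
print(classify(["w1", "w2", "w2", "w2", "w3", "w3"]))
```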
In another example of this embodiment, the category of the text data to be classified may instead be determined by segmenting the text data to be classified, determining from the segmentation result which target candidate category has the most feature words contained in the text, and taking that category as the category of the text data to be classified; for instance, if the category whose feature words appear most is target candidate category 3, then target candidate category 3 is taken as the category. Preferably, when the feature words of the target candidate categories contained in the text data to be classified are determined from the segmentation result, the TF-IDF value of each feature word in each target candidate category is determined, the total TF-IDF value of the contained feature words under each target candidate category is calculated, and the target candidate category with the highest total TF-IDF value is taken as the category of the text data to be classified. For example, suppose the segmentation result contains the feature words "w1, w2, w3, w4"; the feature words of target candidate category 1 are w1, w2 and w4 with TF-IDF values k1, k2 and k5, and those of target candidate category 2 are w2 and w3 with TF-IDF values k3 and k4. The total TF-IDF value of the text data to be classified under target candidate category 1 is then (k1+k2+k5) and under target candidate category 2 is (k3+k4); if (k3+k4) > (k1+k2+k5), target candidate category 2 is taken as the category of the text data to be classified.
In another example of this embodiment, the category may also be determined as follows: the text data to be classified is segmented; when the segmentation result shows a single target candidate category whose feature words appear most in the text, that category is taken as the category of the text data to be classified; when the segmentation result shows at least two target candidate categories tied for the most contained feature words, the segmentation result is searched for the feature words of every target candidate category, the feature words found are taken as target feature words, the target TF-IDF value of the text data to be classified under each target candidate category is calculated from the TF-IDF values of the target feature words and their occurrence counts, and the target candidate category with the highest target TF-IDF value is taken as the category. For example, if the text data to be classified contains 4 feature words of target candidate category 1, 5 of target candidate category 2 and 3 of target candidate category 3, target candidate category 2 is taken as the category of the text data to be classified; if it contains 5 feature words of target candidate category 1, 5 of target candidate category 2 and 3 of target candidate category 3 — two categories tied at 5 feature words — the category of the text data to be classified is determined with reference to steps S143 to S144.
For ease of understanding, this embodiment further provides a text classification method in which the data are divided into a training set, a cross-validation set and a test set. The text data in the training set carry category labels indicating which category each text belongs to, and this part of the data is used for model training; the text data of the validation set also carry category labels, but this part of the data is used to test the accuracy of the model; the test set is the text data to be classified.
As shown in fig. 5, a set of domain stop words needs to be maintained for the service scenario to reduce their negative impact on model classification, so all text data are first preprocessed to filter out stop words. BERT is then used to extract semantic features from the stop-word-filtered training set and text data to be classified, converting them into vector form. The training-set text vectors are input into Faiss to build the index, with the IVF internal-clustering mode selected to establish the Faiss index model. After the Faiss index model is trained, the text data of the validation set are input into it to test the model's accuracy: the categories output by the Faiss index model are compared with the validation set's own categories to determine the accuracy, and if the accuracy exceeds a preset accuracy threshold, e.g., 90%, the Faiss index model is used for classification.
The vector data of the text data to be classified is then input into the model for similarity retrieval; the Faiss index model finds, within the training set, the N training-set text vectors with the highest similarity to the input vector, and the text data corresponding to the output text vectors are taken as the candidate set.
The TF-IDF value of each word in each category of the training set is calculated, and the top K words with the highest TF-IDF values under each category are selected as the category's feature words; finally, the most appropriate category is selected for the text data to be classified from the candidate set by combining the feature words under each category.
As shown in fig. 6, a method is provided for selecting the most suitable category for the text data to be classified from the candidate set, in combination with the feature words under each category, comprising the following steps:
At this point only the categories of the text data in the candidate set need to be considered, because what is actually calculated is the TF-IDF value of the text data to be classified under each category, as an index for selecting the best of several categories. All categories of the candidate set are obtained first and normalized, i.e., deduplicated; each category then has a batch of feature words with their TF-IDF values. As shown in fig. 6, the feature words of category 1 are "w11, w12, w13, w14 …" with TF-IDF values "k11, k12, k13, k14 …", and the feature words of category 2 are "w21, w22, w23, w24 …" with TF-IDF values "k21, k22, k23, k24 …". Different words in the same category have different TF-IDF values, and the words more representative of the category have higher values, which allows better classification.
The text data to be classified is then segmented. Under each category of the candidate set, the feature words of that category that appear in the segmentation result are found; each feature word found is counted as 1, multiplied by its weight under the category, and the products are summed to give the TF-IDF value of the text data to be classified under that category. For instance, if the three words "w11, w13, w14" of category 1 appear in the text data to be classified, its TF-IDF value under category 1 is 1×k11 + 1×k13 + 1×k14. The text data to be classified is most likely to belong to the category with the highest calculated TF-IDF value, because it contains the most feature words of that category.
In the technical scheme provided by this embodiment, on the one hand, an IVF index is constructed with Faiss over the vectorized text data, converting the text classification problem into a vector similarity retrieval problem and improving the speed of text classification; the optimal category is then selected from the candidate categories with the TF-IDF method, which starts from text similarity but increases the weight of the category feature words in the decision, so the classification strategy matches human cognitive habits. On the other hand, Spark's broadcast mechanism allows the Faiss model to be used in a distributed manner, further accelerating text classification. BERT has a strong feature extraction capability for text semantics and Faiss is highly efficient at massive-data retrieval; the technical scheme provided by this embodiment combines the two, improving both the speed and the accuracy of text classification.
Embodiments of the apparatus of the present application are described below; the apparatus may be used to perform the text classification method in the above embodiments. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the text classification method described above.
As shown in fig. 7, fig. 7 is a schematic diagram of a text classification apparatus according to an exemplary embodiment of the present application, including:
a vector conversion module 710, configured to convert text data to be classified into text vectors to be classified;
the data indexing module 720 is used for inputting the text vector to be classified into a pre-established Faiss index model for similarity retrieval, outputting the N text vectors in the training set with the highest similarity to the text vector to be classified, and taking the text data corresponding to the output text vectors as a candidate set; N is a positive integer greater than or equal to 2; the text data in the training set carry category labels, which characterize the categories to which the text data belong;
the feature obtaining module 730 is configured to obtain feature words of categories corresponding to the labels of the categories in the training set according to the TF-IDF technology;
the text classification module 740 is configured to use a category corresponding to the text data in the candidate set as a candidate category, and determine the category of the text data to be classified according to the candidate category and the feature words corresponding to the candidate category included in the text data to be classified.
For example, as shown in fig. 8, the text classification apparatus further includes a preprocessing module 750, configured to determine the stop-word lexicon corresponding to the service scene to which the text data to be classified belongs, and to preprocess the text data to be classified according to the stop-word lexicon so as to filter out its stop words.
For example, the vector conversion module 710 is specifically configured to input the preprocessed text data to be classified into the pre-trained BERT model to obtain the text vector to be classified.
Illustratively, the text classification apparatus further includes a model building module 760 and a transmission module 770, where the model building module 760 is configured to build the Faiss index model based on the text vectors corresponding to the text data in the training set and the internal clustering index, and to send the Faiss index model to the transmission module 770.
The transmission module 770 is used for transmitting the Faiss index model to each executor on the distributed computing cluster through a broadcast mechanism; the data indexing module 720 is specifically configured to perform, through the computation engine, similarity retrieval on the text vector to be classified with the Faiss index model in each executor.
For example, as shown in fig. 8, the text classification apparatus further includes a TF-IDF processing module 780, where the feature obtaining module 730 is configured to perform word segmentation processing on each category of text data in the training set to obtain a word segmentation result of each category of text data; the TF-IDF processing module 780 is used for calculating TF-IDF values of all words in the text data of each category according to the word segmentation result and the TF-IDF technology; the feature obtaining module 730 is further configured to select K words with the highest TF-IDF value in each category as feature words of the category, where K is a positive integer.
Illustratively, the text classification module 740 is configured to perform deduplication on the candidate categories to obtain the target candidate categories, to segment the text data to be classified, to search the resulting segmentation for the feature words corresponding to each target candidate category, and to take the feature words found as target feature words; the TF-IDF processing module 780 is further configured to calculate the target TF-IDF value of the text data to be classified under each target candidate category according to the TF-IDF values of the target feature words and their occurrence counts in the text data to be classified; the text classification module 740 is further configured to take the target candidate category with the highest target TF-IDF value as the category of the text data to be classified.
It should be noted that the apparatus provided in the foregoing embodiment and the method provided in the foregoing embodiment belong to the same concept, and the specific manner in which each module and unit execute operations has been described in detail in the method embodiment, and is not described again here.
In an exemplary embodiment, a computer device includes a processor and a memory, wherein the memory has stored thereon computer readable instructions which, when executed by the processor, implement the method as previously described.
FIG. 9 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment.
It should be noted that the computer device is merely an example adapted to the present application and should not be construed as limiting its scope of use in any way. Nor should the computer device be construed as needing to rely on, or having to include, one or more components of the exemplary computer device shown in fig. 9.
As shown in fig. 9, in an exemplary embodiment, the computer device includes a processing component 901, a memory 902, a power component 903, a multimedia component 904, an audio component 905, a sensor component 907, and a communication component 908. Not all of these components are necessary; the computer device may add further components or omit some of them according to its own functional requirements, which is not limited in this embodiment.
The processing component 901 generally controls overall operation of the computer device, such as operations associated with display, data communication, and log data processing. The processing component 901 may include one or more processors 909 to execute instructions to perform all or a portion of the above-described operations. Further, the processing component 901 may include one or more modules that facilitate interaction between the processing component 901 and other components. For example, the processing component 901 may include a multimedia module to facilitate interaction between the multimedia component 904 and the processing component 901.
The memory 902 is configured to store various types of data to support operation at the computer device, examples of which include instructions for any application or method operating on the computer device. The memory 902 has stored therein one or more modules configured to be executed by the one or more processors 909 to perform all or part of the steps of the methods described in the embodiments above.
The power supply component 903 provides power to the various components of the computer device. The power components 903 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for a computer device.
The multimedia component 904 includes a screen that provides an output interface between the computer device and the user. In some embodiments, the screen may include a TP (Touch Panel) and an LCD (Liquid Crystal Display). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Audio component 905 is configured to output and/or input audio signals. For example, the audio component 905 includes a microphone configured to receive external audio signals when the computer device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. In some embodiments, audio component 905 also includes a speaker for outputting audio signals.
Sensor component 907 includes one or more sensors for providing various aspects of status assessment for the computer device. For example, the sensor component 907 may detect an on/off state of the computer device and may also detect a temperature change of the computer device.
The communication component 908 is configured to facilitate wired or wireless communication between the computer device and other devices. The computer device may access a wireless network based on a communication standard, such as Wi-Fi (Wireless Fidelity).
It will be appreciated that the configuration shown in FIG. 9 is merely illustrative and that a computer device may include more or fewer components than shown in FIG. 9, or have different components than shown in FIG. 9. Each of the components shown in fig. 9 may be implemented in hardware, software, or a combination thereof.
In an exemplary embodiment, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method as described above. The computer-readable storage medium may be included in the electronic device described in the above embodiment, or may exist separately without being incorporated in the electronic device.
It should be noted that the computer readable storage medium shown in the embodiments of the present application may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
The above description is only a preferred exemplary embodiment of the present application, and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of text classification, comprising:
converting text data to be classified into text vectors to be classified;
inputting the text vector to be classified into a pre-established Faiss index model for similarity indexing, so as to output the N text vectors in a training set having the highest similarity to the text vector to be classified, and taking the text data corresponding to the output text vectors as a candidate set; N is a positive integer greater than or equal to 2; the text data in the training set are attached with category labels, and the category labels are used for representing the categories to which the text data belong;
obtaining, according to a TF-IDF technology, the feature words of the categories corresponding to the category labels in the training set;
and taking the categories corresponding to the text data in the candidate set as candidate categories, and determining the category of the text data to be classified according to the candidate categories and the feature words, corresponding to the candidate categories, contained in the text data to be classified.
2. The text classification method according to claim 1, wherein before the converting the text data to be classified into the text vector to be classified, the method further comprises:
determining a stop word library corresponding to the service scenario to which the text data to be classified belongs;
and preprocessing the text data to be classified according to the stop word library so as to filter stop words in the text data to be classified.
3. The text classification method according to claim 2, wherein the converting the text data to be classified into the text vector to be classified comprises:
and inputting the preprocessed text data to be classified into a pre-training model BERT to obtain the text vector to be classified.
4. The text classification method according to claim 1, wherein before the inputting the text vector to be classified into a pre-established Faiss index model for similarity indexing, the method further comprises:
and establishing the Faiss index model based on the text vectors corresponding to the text data in the training set and the internal clustering index.
5. The text classification method according to claim 4, wherein after the building the Faiss index model based on the text vectors corresponding to the text data in the training set and the internal clustering index, the method further comprises:
transmitting the Faiss index model to each executor on a distributed computing cluster through a broadcast mechanism;
the inputting the text vector to be classified into the Faiss index model for similarity indexing comprises:
and performing similarity indexing on the text vector to be classified by using the Faiss index model in the executor through a computing engine.
6. The text classification method according to any one of claims 1 to 5, wherein the obtaining, according to the TF-IDF technology, the feature words of the categories corresponding to the category labels in the training set comprises:
performing word segmentation processing on the text data of each category in the training set to obtain word segmentation results of the text data of each category;
calculating TF-IDF values of all words in the text data of each category according to the word segmentation result and the TF-IDF technology;
and selecting K words with the highest TF-IDF value in each category as feature words of the category, wherein K is a positive integer.
7. The text classification method according to claim 6, wherein the determining the category of the text data to be classified according to the candidate categories and the feature words corresponding to the candidate categories contained in the text data to be classified comprises:
performing deduplication processing on the candidate categories to obtain target candidate categories;
performing word segmentation processing on the text data to be classified, searching the obtained word segmentation result for the feature words corresponding to each target candidate category, and taking the feature words found as target feature words;
calculating a target TF-IDF value of the text data to be classified under each target candidate category according to the TF-IDF values of the target feature words and the number of occurrences of the target feature words in the text data to be classified;
and taking the target candidate category corresponding to the highest target TF-IDF value as the category of the text data to be classified.
8. A text classification apparatus, comprising:
the vector conversion module is used for converting the text data to be classified into text vectors to be classified;
the data indexing module is used for inputting the text vector to be classified into a pre-established Faiss index model for similarity indexing, so as to output the N text vectors in a training set having the highest similarity to the text vector to be classified, and taking the text data corresponding to the output text vectors as a candidate set; N is a positive integer greater than or equal to 2; the text data in the training set are attached with category labels, and the category labels are used for representing the categories to which the text data belong;
the characteristic acquisition module is used for acquiring the characteristic words of the categories corresponding to the various category labels in the training set according to the TF-IDF technology;
and the text classification module is used for taking the categories corresponding to the text data in the candidate set as candidate categories, and determining the category of the text data to be classified according to the candidate categories and the feature words, corresponding to the candidate categories, contained in the text data to be classified.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the method of any of claims 1-7.
10. A computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1-7.
CN202111281042.3A 2021-10-29 2021-10-29 Text classification method, device, equipment and storage medium Pending CN114003682A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111281042.3A CN114003682A (en) 2021-10-29 2021-10-29 Text classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111281042.3A CN114003682A (en) 2021-10-29 2021-10-29 Text classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114003682A true CN114003682A (en) 2022-02-01

Family

ID=79925868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111281042.3A Pending CN114003682A (en) 2021-10-29 2021-10-29 Text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114003682A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969349A (en) * 2022-07-29 2022-08-30 北京达佳互联信息技术有限公司 Text processing method and device, electronic equipment and medium
CN115033701A (en) * 2022-08-12 2022-09-09 北京百度网讯科技有限公司 Text vector generation model training method, text classification method and related device
CN117520484A (en) * 2024-01-04 2024-02-06 中国电子科技集团公司第十五研究所 Similar event retrieval method, system, equipment and medium based on big data semantics
CN117520484B (en) * 2024-01-04 2024-04-16 中国电子科技集团公司第十五研究所 Similar event retrieval method, system, equipment and medium based on big data semantics

Similar Documents

Publication Publication Date Title
CN110162593B (en) Search result processing and similarity model training method and device
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN109635273B (en) Text keyword extraction method, device, equipment and storage medium
EP3648099B1 (en) Voice recognition method, device, apparatus, and storage medium
CN112164391B (en) Statement processing method, device, electronic equipment and storage medium
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN110019732B (en) Intelligent question answering method and related device
US20150095017A1 (en) System and method for learning word embeddings using neural language models
CN111428010B (en) Man-machine intelligent question-answering method and device
CN114003682A (en) Text classification method, device, equipment and storage medium
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
US20230057010A1 (en) Term weight generation method, apparatus, device and medium
CN110414004A (en) A kind of method and system that core information extracts
CN116561538A (en) Question-answer scoring method, question-answer scoring device, electronic equipment and storage medium
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN110727769B (en) Corpus generation method and device and man-machine interaction processing method and device
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN116882372A (en) Text generation method, device, electronic equipment and storage medium
CN113821593A (en) Corpus processing method, related device and equipment
Hashemzadeh et al. Improving keyword extraction in multilingual texts.
CN112036186A (en) Corpus labeling method and device, computer storage medium and electronic equipment
CN116821307B (en) Content interaction method, device, electronic equipment and storage medium
CN113434639A (en) Audit data processing method and device
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination