WO2021051560A1

WO2021051560A1 - Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium

Info

Publication number: WO2021051560A1
Application number: PCT/CN2019/117647
Authority: WO
Inventors: 郑立颖; 徐亮; 阮晓雯
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-09-17
Filing date: 2019-11-12
Publication date: 2021-03-25
Also published as: CN110717039B; CN110717039A

Abstract

The present application relates to the technical field of artificial intelligence, and discloses a text classification method and apparatus. The method comprises: segmenting a text to be classified to obtain a segmented word set corresponding to the text; vectorizing the segmented word set according to a preset word vector dictionary to obtain a word vector set corresponding to the text, the word vector dictionary fusing a fast text vector and a word embedding vector corresponding to each segmented word; predicting a category tag for the word vector set by means of a preset tag prediction model, the tag prediction model being obtained by training on both a training set and a test set, and the test set being used to correct erroneous data in the training set; and acquiring a prediction result output by the tag prediction model, the prediction result corresponding to the text category of the text. The present application can greatly improve the accuracy of text classification.

Description

Text classification method and device, electronic equipment, computer non-volatile readable storage medium

Technical field

This application claims priority to Chinese patent application 201910877110.9, filed on September 17, 2019 and titled "Text Classification Method and Device, Electronic Equipment, Computer Readable Storage Medium", the entire contents of which are incorporated herein by reference.

This application relates to the field of artificial intelligence technology, in particular to a text classification method and device, electronic equipment, and computer non-volatile readable storage media.

Background technique

With the rapid development of network technology, the requirements for effective organization and management of electronic text information and obtaining relevant information quickly and comprehensively are getting higher and higher. As an important research direction of information processing, text classification is a common method to solve text information discovery.

The inventor realizes that text classification is a technology that automatically classifies natural sentences according to a certain classification system or standard and marks them with the corresponding categories. Text classification is roughly divided into the stages of text preprocessing, text feature extraction, and classification model construction. Because this process is complicated, some common errors can easily prevent natural sentences from being classified accurately.

Technical problem

Therefore, how to improve the accuracy of text classification is a technical problem that technicians in related fields need to study continuously.

Technical solutions

In order to solve the above technical problems, the present application provides a text classification method and device, electronic equipment, and computer non-volatile readable storage media.

Among them, the technical solution adopted in this application is:

In one aspect, a text classification method includes: obtaining a word segmentation set corresponding to the text to be classified by performing word segmentation processing on the text to be classified; performing vectorization processing on the word segmentation set according to a preset word vector dictionary to obtain the word vector set corresponding to the text to be classified, where the word vector dictionary fuses the fast text vector and the word embedding vector corresponding to each segmented word; performing category label prediction on the word vector set corresponding to the text to be classified through a preset label prediction model, where the label prediction model is obtained by joint training on a training set and a test set, and the test set is used to correct the erroneous data in the training set; and obtaining the prediction result output by the label prediction model, where the prediction result corresponds to the text category of the text to be classified.

On the other hand, a text classification device includes: a word segmentation processor configured to obtain a word segmentation set corresponding to the text to be classified by performing word segmentation processing on the text to be classified; a vectorization processor configured to perform vectorization processing on the word segmentation set according to a preset word vector dictionary to obtain the word vector set corresponding to the text to be classified, where the word vector dictionary fuses the fast text vector and the word embedding vector corresponding to each segmented word; a label predictor configured to perform category label prediction on the word vector set corresponding to the text to be classified through a preset label prediction model, where the label prediction model is obtained by joint training on a training set and a test set, and the test set is used to correct the erroneous data in the training set; and a category obtainer configured to obtain the prediction result output by the label prediction model, where the prediction result corresponds to the text category of the text to be classified.

In another aspect, an electronic device includes a processor and a memory, and computer-readable instructions are stored on the memory, and the computer-readable instructions implement the text classification method as described above when executed by the processor.

On the other hand, a computer non-volatile readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the text classification method as described above is realized.

Beneficial effect

The technical solutions provided by the embodiments of the present application may include the following beneficial effects:

In the above technical solution, after word segmentation processing is performed on the text to be classified to obtain the word segmentation set, the word segmentation set is first vectorized according to the word vector dictionary to obtain the word vector set corresponding to the text to be classified, and the category of the word vector set is then predicted through the label prediction model. Because the word vector dictionary fuses the fast text vector and the word embedding vector corresponding to each segmented word, it is fault-tolerant to unregistered words and typos in the text to be classified, which makes the vectorization of the text to be classified more accurate. In addition, because the label prediction model is jointly trained on the training set and the test set, the erroneous data in the training set can be automatically corrected based on the test set during training, unlike a traditional label prediction model that is trained on the training set only, which improves the accuracy of the trained label prediction model. Therefore, based on more accurate word vectors and a more accurate label prediction model, the accuracy of text classification can be greatly improved.

Description of the drawings

The drawings here are incorporated into the specification and constitute a part of the specification, show embodiments that conform to the application, and are used together with the specification to explain the principle of the application.

Fig. 1 is a schematic diagram showing an implementation environment involved in this application according to an exemplary embodiment;

Fig. 2 is a hardware block diagram showing a server according to an exemplary embodiment;

Fig. 3 is a flowchart showing a text classification method according to an exemplary embodiment;

Fig. 4 is a flowchart showing a method for text classification according to another exemplary embodiment;

Fig. 5 is a flowchart showing a text classification method according to another exemplary embodiment;

Fig. 6 is a flowchart of step 550 shown in Fig. 5 in an embodiment;

Fig. 7 is a block diagram of a text classification device according to an exemplary embodiment.

The above drawings show specific embodiments of the present application, which are described in more detail below. These drawings and text descriptions are not intended to limit the scope of the concept of the present application in any way, but to explain the concept of the present application to those skilled in the art by referring to specific embodiments.

Embodiments of the present invention

Here, an exemplary embodiment will be described in detail, and examples thereof are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementation manners described in the following exemplary embodiments do not represent all implementation manners consistent with the present application. On the contrary, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.

Fig. 1 is a schematic diagram showing an implementation environment involved in this application according to an exemplary embodiment. As shown in Fig. 1, the implementation environment includes a text acquisition client 100 and a text server 200.

A wired or wireless network connection is established in advance between the text acquisition client 100 and the text server 200 so that the two can interact.

The text acquisition client 100 is used to obtain text information and transmit it to the text server 200 for corresponding processing. For example, in the application scenario of a smart interview, the text acquisition client 100 is a smart interview terminal, which is used not only to display the interview questions to the interviewer but also to obtain the text information input by the interviewer. When the interviewer's input is voice, the terminal converts the input voice into input text through intelligent speech recognition.

Exemplarily, the text acquisition client 100 may be an electronic device such as a smart phone, a tablet computer, a notebook computer, a computer, and the like, and the number thereof is not limited (only two are shown in FIG. 1).

The text server 200 is configured to process the text information transmitted by the text acquisition client 100 in order to implement the functions corresponding to the text acquisition client 100. For example, in the above-mentioned smart interview scenario, the text server 200 is used to score the interview performance of the interviewer according to the text information transmitted by the text acquisition client 100, so as to realize intelligent evaluation of the interview result.

When the text server 200 performs text information processing, it is inevitably required to classify the received text information. Therefore, in the present implementation environment, the text server 200 executes the classification processing of the text to be classified.

Exemplarily, the text server 200 may be a server or a server cluster composed of several servers, which is not limited here.

Fig. 2 is a block diagram of a server according to an exemplary embodiment. The server can be specifically implemented as a text server 200 in the implementation environment shown in FIG. 1.

It should be noted that the server is only an example adapted to this application, and cannot be considered as providing any restriction on the scope of use of this application. The server also cannot be interpreted as needing to rely on or have one or more components in the exemplary server shown in FIG. 2.

The hardware structure of the server may vary greatly due to differences in configuration or performance. As shown in Fig. 2, the server includes: a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU) 270.

The power supply 210 is used to provide working voltage for each hardware device on the server. The interface 230 includes at least one wired or wireless network interface 231, at least one serial-to-parallel conversion interface 233, at least one input/output interface 235, at least one USB interface 237, etc., for communicating with external devices. As a carrier for resource storage, the memory 250 can be a read-only memory, a random access memory, a magnetic disk, an optical disc, etc. The resources stored on it include the operating system 251, application programs 253, data 255, etc., and the storage can be short-term or permanent.

The operating system 251 is used to manage and control the hardware devices and application programs 253 on the server so that the central processing unit 270 can compute and process the massive data 255; it can be Windows Server™, Mac OS X™, Unix™, Linux™, etc. The application program 253 is a computer program that completes at least one specific task based on the operating system 251. It may include at least one module (not shown in Fig. 2), and each module may include a series of computer-readable instructions for the server. The data 255 may be interface metadata stored on a disk or the like.

The central processing unit 270 may include one or more processors, and is configured to communicate with the memory 250 via a bus for computing and processing the massive data 255 in the memory 250.

As described in detail above, a server applicable to the present application will read a series of computer-readable instructions stored in the memory 250 through the central processing unit 270 to complete the text classification method described in the following embodiments.

In addition, this application can also be implemented by hardware circuits or hardware circuits in combination with software instructions. Therefore, implementation of this application is not limited to any specific hardware circuits, software, and combinations of the two.

Fig. 3 is a flowchart showing a text classification method according to an exemplary embodiment. The method is applicable to the text server 200 in the implementation environment shown in Fig. 1 to realize the classification processing of the input text. As shown in Fig. 3, the text classification method includes at least the following steps:

Step 310: Obtain a word segmentation set corresponding to the text to be classified by performing word segmentation processing on the text to be classified.

As mentioned earlier, text classification is a process of automatically classifying and marking the text to be classified according to a certain classification system, and the entire process is executed automatically by computer equipment. During automatic classification, the computer equipment cannot handle some common errors. For example, when the text to be classified contains unregistered words or typos, the computer equipment cannot accurately understand its meaning, so the classification accuracy for the text to be classified is not high.

In order to solve this problem, this embodiment provides a text classification method that is highly fault-tolerant to unregistered words and typos in the text to be classified, thereby improving the accuracy of text classification.

It should be understood that unregistered words refer to words that cannot be directly found in the trained word vector dictionary in the text to be classified. For example, "knowledge base" is a new word formed in the continuous development of computer technology, which cannot be found directly in ordinary word vector dictionaries.

The word segmentation processing of the text to be classified is implemented by a Chinese word segmentation algorithm to divide the text to be classified into a number of word segments, so as to obtain the word segmentation set corresponding to the text to be classified.

Exemplarily, the Chinese word segmentation algorithm may be a vocabulary-based algorithm, such as the forward maximum matching algorithm (FMM), the reverse maximum matching algorithm (BMM), or the bidirectional maximum matching algorithm (BM); a statistical-model-based algorithm, such as one based on the N-gram language model; or a sequence-labeling-based algorithm, such as one based on a hidden Markov model (HMM), a conditional random field (CRF), or end-to-end deep learning. The specific type of Chinese word segmentation algorithm is not limited here.

It should be noted that word segmentation processing cannot eliminate the unregistered words and typos in the text to be classified. Therefore, when the text to be classified contains unregistered words or typos, the word segmentation set corresponding to the text to be classified also contains those unregistered words or typos.
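As an illustration of the vocabulary-based family mentioned above, the following is a minimal sketch of forward maximum matching. The vocabulary, the sample sentence, and the `max_word_len` parameter are assumptions chosen for the example and do not come from this application.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Segment text greedily, always taking the longest vocabulary word from the left."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a vocabulary hit.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical vocabulary and input, for illustration only.
vocabulary = {"文本", "分类", "方法", "分类方法"}
tokens = forward_max_match("文本分类方法", vocabulary)
print(tokens)
```

Characters that match no vocabulary entry fall through as single-character tokens, which is consistent with the note that segmentation by itself does not remove unregistered words or typos.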

Step 330: Perform vectorization processing on the word segmentation set according to the preset word vector dictionary to obtain the word vector set corresponding to the text to be classified. The word vector dictionary is fused with the fast text vector and the word embedding vector corresponding to the word segmentation.

The word vector dictionary used in this embodiment is obtained through special training in advance, so that when the word segmentation set corresponding to the text to be classified is vectorized according to the word vector dictionary, the unregistered words and typos in the word segmentation set are handled with fault tolerance.

Vectorizing the word segmentation set according to the word vector dictionary means that, for each segmented word in the word segmentation set, the corresponding word vector is looked up in the word vector dictionary, and the word vectors obtained in this way form the word vector set corresponding to the text to be classified.

The fast text vector fused into the word vector dictionary is the vector obtained by vectorizing a segmented word through the skip-gram mode of the fast text model (i.e., the FastText model). In this embodiment, the sub-word length parameter of the skip-gram mode needs to be set to 1-2, so that when the fast text model vectorizes a segmented word, the word is split into sub-words of 1 or 2 characters for word vector training.

For unregistered words, because the fast text model splits them into sub-words of 1-2 characters for word vector training, the word vector corresponding to an unregistered word can be obtained accurately by splicing the vectors of its sub-words. For example, when word vector training is performed on "knowledge base", it is split into "knowledge" and "base" for training, and the word vectors obtained for the two are concatenated to obtain the word vector corresponding to "knowledge base". Therefore, the word vector corresponding to an unregistered word can be found accurately in the trained word vector dictionary, which reflects the fault tolerance for unregistered words.

For typos, after a segmented word is split, the sub-words obtained overlap with those of the correctly written word, so the correct word and the misspelled word are given similar vector expressions. The trained word vector dictionary can therefore compensate for typos.

Correspondingly, the word embedding vector is the vector obtained by vectorizing a segmented word through the word embedding model (i.e., the word2vec model).

The network structure of the word embedding model contains a hidden layer. For segmented words in text with a complex structure, the word order information between segmented words must be fully considered during vectorization training to obtain accurate word vectors. Therefore, the word embedding model can accurately obtain the word vectors of segmented words in complex sentences.

Therefore, in this embodiment, a word vector dictionary trained with both the fast text model and the word embedding model is used to vectorize the word segmentation set corresponding to the text to be classified, which fully guarantees the accuracy of the resulting word vector set.
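The lookup described above can be sketched as a plain dictionary query. The toy dictionary and its two-dimensional vectors are invented for illustration, and the zero-vector fallback for a missing token is an assumption of this sketch rather than something the application specifies.

```python
def vectorize(tokens, word_vectors, dim):
    """Look up each segmented word in the word vector dictionary."""
    # Assumption of this sketch: a token missing from the dictionary
    # is represented by a zero vector of the same dimension.
    zero = [0.0] * dim
    return [word_vectors.get(token, zero) for token in tokens]


# Hypothetical two-dimensional word vector dictionary.
word_vectors = {"text": [0.1, 0.3], "classification": [0.5, 0.7]}
vector_set = vectorize(["text", "classification", "???"], word_vectors, dim=2)
print(vector_set)
```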
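To make the 1-2 character sub-word split concrete, the sketch below enumerates the character n-grams of a word. It assumes the original Chinese term behind "knowledge base" is 知识库 and uses 知识厍 as a hypothetical misspelling; the function name is also illustrative. The real fastText library additionally wraps each word in the boundary markers `<` and `>` before extracting n-grams, which this sketch omits.

```python
def char_ngrams(word, min_n=1, max_n=2):
    """Return all character n-grams of the word with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


correct = char_ngrams("知识库")   # assumed original term for "knowledge base"
typo = char_ngrams("知识厍")      # hypothetical typo: last character miswritten
shared = set(correct) & set(typo)
print(correct)
print(sorted(shared))
```

The two spellings share most of their sub-words, which is the overlap that lets a sub-word model assign similar vectors to a word and its misspelling.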

Step 350: Perform category label prediction on the word vector set corresponding to the text to be classified using a preset label prediction model. The label prediction model is obtained by jointly training based on the training set and the test set.

Among them, the label prediction model that performs category label prediction on the word vector set corresponding to the text to be classified is also obtained through a special training method, so that the prediction model can accurately perform label prediction on the word vector set corresponding to the input text to be classified.

In ordinary label prediction model training, the training set is a data set containing a large number of training samples, which are used to train the label prediction model until a qualified model is obtained. The test set is a data set containing a large number of test samples, which are used to test the trained label prediction model and do not participate in model training.

In this embodiment, both the training set and the test set are used to train the label prediction model. Specifically, because erroneous data in the training set affects the accuracy of the trained label prediction model, the erroneous data in the training set is automatically corrected through the test set during training, and the corrected training set is then used to train the label prediction model. This greatly optimizes the training process and yields a more accurate label prediction model. Exemplarily, the erroneous data in the training set includes incorrect category labels of training samples.

It should be noted that the specific type of the label prediction model is not limited in this embodiment. When training the label prediction model, the initial label prediction model can be selected adaptively according to the specific application scenario. Exemplarily, when the amount of data to be trained is lower than a set threshold, a traditional machine learning model, such as an SVM (Support Vector Machine) model, can be selected as the initial label prediction model; if the amount of data to be trained exceeds the set threshold, a deep learning model, such as a CNN (Convolutional Neural Network) model or an LSTM (Long Short-Term Memory) model, can be selected as the initial label prediction model.
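The threshold-based choice of an initial model can be sketched as follows. The threshold value of 10,000 samples is a placeholder, since no concrete value is given here, and treating a count exactly equal to the threshold as the deep learning case is likewise an assumption of this sketch.

```python
def choose_initial_model(num_samples, threshold=10_000):
    """Pick a model family from the amount of training data."""
    # threshold is a hypothetical value; the text only refers to a "set threshold".
    if num_samples < threshold:
        return "SVM"
    # Counts at or above the threshold are treated as the deep learning case.
    return "CNN or LSTM"


print(choose_initial_model(500))
print(choose_initial_model(50_000))
```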

Step 370: Obtain a prediction result output by the label prediction model, where the prediction result corresponds to the text category corresponding to the text to be classified.

The prediction result output by the label prediction model includes several text categories to which the text to be classified may correspond, together with a probability value for each text category; the probability value indicates how likely it is that the text to be classified belongs to that text category.

Therefore, the method provided by this embodiment addresses both the problem of unregistered words and typos in the text to be classified and the problem of erroneous data in the training set making the trained label prediction model inaccurate, so that the text category corresponding to the text to be classified can be predicted accurately.
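One way to read such a prediction result is to take the category with the highest probability value. This selection rule, the category names, and the probabilities below are assumptions made for illustration.

```python
def top_category(prediction):
    """Return the (category, probability) pair with the highest probability."""
    return max(prediction.items(), key=lambda item: item[1])


# Hypothetical prediction result: candidate categories and their probabilities.
prediction = {"technical": 0.72, "behavioural": 0.21, "other": 0.07}
category, probability = top_category(prediction)
print(category, probability)
```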

Fig. 4 is a flowchart of a text classification method according to another exemplary embodiment. As shown in FIG. 4, before step 310, the text classification method further includes the following steps:

Step 410: Obtain the word segmentation lexicon of the corpus to be trained on word vectors.

The corpus word segmentation lexicon contains a large number of word segmentation sets that have been accurately segmented in advance. Word vector training is performed on each segmented word in the corpus word segmentation lexicon to obtain its word vector, and the segmented words together with their word vectors form the word vector dictionary.

It should be noted that the source of the corpus word segmentation lexicon differs between application scenarios. Exemplarily, in the aforementioned smart interview scenario, the corpus word segmentation lexicon can be obtained by performing word segmentation on interview strategies and interview questions from the Internet, or by performing word segmentation on corpus data provided directly by the interview business party.

Step 430: For each segmented word in the corpus word segmentation lexicon, perform word vector training through the skip-gram mode of the fast text model and through the word embedding model to obtain the fast text vector and the word embedding vector corresponding to the segmented word.

As mentioned earlier, when word vector training is performed on each segmented word in the corpus word segmentation lexicon through the skip-gram mode of the fast text model, the sub-word length parameter of the skip-gram mode needs to be changed from its default value of 3-6 to 1-2, so that the word vector dictionary trained in this embodiment is fault-tolerant to unregistered words and typos in the text to be classified.

It should be noted that, for a segmented word in the corpus, if word vector training with the sub-word length parameter set to 1-2 yields multiple word vectors, the word vectors of the sub-words are spliced in the order in which the segmented word was split into sub-words to obtain the word vector corresponding to the segmented word.

By using the word embedding model to train each segmented word in the corpus word segmentation lexicon, the word order information between segmented words is taken into account and an accurate word vector is obtained.

That is to say, by performing word vector training on each word segmentation in the corpus word segmentation lexicon according to the method provided in this embodiment, a corresponding fast text vector and a word embedding vector can be obtained.

Step 450: By calculating the average vector of the fast text vector and the word embedding vector corresponding to the word segmentation, the average vector is obtained as a vector expression corresponding to the word segmentation.

In order for the word vector of each segmented word in the word vector dictionary to express that segmented word accurately, the fast text vector and the word embedding vector obtained in step 430 need to be fused into the word vector.

In this embodiment, fusing the fast text vector and the word embedding vector into the word vector corresponding to a segmented word means adding the fast text vector and the word embedding vector of the segmented word and then averaging the resulting sum. The result is the vector expression corresponding to the segmented word, which is the word vector for that segmented word in the word vector dictionary.

Step 470: Obtain the vector expression corresponding to each segmented word in the corpus word segmentation lexicon to form the word vector dictionary.

Through the process described in step 430 and step 450, the vector expression corresponding to each segmented word in the corpus word segmentation lexicon can be obtained. The segmented words in the corpus word segmentation lexicon and their corresponding vector expressions then form the word vector dictionary.

As mentioned earlier, when the word segmentation set corresponding to the text to be classified is vectorized, the word vector corresponding to each segmented word in the word segmentation set can be looked up accurately in the word vector dictionary trained according to this embodiment, so that the word vector set corresponding to the text to be classified is obtained accurately.
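The add-then-average fusion can be sketched element by element. The three-dimensional vectors are made-up values, and the sketch assumes both vectors have the same dimension, which an element-wise average requires.

```python
def fuse(fast_text_vector, word_embedding_vector):
    """Average the fast text vector and the word embedding vector element-wise."""
    if len(fast_text_vector) != len(word_embedding_vector):
        raise ValueError("both vectors must have the same dimension")
    return [(a + b) / 2 for a, b in zip(fast_text_vector, word_embedding_vector)]


# Hypothetical vectors for one segmented word.
fused = fuse([0.5, 1.0, -1.0], [1.5, 0.0, 1.0])
print(fused)
```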

Fig. 5 is a flowchart of a text classification method according to another exemplary embodiment. As shown in FIG. 5, before step 310, the text classification method further includes the following steps:

Step 510: According to a set ratio, the annotated corpus to be trained for the label prediction model is divided into a training set and a test set, and the annotated corpus contains the annotated category labels.

The annotated corpus is a collection of texts annotated with category labels, and a text annotated with a category label is also called a sample.

The annotated corpus also corresponds to the corpus word segmentation lexicon obtained in step 410. Illustratively, in the application scenario described in step 410, the annotated corpus includes not only interview strategies and interview questions from the Internet but also corpus data provided directly by the interview business party; performing word segmentation on the annotated corpus yields the corresponding corpus word segmentation lexicon.

The ratio of dividing annotated corpus into training set and test set is preset. For example, the ratio of dividing into training set and test set can be 7:3, and the ratio value is not limited here. However, it should be noted that in general, the proportion of the training set should be greater than the proportion of the test set, and a training set with a larger amount of data is more helpful to obtain an accurate label prediction model.
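A 7:3 division of an annotated corpus can be sketched as below. Shuffling before the split and the fixed random seed are assumptions of this sketch (no division procedure is prescribed here), and the toy corpus is invented.

```python
import random


def split_corpus(samples, train_parts=7, test_parts=3, seed=0):
    """Divide annotated samples into a training set and a test set by ratio."""
    shuffled = list(samples)
    # Assumption: shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical annotated corpus: (text, category label) pairs.
corpus = [(f"text {i}", "label_a" if i % 2 else "label_b") for i in range(10)]
train_set, test_set = split_corpus(corpus)
print(len(train_set), len(test_set))
```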

Step 530: Perform initial training on the label prediction model to be trained based on the training set.

As mentioned earlier, in different application scenarios, the label prediction model for initial training can be specifically selected. For example, when the amount of data in the training set is lower than the set threshold, the SVM model can be used for initial training; if the amount of data in the training set exceeds the set threshold, the CNN model or the LSTM model can be used for initial training.

It should be noted that the initial training of the label prediction model on the training set yields an initial label prediction model. However, because the category labels of the training samples in the training set may contain errors, the category label predictions made by the label prediction model obtained from this initial training may be biased.

Therefore, it is necessary to automatically correct the incorrectly labeled category labels in the training set, and then iteratively train the label prediction model according to the corrected training set, so as to train to obtain a label prediction model with higher accuracy.

Step 550: Perform combined training on the label prediction model obtained in the initial training through the training set and the test set, and correct the incorrectly labeled category labels in the training set according to the prediction result output by the label prediction model.

After the initial label prediction model is obtained through initial training, combined training is performed on it through the training set and the test set. It should be understood that the combined training process refers to inputting the training set and the test set into the initial label prediction model in turn, so that the label prediction model performs label prediction on each training sample in the training set and on each test sample in the test set, and outputs the prediction results.

Since the training set and the test set are divided from the annotated corpus, each training sample and each test sample is pre-labeled with its category label. By comparing the prediction result output by the label prediction model with the pre-labeled category label of each sample, the accuracy of label prediction by the label prediction model can be obtained separately for the training set and the test set.

It should be understood that the accuracy rate corresponding to the training set refers to the ratio of the number of training samples whose prediction results output by the label prediction model are the same as the pre-labeled category labels to the total number of training samples. The accuracy rate corresponding to the test set is defined in the same way and is not repeated here.

According to the respective accuracy rates of the training set and the test set, the prediction effect of the label prediction model obtained from the initial training can be determined. For example, if the accuracy rate corresponding to the training set is higher than 90% and the accuracy rate corresponding to the test set is higher than 85%, the label prediction model obtained from the initial training has a good prediction effect; otherwise, the current label prediction model cannot achieve a good prediction effect.
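The accuracy definition and the example thresholds above can be sketched as follows. The function names and the toy label lists are hypothetical; the 90% and 85% defaults are taken from the example in the text and are not fixed by the method.

```python
def accuracy(predicted_labels, annotated_labels):
    """Share of samples whose predicted label equals the pre-labeled category label."""
    if len(predicted_labels) != len(annotated_labels):
        raise ValueError("each sample needs one prediction and one annotation")
    if not annotated_labels:
        raise ValueError("accuracy is undefined for an empty sample set")
    matches = sum(p == a for p, a in zip(predicted_labels, annotated_labels))
    return matches / len(annotated_labels)


def prediction_effect_is_good(train_acc, test_acc,
                              train_threshold=0.90, test_threshold=0.85):
    """Apply the example thresholds: both accuracy rates must exceed their threshold."""
    return train_acc > train_threshold and test_acc > test_threshold


# Toy predictions versus annotations (illustrative values only).
train_acc = accuracy(["a", "b", "a", "c"], ["a", "b", "b", "c"])  # 3 of 4 match
test_acc = accuracy(["a", "b"], ["a", "b"])                        # 2 of 2 match
needs_correction = not prediction_effect_is_good(train_acc, test_acc)
```

In this toy case the training-set accuracy is 0.75, which is below the 90% example threshold, so the model would be treated as not yet achieving a good prediction effect.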

As mentioned above, the reason for the poor performance of the label prediction model obtained in the initial training may be that there are errors in the pre-labeled category labels of the training samples in the training set. Therefore, it is necessary to correct the incorrectly labeled category labels in the training set to obtain a corrected training set.

Step 570: Update the training set according to the corrected category labels, and iteratively execute the training process of the label prediction model through the test set and the training set obtained by the update, until the label prediction model converges.

Iteratively executing the training process of the label prediction model through the test set and the updated training set means that, after the updated training set is obtained, the contents described in step 530 and step 550 are executed repeatedly. That is, the label prediction model obtained from the initial training is first retrained based on the updated training set; combined training is then performed on the retrained label prediction model according to the test set and the updated training set, and the prediction effect of the current label prediction model is judged. If the effect is not good, the incorrectly labeled category labels in the training set continue to be corrected and the label prediction model continues to be retrained until the label prediction model converges. It should be understood that convergence of the label prediction model means that the category prediction performed by the label prediction model reaches the set prediction accuracy.

Therefore, according to the method provided in this embodiment, a label prediction model with higher prediction accuracy can be trained. In actual application scenarios, this label prediction model can obtain an accurate prediction result when it predicts on the set of word vectors corresponding to the text to be classified.
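The loop formed by steps 530 to 570 can be sketched as below. This is only an illustrative skeleton under stated assumptions: the "model" is a toy majority-vote lookup standing in for the SVM, CNN, or LSTM model, `correct_fn` simply adopts the model's prediction in place of the correction procedure of step 550, and all function names and the `max_rounds` cap are hypothetical.

```python
from collections import Counter, defaultdict


def train_fn(train_set):
    """Toy stand-in for model training: map each text to its majority label."""
    votes = defaultdict(Counter)
    for text, label in train_set:
        votes[text][label] += 1
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}


def evaluate_fn(model, samples):
    """Accuracy: share of samples whose prediction equals the annotated label."""
    hits = sum(model.get(text) == label for text, label in samples)
    return hits / len(samples)


def correct_fn(model, train_set):
    """Simplified correction: replace each label with the model's prediction."""
    return [(text, model.get(text, label)) for text, label in train_set]


def iterative_training(train_set, test_set, train_target=0.90,
                       test_target=0.85, max_rounds=10):
    model = train_fn(train_set)                       # step 530: initial training
    for round_index in range(max_rounds):
        train_acc = evaluate_fn(model, train_set)     # step 550: predict on both sets
        test_acc = evaluate_fn(model, test_set)
        if train_acc > train_target and test_acc > test_target:
            return model, train_set, round_index      # set accuracy reached
        train_set = correct_fn(model, train_set)      # step 550: correct labels
        model = train_fn(train_set)                   # step 570: retrain on update
    return model, train_set, max_rounds


train_set = [("python", "tech"), ("python", "tech"), ("python", "sport"),
             ("goal", "sport"), ("goal", "sport")]
test_set = [("python", "tech"), ("goal", "sport")]
model, corrected_train_set, rounds_used = iterative_training(train_set, test_set)
```

Here the single mislabeled sample keeps the first-round training accuracy at 0.8; after one correction and retraining pass both accuracy rates reach 1.0 and the loop stops.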

FIG. 6 is a flowchart of step 550 shown in FIG. 5 in an exemplary embodiment. As shown in FIG. 6, the process of correcting the incorrectly labeled category labels in the training set according to the prediction results output by the label prediction model specifically includes the following steps:

Step 551: According to the output result of the label prediction model, respectively calculate the accuracy of label prediction for the training set and the test set by the label prediction model.

As mentioned above, the accuracy of label prediction by the label prediction model for the training set refers to the ratio of the number of training samples whose prediction results output by the label prediction model are the same as the pre-labeled category labels to the total number of training samples. Thus, by obtaining the number of training samples whose prediction results are the same as the pre-labeled category labels, and then calculating the ratio of this number to the total number of training samples contained in the training set, the corresponding accuracy rate can be obtained.

The accuracy of label prediction by the label prediction model for the test set is calculated in the same way and is not repeated here.

Step 553: When the accuracy rates corresponding to the training set and the test set are both lower than the set accuracy threshold, select from the training set the training sample set in which the prediction result is inconsistent with the labeled category label.

Among them, the accuracy thresholds set for the accuracy rates corresponding to the training set and the test set may be the same or different. Generally speaking, since the current label prediction model is obtained through initial training through the training set, the label prediction model has a higher accuracy rate for the training set prediction, so the corresponding accuracy threshold should also be larger.

The set accuracy threshold can be determined in combination with the samples labeled with category labels. Exemplarily, for the prediction results output by the current label prediction model for the training set, the probability values corresponding to all correctly predicted category labels (the probability values are output directly by the label prediction model) are collected to obtain a probability value set, and statistical analysis is performed on the probability value set. In an embodiment, the statistical analysis consists of finding the probability value at the 50% quantile of the probability value set and taking this probability value as the accuracy threshold.
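A minimal sketch of this threshold selection is given below, assuming each prediction result is available as a `(predicted label, annotated label, probability)` tuple. That tuple layout and the function name are assumptions for illustration. The 50% quantile is computed with `statistics.median`, which averages the two middle values when the number of values is even, so in that case the threshold need not be a member of the probability value set.

```python
import statistics


def threshold_from_correct_predictions(results):
    """Return the 50% quantile of the probabilities of correctly predicted samples.

    results: iterable of (predicted_label, annotated_label, probability) tuples.
    """
    correct_probabilities = [
        probability
        for predicted, annotated, probability in results
        if predicted == annotated          # keep only correctly predicted labels
    ]
    if not correct_probabilities:
        raise ValueError("no correctly predicted samples to analyze")
    return statistics.median(correct_probabilities)


# Toy prediction results for the training set (illustrative values only).
results = [("a", "a", 0.9), ("b", "b", 0.6), ("a", "b", 0.7), ("c", "c", 0.8)]
threshold = threshold_from_correct_predictions(results)
```

The misprediction with probability 0.7 is excluded, leaving 0.6, 0.8, and 0.9, whose 50% quantile is 0.8.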

Step 555: Obtain the predicted probability value corresponding to the training sample set by calculating the probability that the prediction result in the training sample set is correct and the category label is incorrectly labeled.

The predicted probability value corresponding to the training sample set indicates the probability that the category label of the corresponding training sample is incorrectly labeled. When the predicted probability value is higher than the set probability threshold, the category label of the training sample is likely to be incorrectly labeled, and the process jumps to step 557. When the predicted probability value is lower than the set probability threshold, the category label of the training sample is less likely to be incorrectly labeled, and the process jumps to step 559.

Step 557: Correct the category label of the training sample in the training sample set to correspond to the prediction result output by the label prediction model.

Step 559: Obtain manually input category labels to correct the category labels of training samples in the training sample set.

When the probability that the category label of a training sample is incorrectly labeled is small, manual experience is needed to determine whether the category label of the training sample in the training sample set is correct and to correct any training sample whose category label is wrong. By obtaining the manually input correct category label and replacing the wrong category label of the training sample with it, the category labels of the training samples in the training sample set can be corrected.

Through the method provided in this embodiment, the automatic correction of the incorrectly labeled category labels in the training samples is realized, thereby obtaining an accurate label prediction model.
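The routing between steps 557 and 559 can be sketched as follows. The sketch assumes the accuracy check of step 553 has already determined that correction is needed, and that each sample already carries the predicted probability value of step 555 in a `probability` field; the dictionary layout, the `ask_human` callback, and the 0.8 threshold are hypothetical choices for this example.

```python
def correct_training_labels(samples, probability_threshold, ask_human):
    """Return corrected copies of the training samples.

    samples: list of dicts with keys "text", "label", "predicted", "probability".
    ask_human: callable that returns the manually input category label.
    """
    corrected = []
    for sample in samples:
        updated = dict(sample)                      # leave the input untouched
        if sample["predicted"] != sample["label"]:  # step 553: inconsistent samples
            if sample["probability"] > probability_threshold:
                updated["label"] = sample["predicted"]   # step 557: adopt prediction
            else:
                updated["label"] = ask_human(sample)     # step 559: manual label
        corrected.append(updated)
    return corrected


def reviewer(sample):
    """Hypothetical reviewer who confirms the original annotation."""
    return sample["label"]


samples = [
    {"text": "t1", "label": "a", "predicted": "a", "probability": 0.95},
    {"text": "t2", "label": "a", "predicted": "b", "probability": 0.92},
    {"text": "t3", "label": "c", "predicted": "b", "probability": 0.40},
]
corrected = correct_training_labels(samples, 0.8, reviewer)
corrected_labels = [sample["label"] for sample in corrected]
```

The second sample is relabeled automatically because its predicted probability value exceeds the threshold, while the third sample is passed to the manual reviewer.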

Fig. 7 is a block diagram showing a text classification device according to an exemplary embodiment. As shown in FIG. 7, the device includes a word segmentation processor 610, a vectorization processor 630, a label predictor 650, and a category obtainer 670.

The word segmentation processor 610 is configured to obtain a word segmentation set corresponding to the text to be classified by performing word segmentation processing on the text to be classified. The vectorization processor 630 is configured to perform vectorization processing on the word segmentation set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, where the word vector dictionary fuses the fast text vector and the word embedding vector corresponding to each word segmentation. The label predictor 650 is configured to perform category label prediction on the word vector set corresponding to the text to be classified through a preset label prediction model, where the label prediction model is jointly trained based on the training set and the test set, and the test set is configured to correct erroneous data in the training set. The category obtainer 670 is configured to obtain the prediction result output by the label prediction model, the prediction result corresponding to the text category of the text to be classified.

In an exemplary embodiment, the text classification device further includes a corpus word segmentation vocabulary acquirer, a word vector trainer, a vector expression fusion device, and a word vector dictionary acquirer (not shown in FIG. 7). The corpus word segmentation vocabulary acquirer is configured to obtain the corpus word segmentation vocabulary on which word vector training is to be performed. The word vector trainer is configured to perform word vector training on each word segmentation in the corpus word segmentation vocabulary through the continuous skip-gram mode of the fast text model and through the word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the word segmentation. The vector expression fusion device is configured to calculate the average vector of the fast text vector and the word embedding vector corresponding to the word segmentation and to take the average vector as the vector expression corresponding to the word segmentation. The word vector dictionary acquirer is configured to acquire the vector expression corresponding to each word segmentation in the corpus word segmentation vocabulary to form the word vector dictionary.
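The averaging performed by the vector expression fusion device can be sketched as below. The sketch assumes the two kinds of vectors have already been trained, have the same dimension, and are held in plain dictionaries keyed by word segmentation; the function name and the toy two-dimensional vectors are hypothetical.

```python
def fuse_vectors(fast_text_vectors, word_embedding_vectors):
    """Build a word vector dictionary from the element-wise average of both vectors."""
    dictionary = {}
    for word, fast_text_vector in fast_text_vectors.items():
        # Assumption: both models were trained on the same vocabulary.
        embedding_vector = word_embedding_vectors[word]
        if len(fast_text_vector) != len(embedding_vector):
            raise ValueError(f"dimension mismatch for {word!r}")
        dictionary[word] = [
            (a + b) / 2 for a, b in zip(fast_text_vector, embedding_vector)
        ]
    return dictionary


# Toy two-dimensional vectors (illustrative values only).
fast_text_vectors = {"interview": [0.25, 0.5], "question": [1.0, 0.0]}
word_embedding_vectors = {"interview": [0.75, 0.0], "question": [0.0, 1.0]}
word_vector_dictionary = fuse_vectors(fast_text_vectors, word_embedding_vectors)
```

Each entry of the resulting dictionary is the vector expression used when the word segmentation set of a text is vectorized.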

In an exemplary embodiment, the text classification device further includes an annotated corpus allocator, a model initial trainer, a category label corrector, and a model iterative trainer. The annotated corpus allocator is configured to divide the annotated corpus used for training the label prediction model into a training set and a test set according to a set ratio, the annotated corpus containing the labeled category labels. The model initial trainer is configured to perform initial training on the label prediction model to be trained according to the training set. The category label corrector is configured to perform combined training on the label prediction model obtained from the initial training through the training set and the test set, and to correct the incorrectly labeled category labels in the training set according to the prediction results output by the label prediction model. The model iterative trainer is configured to update the training set according to the corrected category labels, and to iteratively execute the training process of the label prediction model through the test set and the updated training set until the label prediction model converges.

It should be noted that the device provided in the foregoing embodiment and the method provided in the foregoing embodiment belong to the same concept, and the specific manners for each device to perform operations have been described in detail in the method embodiment, and will not be repeated here.

In an exemplary embodiment, the present application further provides an electronic device, the electronic device including: a processor; and a memory storing computer-readable instructions, wherein when the computer-readable instructions are executed by the processor, the text classification method described above is implemented.

In an exemplary embodiment, the present application further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the text classification method as described above is realized.

It should be understood that the present application is not limited to the precise structure that has been described above and shown in the drawings, and various modifications and changes can be performed without departing from its scope. The scope of the application is only limited by the appended claims.

Claims

A text classification method, including:

Obtaining a word segmentation set corresponding to the text to be classified by performing word segmentation processing on the text to be classified;

Performing vectorization processing on the word segmentation set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, and the word vector dictionary fuses the fast text vector and the word embedding vector corresponding to the word segmentation;

Performing category label prediction on the word vector set corresponding to the text to be classified through a preset label prediction model, the label prediction model being jointly trained based on the training set and the test set, and the test set being used for correcting erroneous data in the training set;

Obtain the prediction result output by the label prediction model, the prediction result corresponding to the text category corresponding to the text to be classified.
The method according to claim 1, wherein, before the word segmentation processing is performed on the text to be classified to obtain the word segmentation set of the text to be classified, the method further comprises:

Obtain the word-segmentation lexicon of the corpus for word vector training;

For each word segmentation in the corpus word segmentation thesaurus, word vector training is performed through the continuous skip-gram mode of the fast text model and through the word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the word segmentation;

By calculating the average vector of the fast text vector and the word embedding vector corresponding to the word segmentation, obtaining the average vector as the vector expression corresponding to the word segmentation;

The vector expression corresponding to each word segment in the corpus word segmentation dictionary is obtained to form the word vector dictionary.
3. The method according to claim 2, wherein the sub-word length parameter in the continuous skip-gram mode is used to indicate that the word segmentation is split into 1 character or 2 characters for the word vector training.
The method according to claim 1, wherein, before the word segmentation processing is performed on the text to be classified to obtain the word segmentation set of the text to be classified, the method further comprises:

According to a set ratio, the labeled corpus to be trained for the label prediction model is divided into a training set and a test set, and the labeled corpus contains the labeled category labels;

Performing initial training on the label prediction model to be trained according to the training set;

Perform combined training on the label prediction model obtained from the initial training through the training set and the test set, and correct the incorrectly labeled category labels in the training set according to the prediction result output by the label prediction model;

The training set is updated according to the corrected category labels, and the training process of the label prediction model is performed iteratively through the test set and the updated training set until the label prediction model converges.
The method of claim 4, wherein the correcting the incorrectly labeled category labels in the training set according to the prediction result output by the label prediction model comprises:

According to the output result of the label prediction model, respectively calculate the accuracy of label prediction by the label prediction model for the training set and the test set;

When the accuracy rates corresponding to the training set and the test set are both lower than the set accuracy threshold, screening the training sample sets in which the predicted label result in the training set is inconsistent with the labeled category label;

Obtaining the prediction probability value corresponding to the training sample set by calculating the probability that the prediction result in the training sample set is correct and the category label is incorrectly labeled;

When the predicted probability value is lower than the set probability threshold, the manually input category label is obtained to correct the category label marked by the training sample in the training sample set.
A text classification device includes:

The word segmentation processor is configured to obtain a word segmentation set corresponding to the text to be classified by performing word segmentation processing on the text to be classified;

The vectorization processor is configured to perform vectorization processing on the word segmentation set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, the word vector dictionary fusing the fast text vector and the word embedding vector corresponding to the word segmentation;

The label predictor is configured to perform category label prediction on the set of word vectors corresponding to the text to be classified through a preset label prediction model, the label prediction model being jointly trained based on the training set and the test set, and the test set being configured to correct erroneous data in the training set;

The category obtainer is configured to obtain the prediction result output by the label prediction model, the prediction result corresponding to the text category corresponding to the text to be classified.
The device of claim 6, wherein the device further comprises:

The corpus word segmentation vocabulary acquirer is configured to obtain the corpus word segmentation vocabulary for word vector training;

The word vector trainer is configured to perform word vector training on each word segmentation in the corpus word segmentation vocabulary through the continuous skip-gram mode of the fast text model and through the word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the word segmentation;

The vector expression fusion device is configured to obtain the average vector as the vector expression corresponding to the word segmentation by calculating the average vector of the fast text vector and the word embedding vector corresponding to the word segmentation;

The word vector dictionary acquirer is configured to acquire the vector expression corresponding to each word segment in the corpus word segmentation dictionary to form the word vector dictionary.
8. The device of claim 7, wherein the sub-word length parameter in the continuous skip-gram mode is configured to indicate that the word segmentation is split into 1 character or 2 characters for the word vector training.
The device of claim 6, wherein the device further comprises:

An annotated corpus distributor, configured to divide an annotated corpus to be trained for a label prediction model into a training set and a test set according to a set ratio, the annotated corpus contains annotated category labels;

The model initial trainer is configured to perform initial training on the label prediction model to be trained according to the training set;

The category label corrector is configured to perform combined training on the label prediction model obtained from the initial training through the training set and the test set respectively, and correct incorrectly labeled category labels in the training set according to the prediction result output by the label prediction model;

The model iterative trainer is configured to update the training set according to the corrected category labels, and iteratively execute the training process of the label prediction model through the test set and the updated training set until the label prediction model converges.
10. The device of claim 9, wherein the category label corrector comprises:

An accuracy calculator configured to calculate the accuracy of label prediction performed by the label prediction model for the training set and the test set according to the output result of the label prediction model;

The sample filter is configured to filter the training sample set in which the predicted label result in the training set is inconsistent with the labeled category label when the accuracy rates corresponding to the training set and the test set are both lower than the set accuracy threshold;

A prediction probability obtainer configured to obtain a prediction probability value corresponding to the training sample set by calculating the probability that the prediction result in the training sample set is correct and the category label is incorrectly labeled;

The label modifier is configured to obtain, when the predicted probability value is lower than the set probability threshold, the manually input category label to correct the category label marked by the training sample in the training sample set.
An electronic device including:

processor;

And a memory on which computer-readable instructions are stored, and when the computer-readable instructions are executed by the processor, the processor is configured to implement the following steps:

Obtaining a word segmentation set corresponding to the text to be classified by performing word segmentation processing on the text to be classified;

Performing vectorization processing on the word segmentation set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, and the word vector dictionary fuses the fast text vector and the word embedding vector corresponding to the word segmentation;

The category label prediction is performed on the word vector set corresponding to the text to be classified through a preset label prediction model, the label prediction model is jointly trained according to the training set and the test set, and the test set is configured to correct erroneous data in the training set;

Obtain the prediction result output by the label prediction model, the prediction result corresponding to the text category corresponding to the text to be classified.
12. The electronic device according to claim 11, wherein, before the word segmentation process is performed on the text to be classified to obtain the word segmentation set of the text to be classified, the processor is configured to implement the following steps:

Obtain the word-segmentation lexicon of the corpus for word vector training;

For each word segmentation in the corpus word segmentation thesaurus, word vector training is performed through the continuous skip-gram mode of the fast text model and through the word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the word segmentation;

By calculating the average vector of the fast text vector and the word embedding vector corresponding to the word segmentation, obtaining the average vector as the vector expression corresponding to the word segmentation;

The vector expression corresponding to each word segment in the corpus word segmentation dictionary is obtained to form the word vector dictionary.
The electronic device according to claim 12, wherein the sub-word length parameter in the continuous skip-gram mode is configured to indicate that the word segmentation is split into 1 character or 2 characters for the word vector training.
14. The electronic device according to claim 11, wherein, before the word segmentation process is performed on the text to be classified to obtain the word segmentation set of the text to be classified, the processor is configured to implement the following steps:

According to a set ratio, the labeled corpus to be trained for the label prediction model is divided into a training set and a test set, and the labeled corpus contains the labeled category labels;

Performing initial training on the label prediction model to be trained according to the training set;

Perform combined training on the label prediction model obtained from the initial training through the training set and the test set, and correct the incorrectly labeled category labels in the training set according to the prediction result output by the label prediction model;

The training set is updated according to the corrected category labels, and the training process of the label prediction model is performed iteratively through the test set and the updated training set until the label prediction model converges.
The electronic device according to claim 14, wherein, to correct the incorrectly labeled category labels in the training set according to the prediction result output by the label prediction model, the processor is configured to implement the following steps:

According to the output result of the label prediction model, respectively calculate the accuracy of label prediction by the label prediction model for the training set and the test set;

When the accuracy rates corresponding to the training set and the test set are both lower than the set accuracy threshold, screening the training sample sets in which the predicted label result in the training set is inconsistent with the labeled category label;

Obtaining the prediction probability value corresponding to the training sample set by calculating the probability that the prediction result in the training sample set is correct and the category label is incorrectly labeled;

When the predicted probability value is lower than the set probability threshold, the manually input category label is obtained to correct the category label marked by the training sample in the training sample set.
A computer non-volatile readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the processor is configured to implement the following steps:

Obtaining a word segmentation set corresponding to the text to be classified by performing word segmentation processing on the text to be classified;

Performing vectorization processing on the word segmentation set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, and the word vector dictionary fuses the fast text vector and the word embedding vector corresponding to the word segmentation;

The category label prediction is performed on the word vector set corresponding to the text to be classified through a preset label prediction model, the label prediction model is jointly trained according to the training set and the test set, and the test set is configured to correct erroneous data in the training set;

Obtain the prediction result output by the label prediction model, the prediction result corresponding to the text category corresponding to the text to be classified.
The computer non-volatile readable storage medium according to claim 16, wherein, before the word segmentation processing is performed on the text to be classified to obtain the word segmentation set of the text to be classified, the processor is configured to implement the following steps:

Obtain the word-segmentation lexicon of the corpus for word vector training;

For each word segmentation in the corpus word segmentation thesaurus, word vector training is performed through the continuous skip-gram mode of the fast text model and through the word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the word segmentation;

By calculating the average vector of the fast text vector and the word embedding vector corresponding to the word segmentation, obtaining the average vector as the vector expression corresponding to the word segmentation;

The vector expression corresponding to each word segment in the corpus word segmentation dictionary is obtained to form the word vector dictionary.
The computer non-volatile readable storage medium according to claim 17, wherein the sub-word length parameter in the continuous skip-gram mode is configured to indicate that the word segmentation is split into 1 character or 2 characters for the word vector training.
The computer non-volatile readable storage medium according to claim 16, wherein, before the word segmentation processing is performed on the text to be classified to obtain the word segmentation set of the text to be classified, the processor is configured to implement the following steps:

According to a set ratio, the labeled corpus to be trained for the label prediction model is divided into a training set and a test set, and the labeled corpus contains the labeled category labels;

Performing initial training on the label prediction model to be trained according to the training set;

Perform combined training on the label prediction model obtained from the initial training through the training set and the test set, and correct the incorrectly labeled category labels in the training set according to the prediction result output by the label prediction model;

The training set is updated according to the corrected category labels, and the training process of the label prediction model is performed iteratively through the test set and the updated training set until the label prediction model converges.
The computer non-volatile readable storage medium according to claim 19, wherein, to correct the incorrectly labeled category labels in the training set according to the prediction result output by the label prediction model, the processor is configured to implement the following steps:

According to the output result of the label prediction model, respectively calculate the accuracy of label prediction by the label prediction model for the training set and the test set;

When the accuracy rates corresponding to the training set and the test set are both lower than the set accuracy threshold, screening the training sample sets in which the predicted label result in the training set is inconsistent with the labeled category label;

Obtaining the prediction probability value corresponding to the training sample set by calculating the probability that the prediction result in the training sample set is correct and the category label is incorrectly labeled;

When the predicted probability value is lower than the set probability threshold, the manually input category label is obtained to correct the category label marked by the training sample in the training sample set.