WO2021057133A1 - Method for training document classification model, and related apparatus - Google Patents

Method for training document classification model, and related apparatus

Info

Publication number
WO2021057133A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
training
classification
documents
word
Prior art date
Application number
PCT/CN2020/097869
Other languages
French (fr)
Chinese (zh)
Inventor
任卓
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司 filed Critical 北京国双科技有限公司
Publication of WO2021057133A1 publication Critical patent/WO2021057133A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Definitions

  • This application relates to the field of data processing technology, and in particular to a method and related device for training a document classification model.
  • Domain document classification usually relies on classic methods such as the bag-of-words model and TF-IDF weight calculation to extract the classification features of domain documents.
  • These classification features focus on the word frequency of the words in a domain document and ignore the order and context of the words, so they lack generality; a document classification model trained on them is prone to over-fitting, and the trained model has a poor practical classification effect on domain documents that are not marked with a classification label. That is to say, the current method of extracting classification features from domain documents leads to low classification accuracy in the trained document classification model.
  • In view of this, the present application provides a method and related apparatus for training a document classification model, so that the trained document classification model has a better practical classification effect on documents without classification labels, thereby improving the classification accuracy of the document classification model.
  • An embodiment of the present application provides a method for training a document classification model, which includes:
  • learning the feature vectors of multiple training documents by using an unsupervised learning algorithm, based on the context of the words in the multiple training documents, the vectors of the words, and the identifiers of the multiple training documents; and
  • training a document classification model by using a binary classification algorithm based on the feature vectors and classification labels of the multiple training documents, where each classification label is a target category label or a non-target category label.
  • Optionally, learning the feature vectors of the multiple training documents based on the context of the words in the multiple training documents, the vectors of the words, and the identifiers of the multiple training documents by using an unsupervised learning algorithm includes:
  • learning the feature vector of each of the words in each of the training documents by using an unsupervised learning algorithm, based on the context of each of the words in each of the training documents, the vector of each of the words, and the identifier of that training document; and
  • fusing the feature vectors of the words in each of the training documents to obtain the feature vector of each of the training documents.
  • Optionally, learning the feature vector of each of the words in each of the training documents by using an unsupervised learning algorithm includes:
  • for each training document, taking the context of each of the words and the identifier of the training document as input and the vector of each of the words as output, and learning the feature vector of each of the words in that training document by using an unsupervised learning algorithm.
  • Optionally, taking the context of each of the words and the identifier of the training document as input and the vector of each of the words as output, and learning the feature vector of each of the words by using an unsupervised learning algorithm includes: training an initial neural network model, and obtaining the feature vector of each of the words from the model parameters of the trained model.
  • Optionally, the dimension of the feature vector is smaller than the total number of the multiple training documents.
  • Optionally, the method further includes:
  • taking multiple documents that are not marked with the classification labels as unlabeled documents, and predicting the predicted classification labels and predicted classification probabilities of the multiple unlabeled documents based on the unlabeled documents and a preset classification model, where the preset classification model has a function of predicting classification probability, and each predicted classification label is the target category label or the non-target category label;
  • screening the unlabeled documents whose predicted classification probability is higher than a preset probability threshold to obtain multiple new training documents, and retraining the document classification model by using the binary classification algorithm based on the feature vectors and predicted classification labels of the new training documents.
  • Optionally, it also includes: adjusting the proportion of target category labels and non-target category labels in the multiple training documents or the new training documents based on a preset ratio of positive and negative samples.
  • An embodiment of the present application provides a method for document classification using a document classification model trained by the method for training a document classification model according to any one of the above-mentioned first aspects, and the method includes:
  • learning the feature vector of the document to be classified by using an unsupervised learning algorithm, based on the context of the words in the document to be classified, the vectors of the words, and the identifier of the document to be classified; and
  • inputting the feature vector of the document to be classified into the document classification model for document classification.
  • An embodiment of the present application provides a device for training a document classification model, and the device includes:
  • the first learning and obtaining unit, configured to learn the feature vectors of the multiple training documents by using an unsupervised learning algorithm based on the context of the words in the multiple training documents, the vectors of the words, and the identifiers of the multiple training documents; and
  • the training and obtaining unit, configured to train a document classification model by using a binary classification algorithm based on the feature vectors and classification labels of the multiple training documents, where each classification label is a target category label or a non-target category label.
  • An embodiment of the present application further provides an apparatus for document classification using a document classification model trained by the method for training a document classification model according to any one of the above-mentioned first aspects, and the apparatus includes:
  • the second learning and obtaining unit is configured to use an unsupervised learning algorithm to learn and obtain the feature vector of the document to be classified based on the context of the word in the document to be classified, the vector of the word, and the identification of the document to be classified;
  • the document classification unit is used to input the feature vector of the document to be classified into the document classification model for document classification.
  • In summary, the feature vectors of the multiple training documents are obtained by an unsupervised learning algorithm based on the context of the words in the training documents, the vectors of the words, and the identifiers of the documents;
  • then, based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm is used to train a document classification model, where each classification label is a target category label or a non-target category label.
  • Because the feature vector of each training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document,
  • the feature vectors of the training documents generalize better; the trained document classification model therefore has a better practical classification effect on documents without classification labels, thereby improving the classification accuracy of the document classification model.
  • FIG. 1 shows a schematic diagram of a system framework involved in an application scenario in an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of a method for training a document classification model provided by an embodiment of the present application
  • FIG. 3 shows a schematic flowchart of another method for training a document classification model provided by an embodiment of the present application
  • FIG. 4 shows a schematic flowchart of a method for document classification according to an embodiment of the present application
  • FIG. 5 shows a schematic structural diagram of a device for training a document classification model provided by an embodiment of the present application
  • FIG. 6 shows a schematic structural diagram of a document classification device provided by an embodiment of the present application.
  • The classification of documents has great application value in document query, document cluster management, and document recommendation. Determining a document's classification is an upstream task in these areas: it provides data support for downstream document processing tasks, and inaccurate classification degrades the subsequent document processing.
  • In the embodiments of the present application, an unsupervised learning algorithm first obtains the feature vectors of the multiple training documents; then, based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm trains a document classification model, where each classification label is a target category label or a non-target category label. The context of the words in a training document and the identifier of the document serve as input, and the vector of each word as output.
  • Because the feature vector of each training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document,
  • the feature vectors generalize better; the trained document classification model therefore classifies unlabeled documents more effectively, thereby improving the classification accuracy of the document classification model.
  • For example, after a user reads a certain document, other documents of the same category can be recommended to the user.
  • The solution of the present application can accurately classify the massive number of documents on the Internet in advance, find among them target documents in the same category as the document the user has read, and recommend those target documents to the user, so that the pushed documents match the user's preferences more accurately.
  • The present application can also cluster and manage papers on a paper website. Its application fields are numerous; while accurately classifying documents in these fields, it also avoids the manpower and material cost of manually extracting keywords.
  • The scenario includes a terminal 101 and a processor 102, where the terminal 101 may be a PC or a mobile terminal such as a mobile phone or a tablet.
  • The user selects multiple training documents through the terminal 101 and sends them to the processor 102; the processor 102 obtains the feature vectors of the multiple training documents by using the first step of the embodiment of the present application, and then obtains the document classification model by using the second step of the embodiment of the present application.
  • FIG. 2 shows a schematic flowchart of a method for training a document classification model in an embodiment of the present application.
  • the method may include the following steps, for example:
  • Step 201: Based on the context of the words in the multiple training documents, the vectors of the words, and the identifiers of the multiple training documents, use an unsupervised learning algorithm to learn the feature vectors of the multiple training documents.
  • In the embodiments of the present application, a document refers to a domain document, especially an oil and gas domain document;
  • correspondingly, a training document refers to a training domain document, especially a training document in the oil and gas domain.
  • When classic methods such as the bag-of-words model and TF-IDF weight calculation are used to extract classification features from domain documents, only the word frequency of the words is considered, while the order and context of the words in the domain documents are ignored, so the extracted
  • classification features lack generality. A document classification model trained with such features therefore overfits easily, and its classification effect is poor when it is actually used to predict and classify domain documents without classification labels.
  • In this embodiment, the unsupervised learning algorithm learns based on the context of the words in the training document, the word vectors, and the identifier of the training document.
  • What it directly obtains is the feature vector of each word in the training document;
  • these word feature vectors then need to be fused to obtain the feature vector of the training document. Therefore, in an optional implementation manner of this embodiment, step 201 may include the following steps, for example:
  • Step A: Based on the context of each word in each training document, the vector of each word, and the identifier of the corresponding training document, use an unsupervised learning algorithm to learn the feature vector of each word in each training document.
  • When step A is implemented, first the words in each training document must be obtained, so that the context of each word and the vector of each word are defined; the words are obtained by performing word segmentation on the training document. Then, because unsupervised learning here actually learns the context of each word and the correlation between contexts within the same training document, the context of each word in the training document and the identifier of the training document are taken as input and the vector of each word as output, and unsupervised learning yields the feature vector of each word in the training document. Therefore, in an optional implementation manner of this embodiment, step A may include the following steps, for example:
  • Step A1: Use a word segmentation tool to segment each training document to obtain the words in each training document.
  • When step A1 is implemented for training domain documents, a domain-specific professional dictionary is usually introduced into the word segmentation tool first; for example, training documents in the oil and gas domain are segmented with a word segmentation tool combined with an oil and gas professional dictionary.
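  • As an illustrative sketch only (the patent does not prescribe a particular segmentation algorithm), a minimal forward-maximum-matching segmenter shows how a domain dictionary steers word segmentation; real systems would use a mature tool such as jieba with a loaded user dictionary.

```python
def segment(text, dictionary, max_len=5):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary entry that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for size in range(min(max_len, len(text) - i), 1, -1):
            cand = text[i:i + size]
            if cand in dictionary:
                match = cand
                break
        words.append(match)
        i += len(match)
    return words
```

A richer domain dictionary produces longer, more specific matches, which is why an oil and gas professional dictionary improves segmentation of oil and gas documents.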
  • Step A2: For each training document, take the context of each word and the identifier of the training document as input and the vector of each word as output, and use an unsupervised learning algorithm to learn the feature vector of each word in that training document.
  • When step A2 is implemented, an initial neural network model is actually set in advance, for example a single-hidden-layer initial neural network model, and the initial neural network model includes initialized model parameters.
  • Training the initial neural network model with an unsupervised learning algorithm actually trains these initialized model parameters; after training on each word, the feature vector of each word can be obtained from the model parameters of the trained initial neural network model. Therefore, in an optional implementation manner of this embodiment, step A2 may include, for example:
  • Step A21: For each training document, take the context of each word and the identifier of the training document as input and the vector of each word as output, and train an initial neural network model by using an unsupervised learning algorithm;
  • Step A22: Obtain the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
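  • The input/output arrangement of steps A21 and A22 can be sketched as follows. This pure-Python fragment, an illustrative assumption rather than the patent's own code, builds the (context words + document identifier) → target-word pairs that a PV-DM-style model such as doc2vec consumes; `window` is a hypothetical hyperparameter.

```python
def make_training_pairs(doc_id, tokens, window=2):
    """For each position, pair the surrounding context words plus the
    document identifier (input) with the word at that position (output)."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append({"doc_id": doc_id, "context": left + right, "target": target})
    return pairs
```

Because every pair carries the document identifier alongside the word context, the network learns a vector per document as well as per word.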
  • Step B: Fuse the feature vectors of the words in each training document to obtain the feature vector of each training document.
  • When step B is implemented, a vector fusion formula can be preset as the preset vector fusion formula; the feature vectors of the words in each training document are substituted into the preset vector fusion formula, and the fused feature vector serves as the feature vector of that training document.
  • Other specific implementation manners may also be used to perform step B, as long as the feature vectors of the words are fused.
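  • One simple choice of preset vector fusion formula (an assumption for illustration; the patent leaves the formula open) is the element-wise average of the word feature vectors:

```python
def fuse_by_average(word_vectors):
    """Element-wise mean of the word feature vectors of one document."""
    dim = len(word_vectors[0])
    return [sum(v[j] for v in word_vectors) / len(word_vectors) for j in range(dim)]
```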
  • The dimension of the feature vector can be set in advance. Since it should not be too large, it can be set based on the total number of the multiple training documents; generally, it is set smaller than that total. In particular, when the dimension of the feature vector is much smaller than the total number of training documents, that is, when the difference between the total number of training documents and the dimension of the feature vector exceeds a preset difference, the dimension is greatly reduced and the feature vector of a document is obtained more efficiently. Therefore, in an optional implementation manner of this embodiment, the dimension of the feature vector is smaller than the total number of the multiple training documents.
  • The unsupervised learning algorithm in step 201 may be a common doc2vec algorithm or word2vec algorithm, where the doc2vec algorithm is an extension of the word2vec algorithm.
  • Step 202 Based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm is used to train to obtain a document classification model, where the classification labels are target category labels or non-target category labels.
  • When step 202 is implemented, the feature vectors of the training documents obtained in step 201 are used to train the document classification model. Each training document must be marked with a target category label or a non-target category label as its classification label, and a binary classification algorithm is applied to the feature vectors and classification labels of the multiple training documents
  • to train a binary classification model, which serves as the document classification model.
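  • As a hedged sketch of such a binary classifier (logistic regression, which the description later names in connection with step 302, is one algorithm with a probability output; the toy vectors and hyperparameters below are assumptions), a minimal pure-Python implementation might look like:

```python
import math

def train_logistic(features, labels, lr=0.5, epochs=200):
    """Gradient descent for logistic regression.
    labels: 1 = target category label, 0 = non-target category label."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                w[j] -= lr * (p - y) * x[j]
            b -= lr * (p - y)
    return w, b

def predict_proba(w, b, x):
    """Predicted probability that x belongs to the target category."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy 2-D vectors standing in for fused document feature vectors.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

The probability output is what later makes the self-learning screening of steps 303 to 305 possible.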
  • the target category label may be, for example, various category labels under a professional label system such as "exploration” category label, "development” category label, "drilling” category label, "logging” category label, or "construction” category label.
  • Training documents marked with a target category label are taken as positive samples, and training documents marked with a non-target category label are taken as negative samples. Because the documents marked with the target category label may account for only a small proportion of the multiple training documents while documents with non-target category labels account for a large proportion, the positive and negative samples can be imbalanced, which affects the training effect of the document classification model to a certain extent. Therefore, in this embodiment of the application, a reasonable ratio of positive and negative samples can be preset, and the proportion of target category labels to non-target category labels in the multiple training documents adjusted to match the preset ratio.
  • Optionally, the method may further include the step of adjusting the proportion of target category labels and non-target category labels in the multiple training documents, or in the new training documents, based on the preset ratio of positive and negative samples.
  • Specifically, the ratio of target category labels to non-target category labels in the multiple training documents can be adjusted based on an under-sampling method or an over-sampling method, so as to meet the preset ratio of positive and negative samples.
  • The under-sampling method samples the negative class in the multiple training documents to reduce its size, that is, it samples and reduces the number of training documents marked with non-target category labels;
  • the over-sampling method repeats positive samples in the multiple training documents to increase their number, that is, it duplicates training documents marked with target category labels.
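  • A minimal sketch of this adjustment, assuming a hypothetical `target_ratio` parameter expressing the preset positive-to-negative sample ratio:

```python
import random

def rebalance(positives, negatives, target_ratio=1.0, seed=0):
    """Meet a preset positive:negative ratio by under-sampling negatives
    when they are too many, or over-sampling (duplicating) positives."""
    rng = random.Random(seed)
    wanted_neg = int(len(positives) / target_ratio)
    if len(negatives) > wanted_neg:
        negatives = rng.sample(negatives, wanted_neg)        # under-sampling
    else:
        wanted_pos = int(len(negatives) * target_ratio)
        extra = [rng.choice(positives) for _ in range(wanted_pos - len(positives))]
        positives = positives + extra                        # over-sampling
    return positives, negatives
```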
  • In summary, the feature vectors of the multiple training documents are first obtained by an unsupervised learning algorithm;
  • then, a binary classification algorithm is applied to the feature vectors and classification labels of the training documents to train a document classification model, where each classification label is a target category label or a non-target category label.
  • The context of the words in a training document and the identifier of the document serve as input, and the vector of each word as output.
  • Because the feature vector of each training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document,
  • the feature vectors generalize better; the trained document classification model therefore classifies unlabeled documents more effectively, thereby improving the classification accuracy of the document classification model.
  • In this embodiment, a classification model with the function of predicting classification probability, whose predicted label is the target category label or the non-target category label, is obtained in advance as the preset classification model; applying the preset classification model to a batch of unlabeled documents yields the predicted classification labels and predicted classification probabilities of those documents.
  • A probability threshold is also set in advance as the preset probability threshold for evaluating the reliability of the predicted classification labels. Based on the preset probability threshold and the predicted classification probabilities, the documents whose predicted labels are highly reliable are filtered from the batch of unlabeled documents and used as new training documents for the next round of document classification model training; retraining then implements the re-learning of the document classification model in the above embodiment. The specific implementation of this second method for training a document classification model is described in detail below with reference to FIG. 3.
  • FIG. 3 shows a schematic flowchart of another method for training a document classification model in an embodiment of the present application.
  • the method may include the following steps, for example:
  • Step 301 Based on the context of the words in the multiple training documents, the vectors of the words and the identifications of the multiple training documents, use an unsupervised learning algorithm to learn to obtain the feature vectors of the multiple training documents.
  • Step 302 Based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm is used to train to obtain a document classification model, where the classification labels are target category labels or non-target category labels.
  • Step 301 to step 302 are the same as step 201 to step 202; for their specific implementation, refer to the relevant description in the foregoing embodiment, which is not repeated here.
  • Step 303: Take multiple documents that are not marked with the classification labels as unlabeled documents, and predict the predicted classification labels and predicted classification probabilities of the multiple unlabeled documents based on the unlabeled documents and a preset classification model, where the preset classification model has the function of predicting classification probability, and each predicted classification label is the target category label or the non-target category label.
  • The preset classification model may be the document classification model obtained in step 302; in that case, the binary classification algorithm used in obtaining the document classification model in step 302 must itself have the function of predicting classification probability, for example the logistic regression algorithm. Continuing with the subsequent steps 304 to 305 then realizes the self-learning of the document classification model.
  • The preset classification model is not limited to the document classification model obtained in step 302; it can also be another document classification model, as long as the predicted classification labels it produces are target category labels or non-target category labels and it has the function of predicting classification probability.
  • Step 304: Screen the multiple unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain multiple new training documents.
  • When step 304 is implemented, after the unlabeled documents whose predicted classification probability is higher than or equal to the preset probability threshold are selected from the multiple unlabeled documents, expert review can additionally be included to confirm whether each predicted classification label matches the actual classification label, so as to improve the reliability of the predicted classification labels of the new training documents.
  • A new training document labeled with the target category label is used as a positive sample,
  • and a new training document labeled with a non-target category label is used as a negative sample.
  • Among the multiple new training documents there may also be an imbalance between positive and negative samples,
  • which likewise needs to be adjusted based on the preset ratio of positive and negative samples. Therefore, in an optional implementation manner of this embodiment, the method may further include the step of adjusting the proportion of target category labels and non-target category labels in the multiple new training documents based on a preset ratio of positive and negative samples.
  • This resolves the imbalance of positive and negative samples in the multiple new training documents and allows the positive samples to be trained more fully during document classification model training, thereby improving the classification accuracy of the trained model.
  • Step 305 Based on the feature vectors and predicted classification labels of the multiple new training documents, use a binary classification algorithm to retrain the document classification model.
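  • Steps 303 to 305 amount to a self-training loop. The screening step can be sketched as follows, where `predict_proba` stands in for the preset classification model's probability output and the two-sided use of the threshold (keeping confident negatives as well as confident positives) is an illustrative assumption:

```python
def select_new_training_docs(unlabeled, predict_proba, threshold=0.9):
    """Keep unlabeled documents whose predicted label is reliable enough.
    predict_proba(doc) -> probability that doc belongs to the target category."""
    new_docs = []
    for doc in unlabeled:
        p = predict_proba(doc)
        if p >= threshold:
            new_docs.append((doc, "target"))        # confident positive
        elif p <= 1.0 - threshold:
            new_docs.append((doc, "non-target"))    # confident negative
    return new_docs
```

The selected documents, optionally confirmed by expert review, become the new training set on which the binary classifier is retrained.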
  • In summary, the feature vectors of the multiple training documents are first obtained by an unsupervised learning algorithm;
  • a binary classification algorithm is then applied to the feature vectors and classification labels of the documents to train a document classification model;
  • next, multiple documents without classification labels are treated as unlabeled documents, and their predicted classification labels and predicted classification probabilities are obtained based on the unlabeled documents and the preset classification model; the unlabeled documents whose predicted classification probability is higher than the preset probability threshold are screened to obtain multiple new training documents; and, based on the feature vectors and predicted classification labels of the multiple new training documents, the binary classification algorithm iteratively retrains the document classification model.
  • The context of the words in a training document and the identifier of the document serve as input, and the vector of each word as output.
  • Because the feature vector of each training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document,
  • the feature vectors generalize better and the trained document classification model has a better practical classification effect on unlabeled documents. In addition, the model re-learning scheme predicts unlabeled documents with the preset classification model, screens the unlabeled documents whose predicted classification labels are highly reliable, and expands them into new training documents to train the document classification model again, thereby improving the classification accuracy of the document classification model.
  • Step 401: Based on the context of the words in the document to be classified, the vectors of the words, and the identifier of the document to be classified, use an unsupervised learning algorithm to learn the feature vector of the document to be classified.
  • Step 402: Input the feature vector of the document to be classified into the document classification model for document classification.
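  • Steps 401 and 402 can be sketched together in pure Python. Averaging pre-learned word vectors stands in for the unsupervised inference of step 401 (a simplification of doc2vec-style inference), and the weights `w`, `b` stand for a trained binary classifier; all names are illustrative assumptions.

```python
import math

def classify_document(tokens, word_vectors, w, b, threshold=0.5):
    """Step 401: fuse the (pre-learned) word vectors of the document into a
    document feature vector; step 402: feed it to the binary classifier."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    feature = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    z = sum(wi * xi for wi, xi in zip(feature, w)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return ("target" if p >= threshold else "non-target"), p
```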
  • In summary, an unsupervised learning algorithm is used to obtain the feature vector of the document to be classified, and this feature
  • vector is input into the document classification model for document classification. The context of the words in the document to be classified and the identifier of the document serve as input, and the vector of each word as output.
  • Because the feature vector of the document to be classified is extracted by the unsupervised algorithm, taking into account the context of each word and the correlation between contexts within the same
  • document, the feature vector generalizes better; the document classification model therefore classifies the unlabeled document to be classified with higher accuracy, and the practical classification effect is better.
  • the device may specifically include, for example:
  • the first learning and obtaining unit 501 is configured to learn the feature vectors of the multiple training documents using an unsupervised learning algorithm, based on the context of the words in the multiple training documents, the vectors of those words, and the identifiers of the multiple training documents;
  • the training and obtaining unit 502 is configured to train a document classification model using a binary classification algorithm, based on the feature vectors and classification labels of the multiple training documents, where a classification label is either a target category label or a non-target category label.
  • the first learning obtaining unit 501 includes:
  • the learning and obtaining subunit is configured to learn the feature vector of each word in each training document using an unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
  • the fusion obtaining subunit is configured to fuse the feature vectors of the words in each training document to obtain the feature vector of that training document.
  • the learning acquisition subunit includes:
  • the word segmentation obtaining module is configured to segment each training document with a word segmentation tool to obtain the words in each training document;
  • the learning acquisition module is configured, for each training document, to take the context of each word and the identifier of the training document as input and the vector of each word as output, and to learn the feature vector of each word in the training document using an unsupervised learning algorithm.
  • the learning acquisition module includes:
  • the training sub-module is configured, for each training document, to take the context of each word and the identifier of the training document as input and the vector of each word as output, and to train an initial neural network model using an unsupervised learning algorithm;
  • the obtaining sub-module is configured to obtain the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
  • the dimension of the feature vector is smaller than the total number of training documents.
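The training and obtaining sub-modules above describe a paragraph-vector-style setup: context words plus a document identifier as input, the current word as output, with the document's feature vector read off the trained model parameters. A minimal sketch of that idea (a simplified PV-DM-like pass with a full softmax; all hyperparameters and details are assumptions, not the patent's specification) could look like:

```python
import numpy as np

def train_pv_dm(docs, dim=8, window=2, lr=0.05, epochs=40, seed=0):
    """Predict each word from the average of its context-word vectors and the
    document's own vector (the document id acts like an extra shared 'word')."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    W = rng.normal(0.0, 0.1, (len(vocab), dim))  # input word vectors
    D = rng.normal(0.0, 0.1, (len(docs), dim))   # one vector per document id
    O = rng.normal(0.0, 0.1, (dim, len(vocab)))  # softmax output weights
    for _ in range(epochs):
        for di, doc in enumerate(docs):
            for pos, word in enumerate(doc):
                ctx = [w2i[doc[j]]
                       for j in range(max(0, pos - window),
                                      min(len(doc), pos + window + 1))
                       if j != pos]
                if not ctx:
                    continue
                # Hidden layer: average of context-word vectors and doc vector.
                h = (W[ctx].mean(axis=0) + D[di]) / 2.0
                z = h @ O
                p = np.exp(z - z.max())
                p /= p.sum()                     # softmax over the vocabulary
                p[w2i[word]] -= 1.0              # gradient of cross-entropy
                gh = O @ p                       # gradient w.r.t. hidden layer
                O -= lr * np.outer(h, p)
                D[di] -= lr * gh / 2.0
                for c in ctx:
                    W[c] -= lr * gh / (2.0 * len(ctx))
    return D, W, w2i

# Five tiny, already word-segmented training documents (made-up content).
docs = [
    ["well", "logging", "data"],
    ["well", "core", "data"],
    ["drilling", "daily", "report"],
    ["drilling", "mud", "report"],
    ["seismic", "survey", "summary"],
]
D, W, w2i = train_pv_dm(docs, dim=4)  # dim kept below the number of documents
```

Each row of `D` is a document feature vector read directly from the trained model parameters; the choice `dim=4` follows the note that the feature-vector dimension stays below the total number of training documents.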
  • the prediction obtaining unit is configured to take multiple documents that are not marked with a classification label as unlabeled documents and, based on the unlabeled documents and a preset classification model, predict the predicted classification labels and predicted classification probabilities of the unlabeled documents; the preset classification model is capable of predicting classification probabilities, and a predicted classification label is either the target category label or the non-target category label;
  • the screening and obtaining unit is configured to screen out the unlabeled documents whose predicted classification probability is higher than a preset probability threshold as multiple new training documents;
  • the iterative training unit is configured to train the document classification model again using a binary classification algorithm, based on the feature vectors and predicted classification labels of the multiple new training documents;
  • the adjustment unit is configured to adjust the ratio of target category labels to non-target category labels among the multiple training documents or the new training documents, based on a preset ratio of positive to negative samples.
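The prediction, screening, and iterative training units together form one self-training round. A sketch, with `fit` and `predict_proba` as assumed stand-ins for any probability-producing binary classifier (the patent does not fix a particular one):

```python
import numpy as np

def self_train_round(fit, predict_proba, X_train, y_train, X_unlabeled, threshold=0.9):
    # Predict the unlabeled pool with the preset classifier.
    probs = predict_proba(X_unlabeled)                 # P(target) per document
    labels = (probs >= 0.5).astype(int)
    # Keep only documents whose predicted label is highly reliable.
    confident = np.maximum(probs, 1.0 - probs) >= threshold
    X_new = np.vstack([X_train, X_unlabeled[confident]])
    y_new = np.concatenate([y_train, labels[confident]])
    fit(X_new, y_new)                                  # train the model again
    return X_new, y_new

# Toy stand-ins: a fixed sigmoid "preset model" and a fit() that just records.
def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))

fitted = {}
def fit(X, y):
    fitted["n_samples"] = len(y)

X_train = np.array([[1.0], [-1.0]])
y_train = np.array([1, 0])
X_unlabeled = np.array([[2.0], [0.1], [-2.0]])
X_new, y_new = self_train_round(fit, predict_proba, X_train, y_train, X_unlabeled)
```

Only the two unlabeled documents whose predicted probability is confidently far from 0.5 are promoted to new training documents; the borderline one (probability near 0.57) is left in the pool.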
  • an unsupervised learning algorithm is used to obtain the feature vectors of the multiple training documents; then, based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm is used to train a document classification model, where a classification label is either a target category label or a non-target category label.
  • the device for document classification model training includes a processor and a memory.
  • the above-mentioned first learning acquisition unit and training acquisition unit are both stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
  • the processor contains the kernel, and the kernel calls the corresponding program unit from the memory.
  • the kernel can be set to one or more.
  • the memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • the device may specifically include, for example:
  • the second learning and obtaining unit 601 is configured to use an unsupervised learning algorithm to learn and obtain the feature vector of the document to be classified based on the context of the word in the document to be classified, the vector of the word and the identification of the document to be classified;
  • the document classification unit 602 is configured to input the feature vector of the document to be classified into the document classification model for document classification.
  • an unsupervised learning algorithm is used to obtain the feature vector of the document to be classified; the feature vector is then input into the document classification model for document classification. It can be seen that the context of each word in the document to be classified and the identifier of the document are used as input, and the vector of the word is used as the output;
  • the feature vector of the document to be classified is thus extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document, which improves the generality of the feature vector; the document classification model therefore classifies documents without classification labels more accurately, and its actual classification effect is better.
  • the document classification device includes a processor and a memory.
  • the second learning acquisition unit and the document classification unit are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
  • the processor contains the kernel, and the kernel calls the corresponding program unit from the memory.
  • One or more kernels can be set.
  • the memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • the embodiment of the present application provides a storage medium on which a program is stored, and when the program is executed by a processor, the method for training the document classification model or the method for document classification is realized.
  • An embodiment of the present application provides a device that includes a processor, a memory, and a program stored on the memory and capable of running on the processor, and the processor implements the following steps when the program is executed:
  • based on the context of the words in the multiple training documents, the vectors of those words, and the identifiers of the multiple training documents, learning the feature vectors of the multiple training documents using an unsupervised learning algorithm;
  • based on the feature vectors and classification labels of the multiple training documents, training a document classification model using a binary classification algorithm, where a classification label is either a target category label or a non-target category label.
  • learning the feature vectors of the multiple training documents using an unsupervised learning algorithm includes:
  • learning the feature vector of each word in each training document using an unsupervised learning algorithm;
  • fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
  • learning the feature vector of each word in each training document using an unsupervised learning algorithm includes:
  • for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and using an unsupervised learning algorithm to learn the feature vector of each word in the training document;
  • the above learning further includes training an initial neural network model with the same inputs and outputs, and obtaining the feature vector of each word from the trained model parameters.
  • the dimension of the feature vector is smaller than the total number of training documents.
  • the method further includes:
  • the preset classification model is capable of predicting classification probabilities, and a predicted classification label is either the target category label or the non-target category label;
  • the document classification model is trained again using a binary classification algorithm.
  • the devices described herein may be servers, PCs, tablets (PADs), mobile phones, and the like.
  • using the document classification model trained by the above method for document classification model training, the following steps are implemented:
  • the feature vector of the document to be classified is input into the document classification model for document classification.
  • This application also provides a computer program product, which when executed on a data processing device, is suitable for executing a program that initializes the following method steps:
  • the vectors of the words and the identifications of the multiple training documents learning to obtain the feature vectors of the multiple training documents by using an unsupervised learning algorithm
  • a binary classification algorithm is used to train a document classification model, and a classification label is either a target category label or a non-target category label.
  • learning the feature vectors of the multiple training documents using an unsupervised learning algorithm includes:
  • learning the feature vector of each word in each training document using an unsupervised learning algorithm;
  • fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
  • learning the feature vector of each word in each training document using an unsupervised learning algorithm includes:
  • for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and using an unsupervised learning algorithm to learn the feature vector of each word in the training document;
  • the above learning further includes training an initial neural network model with the same inputs and outputs, and obtaining the feature vector of each word from the trained model parameters.
  • the dimension of the feature vector is smaller than the total number of training documents.
  • the method further includes:
  • the preset classification model is capable of predicting classification probabilities, and a predicted classification label is either the target category label or the non-target category label;
  • the document classification model is trained again using a binary classification algorithm.
  • using the document classification model trained by the above method for document classification model training, the following steps are implemented:
  • the feature vector of the document to be classified is input into the document classification model for document classification.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
  • this application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a method for training a document classification model, and a related apparatus. The method comprises: on the basis of the context of a word in a document, a vector of the word, and an identifier of the document, obtaining a feature vector of the document by using an unsupervised learning algorithm; and taking documents labeled with classification tags as training documents, and on the basis of the feature vectors and classification tags of a plurality of training documents, obtaining a document classification model by training with a binary classification algorithm, wherein the classification tags are target category tags or non-target category tags. The feature vector of a document is thus extracted on the basis of an unsupervised algorithm, by taking the context of the word in the document and the identifier of the document as input and the vector of the word as output, and by taking into account the correlation between the context of the word and the other contexts in the same document. This improves the generality of the document's feature vector, so that the trained document classification model performs better in practice on documents that are not labeled with classification tags, thereby improving the classification accuracy of the document classification model.

Description

Method and related apparatus for document classification model training

This application claims priority to Chinese patent application No. 201910907014.4, filed with the Chinese Patent Office on September 24, 2019 and entitled "Method and related apparatus for document classification model training", the entire contents of which are incorporated herein by reference.
Technical field

This application relates to the field of data processing technology, and in particular to a method and related apparatus for training a document classification model.

Background

With the rapid development of knowledge engineering and the digitization of the oil and gas industry, accumulated knowledge has produced a massive number of oil and gas domain documents, and making full and efficient use of these documents has gradually become a focus of digital oilfield construction.

Making full and efficient use of oil and gas domain documents requires supporting rapid professional-knowledge queries and meeting application requirements such as knowledge retrieval, knowledge question answering, and information extraction. All of these depend on the classified management of oil and gas domain documents; that is, under a professional labeling system formulated by domain experts, massive numbers of oil and gas domain documents must be marked with reasonable category labels, for example, exploration, development, drilling, logging, construction, and many others.

At present, domain document classification usually uses classic methods such as the bag-of-words model and TF-IDF weighting to extract classification features. These features tend to emphasize the frequency of words in a domain document while ignoring word order and context, and therefore lack generality, so the trained document classification model is prone to overfitting; as a result, it performs poorly in practice on domain documents that are not marked with classification labels. In other words, the current way of extracting classification features from domain documents leads to low classification accuracy in the trained document classification model.
Summary of the invention

In view of the above problems, this application provides a method and related apparatus for training a document classification model, so that the trained document classification model performs better in practice on documents that are not marked with classification labels, thereby improving the classification accuracy of the document classification model.
In a first aspect, an embodiment of this application provides a method for training a document classification model, the method comprising:

learning the feature vectors of multiple training documents using an unsupervised learning algorithm, based on the context of the words in the training documents, the vectors of those words, and the identifiers of the training documents;

training a document classification model using a binary classification algorithm, based on the feature vectors and classification labels of the multiple training documents, where a classification label is either a target category label or a non-target category label.
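The second step of the first aspect does not name a specific binary classification algorithm; as one assumed possibility, plain logistic regression over the documents' feature vectors can be sketched as:

```python
import numpy as np

def train_binary_classifier(doc_vecs, labels, lr=0.1, epochs=500):
    """Logistic regression by gradient descent: one possible binary
    classification algorithm over document feature vectors."""
    X = np.asarray(doc_vecs, dtype=float)
    y = np.asarray(labels, dtype=float)   # 1 = target category, 0 = non-target
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                      # gradient of the cross-entropy loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Made-up 2-D feature vectors for four labeled training documents.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_binary_classifier(doc_vecs, labels)
preds = (1.0 / (1.0 + np.exp(-(np.asarray(doc_vecs) @ w + b))) >= 0.5)
```

In the method itself the feature vectors would come from the unsupervised learning step rather than being hand-written as here.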
Optionally, learning the feature vectors of the multiple training documents using an unsupervised learning algorithm, based on the context of the words in the training documents, the vectors of those words, and the identifiers of the training documents, includes:

learning the feature vector of each word in each training document using an unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;

fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.

Optionally, learning the feature vector of each word in each training document using an unsupervised learning algorithm, based on the context of each word, the vector of each word, and the identifier of the corresponding training document, includes:

segmenting each training document with a word segmentation tool to obtain the words in each training document;

for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and learning the feature vector of each word in the training document using an unsupervised learning algorithm.

Optionally, for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and learning the feature vector of each word in the training document using an unsupervised learning algorithm, includes:

for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and training an initial neural network model using an unsupervised learning algorithm;

obtaining the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
Optionally, the dimension of the feature vector is smaller than the total number of training documents.

Optionally, after training the document classification model using the binary classification algorithm, the method further includes:

taking multiple documents that are not marked with a classification label as unlabeled documents, and predicting the predicted classification labels and predicted classification probabilities of the unlabeled documents based on the unlabeled documents and a preset classification model, where the preset classification model is capable of predicting classification probabilities and a predicted classification label is either the target category label or the non-target category label;

screening out the unlabeled documents whose predicted classification probability is higher than a preset probability threshold as multiple new training documents;

training the document classification model again using a binary classification algorithm, based on the feature vectors and predicted classification labels of the multiple new training documents.
Optionally, the method further includes:

adjusting the ratio of target category labels to non-target category labels among the multiple training documents or the new training documents, based on a preset ratio of positive to negative samples.
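The ratio adjustment above can be realized, for example, by downsampling the majority class; the helper below is a hypothetical illustration using a preset positive-to-negative ratio of 1:1:

```python
import random

def balance_samples(docs, labels, pos_neg_ratio=1.0, seed=0):
    """Downsample the negative (non-target) class so that the number of
    positives over negatives matches the preset ratio."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(neg), round(len(pos) / pos_neg_ratio))
    idx = pos + rng.sample(neg, keep)
    rng.shuffle(idx)
    return [docs[i] for i in idx], [labels[i] for i in idx]

# 3 target documents against 9 non-target documents (made-up identifiers).
docs = [f"doc_{i}" for i in range(12)]
labels = [1, 1, 1] + [0] * 9
bal_docs, bal_labels = balance_samples(docs, labels, pos_neg_ratio=1.0)
```

Oversampling the minority class would serve equally well; the patent only requires that the resulting proportion match the preset positive/negative sample ratio.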
In a second aspect, an embodiment of this application provides a method for document classification using a document classification model trained by the method for training a document classification model according to any one of the first aspect, the method comprising:

learning the feature vector of the document to be classified using an unsupervised learning algorithm, based on the context of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;

inputting the feature vector of the document to be classified into the document classification model for document classification.
In a third aspect, an embodiment of this application provides an apparatus for training a document classification model, the apparatus comprising:

a first learning and obtaining unit, configured to learn the feature vectors of multiple training documents using an unsupervised learning algorithm, based on the context of the words in the training documents, the vectors of those words, and the identifiers of the training documents;

a training and obtaining unit, configured to train a document classification model using a binary classification algorithm, based on the feature vectors and classification labels of the multiple training documents, where a classification label is either a target category label or a non-target category label.

In a fourth aspect, an embodiment of this application provides an apparatus for document classification using a document classification model trained by the method for training a document classification model according to any one of the first aspect, the apparatus comprising:

a second learning and obtaining unit, configured to learn the feature vector of the document to be classified using an unsupervised learning algorithm, based on the context of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;

a document classification unit, configured to input the feature vector of the document to be classified into the document classification model for document classification.
Compared with the prior art, this application has at least the following advantages:

With the technical solutions of the embodiments of this application, first, the feature vectors of multiple training documents are obtained with an unsupervised learning algorithm, based on the context of the words in the training documents, the word vectors, and the identifiers of the training documents; then, a document classification model is trained with a binary classification algorithm based on the feature vectors and classification labels of the training documents, where a classification label is either a target category label or a non-target category label. The context of the words in a training document and the identifier of the document are taken as input, and the vector of each word as output, so the feature vector of the training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document. This improves the generality of the training documents' feature vectors, so that the trained document classification model performs better in practice on documents that are not marked with classification labels, thereby improving the classification accuracy of the document classification model.

The above description is only an overview of the technical solution of this application. To make the technical means of this application clear enough to be implemented in accordance with the content of the specification, and to make the above and other purposes, features, and advantages of this application more apparent and understandable, specific embodiments of this application are set out below.
附图说明Description of the drawings
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本申请的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only used for the purpose of illustrating the preferred embodiments, and are not considered as a limitation to the application. Also, throughout the drawings, the same reference symbols are used to denote the same components. In the attached picture:
FIG. 1 is a schematic diagram of a system framework involved in an application scenario according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for training a document classification model according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another method for training a document classification model according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a document classification method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for training a document classification model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a document classification apparatus according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
As digitization advances in the oil and gas industry, making full and efficient use of the massive accumulated body of oil and gas domain documents requires assigning each document a reasonable category label under a professional labeling system, for example labels such as exploration, development, drilling, well logging, or construction. At present, classification features of domain documents are generally extracted with classic methods that focus on the word frequencies in the documents, such as the bag-of-words model or TF-IDF weighting. However, the inventors found through research that these methods ignore the order and context of the words in a domain document, so the extracted classification features lack generality; as a result, the trained document classification model is prone to overfitting and performs poorly when actually classifying domain documents that have not been labeled. That is, the current way of extracting classification features from domain documents yields a trained document classification model with low classification accuracy.
Document classification has great application value in fields such as document retrieval, document cluster management, and document recommendation. Determining a document's classification is an upstream task in these fields: it provides data support for downstream document processing tasks, so inaccurate classification further degrades subsequent processing.
To solve this problem, in the embodiments of the present application, first, feature vectors of multiple training documents are obtained with an unsupervised learning algorithm based on the contexts of the words in the training documents, the word vectors, and the identifiers of the training documents; then, a document classification model is obtained by training with a binary classification algorithm based on the feature vectors and classification labels of the training documents, where each classification label is either a target-category label or a non-target-category label. It can thus be seen that the contexts of the words in a training document and the identifier of the document serve as input, the word vectors serve as output, and the feature vector of the training document is extracted with an unsupervised algorithm, taking into account both the context of each word and the relationships among contexts within the same document. This improves the generality of the training documents' feature vectors, so that the trained document classification model performs better on unlabeled documents, thereby improving its classification accuracy. For example, in the field of document recommendation, after a user reads a document, other documents of the same category can be recommended to the user.
The solution of the present invention can classify the massive documents on the Internet accurately in advance, find among them target documents of the same category as the document the user is reading, and recommend those target documents to the user, so that the pushed documents match the user's preferences more accurately. As another example, the present invention can also cluster and manage papers on an academic paper website. The present invention has many fields of application; in these fields it classifies documents accurately while also avoiding the manpower and material costs of manually extracting keywords.
For example, one scenario of the embodiments of the present application may be the scenario shown in FIG. 1, which includes a terminal 101 and a processor 102. The terminal 101 may be a PC or another mobile terminal such as a mobile phone or a tablet. A user determines multiple training documents through the terminal 101 and sends them to the processor 102; the processor 102 obtains the feature vectors of the training documents using the first step of the implementations of the embodiments of the present application, and then obtains the document classification model using the second step.
It can be understood that, although the actions of the embodiments of the present application are described in the above application scenario as being performed by the processor 102, these actions may also be performed by the terminal 101, or partly by the terminal 101 and partly by the processor 102. The present application is not limited with respect to the executing entity, as long as the actions disclosed in the embodiments of the present application are performed.
It can be understood that the above scenario is only an example of a scenario provided by an embodiment of the present application, and the embodiments of the present application are not limited to this scenario.
Specific implementations of the method for training a document classification model and of the related apparatus in the embodiments of the present application are described in detail below by way of embodiments with reference to the accompanying drawings.
Exemplary Method
Referring to FIG. 2, a schematic flowchart of a method for training a document classification model in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the following steps:
Step 201: Based on the contexts of the words in multiple training documents, the vectors of the words, and the identifiers of the training documents, learn with an unsupervised learning algorithm to obtain feature vectors of the training documents.
It should be noted that, in the embodiments of the present application, a document refers to a domain document, in particular an oil and gas domain document; that is, a training document refers to a training domain document, in particular a training document in the oil and gas domain. Classic methods such as the bag-of-words model and TF-IDF weighting extract classification features of domain documents by focusing only on word frequency, ignoring the order and context of the words, so the extracted classification features lack generality; a document classification model trained on such features overfits easily and performs poorly when actually used to predict the categories of unlabeled domain documents. Therefore, in the embodiments of the present application, the context of each word in a document and the relationships among contexts within the same document need to be considered, and an unsupervised learning algorithm is used to learn from the contexts of the words in the training documents, the word vectors, and the document identifiers, so as to obtain training-document feature vectors with general applicability.
It should be noted that running the unsupervised learning algorithm on the contexts of the words, the word vectors, and the document identifiers actually yields directly the feature vector of each word in a training document; the feature vectors of the individual words must be fused to obtain the feature vector of the training document. Therefore, in an optional implementation of the embodiment of the present application, step 201 may include, for example, the following steps:
Step A: Based on the context of each word in each training document, the vector of each word, and the identifier of the corresponding training document, learn with an unsupervised learning algorithm to obtain the feature vector of each word in each training document.
In a specific implementation of step A, first, the individual words of a training document must be obtained before the context and vector of each word can be determined; the words of a training document are obtained by segmenting the document with a word segmentation tool. Then, since the unsupervised learning is in fact intended to learn the context of each word in a training document and the relationships among contexts within the same document, the context of each word and the identifier of the training document are taken as input, the vector of each word is taken as output, and unsupervised learning is performed to obtain the feature vector of each word in the training document. Therefore, in an optional implementation of the embodiment of the present application, step A may include, for example, the following steps:
Step A1: Segment each training document with a word segmentation tool to obtain the individual words of each training document.
In a specific implementation of step A1, for training domain documents it is usually necessary to first introduce a domain-specific professional dictionary to be used together with the word segmentation tool; for example, training documents in the oil and gas domain are segmented with a word segmentation tool combined with a professional oil and gas dictionary.
Step A2: For each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, learn with an unsupervised learning algorithm to obtain the feature vector of each word in the training document.
In a specific implementation of step A2, an initial neural network model is in fact set in advance, for example a single-hidden-layer initial neural network model with initialized model parameters. Taking the context of each word and the identifier of the document as input and the vector of each word as output, training this initial neural network model with an unsupervised learning algorithm actually means training the initialized model parameters of the model; after training is completed for each word, the feature vector of each word can be obtained from the trained model parameters. Therefore, in an optional implementation of the embodiment of the present application, step A2 may include, for example:
Step A21: For each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, train an initial neural network model with an unsupervised learning algorithm;
Step A22: Obtain the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
Step B: Fuse the feature vectors of the individual words in each training document to obtain the feature vector of each training document.
In a specific implementation of step B, a vector fusion formula may be set in advance as a preset vector fusion formula, and the feature vectors of the individual words of each training document are substituted into the preset vector fusion formula to obtain a fused feature vector as the feature vector of that training document. Of course, in the embodiments of the present application, other specific implementations may also be used to perform step B, as long as the feature vectors of the individual words are fused.
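As an illustrative, non-limiting sketch of step B (the application does not prescribe a particular fusion formula or programming language; the element-wise mean used here is one possible preset fusion formula chosen for illustration):

```python
def fuse_word_vectors(word_vectors):
    """Fuse the per-word feature vectors of one training document into a
    single document feature vector.

    The preset fusion formula here is the element-wise mean; any other
    fusion rule (sum, weighted average, ...) could be substituted, as the
    embodiment only requires that the word vectors be fused.
    """
    if not word_vectors:
        raise ValueError("document contains no word vectors")
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]

# Example: three 2-dimensional word feature vectors from one training document.
doc_vector = fuse_word_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The resulting document vector has the same dimensionality as the word vectors, so the dimensionality constraint discussed for the feature vectors carries over unchanged.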
It should also be noted that the dimensionality of the feature vectors can be set in advance. Considering that it should not be too large, it can be set based on the total number of training documents; generally, the dimensionality is set smaller than the total number of training documents. In particular, when the dimensionality is much smaller than the total number of training documents, that is, when the difference between the total number of training documents and the feature vector dimensionality exceeds a certain preset difference, the dimensionality of the feature vectors is greatly reduced and the efficiency of obtaining document feature vectors is improved. Therefore, in an optional implementation of the embodiment of the present application, the dimensionality of the feature vectors is smaller than the total number of training documents.
It should further be noted that the unsupervised learning algorithm in step 201 may be the common doc2vec algorithm or word2vec algorithm, where doc2vec is an extension of word2vec.
Step 202: Based on the feature vectors and classification labels of the multiple training documents, train with a binary classification algorithm to obtain a document classification model, where each classification label is a target-category label or a non-target-category label.
It should be noted that the training-document feature vectors obtained in step 201 are used to train the document classification model; each training document must be marked with a target-category label or a non-target-category label as its classification label. Training on the feature vectors and classification labels of the multiple training documents with a binary classification algorithm yields a binary classification model, which serves as the document classification model.
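Step 202 could be sketched, for illustration only, with a plain logistic regression trained by gradient descent (logistic regression is one binary classification algorithm the embodiments mention later as capable of predicting probabilities; the toy data and hyperparameters are assumptions, not part of the application):

```python
import math

def train_binary_classifier(features, labels, lr=0.5, epochs=200):
    """Train a logistic-regression binary classifier by gradient descent.

    features: document feature vectors from step 201; labels: 1 for the
    target-category label, 0 for the non-target-category label.
    """
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(target category)
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict_proba(w, b, x):
    """Predicted probability that document x belongs to the target category."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy 2-D feature vectors: target-category documents cluster high,
# non-target-category documents cluster low.
X = [[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.5, -2.5]]
y = [1, 1, 0, 0]
w, b = train_binary_classifier(X, y)
```

Because this classifier outputs a probability, it would also satisfy the requirement on the preset classification model used for re-learning in the second embodiment below.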
The target-category label may be, for example, any of the category labels under a professional labeling system, such as an "exploration" label, a "development" label, a "drilling" label, a "well logging" label, or a "construction" label.
It should also be noted that training documents marked with the target-category label serve as positive samples, and training documents marked with the non-target-category label serve as negative samples. Since the documents marked with the target-category label may account for only a small proportion of the training documents, while those marked with the non-target-category label account for a large proportion, the training documents suffer from an imbalance between positive and negative samples, which affects the training of the document classification model to some extent. Therefore, in the embodiments of the present application, a reasonable positive-to-negative sample ratio can be set in advance as a preset ratio, and the proportion of target-category labels to non-target-category labels among the training documents can be adjusted to match it. In an optional implementation of the embodiment of the present application, the method may further include, for example, the step of adjusting the ratio of the target-category labels to the non-target-category labels among the multiple training documents or the new training documents based on the preset positive-to-negative sample ratio.
This approach resolves the imbalance between positive and negative samples among the training documents, so that the positive samples are trained more thoroughly during document classification model training, thereby improving the classification accuracy of the trained document classification model.
In a specific implementation, the ratio of target-category labels to non-target-category labels among the training documents can be adjusted with an under-sampling method or an over-sampling method so as to satisfy the preset positive-to-negative sample ratio. The under-sampling method samples the negative examples among the training documents to reduce their number, that is, it samples the training documents marked with the non-target-category label down to a smaller number; the over-sampling method duplicates positive examples to increase their number, that is, it duplicates training documents marked with the target-category label to increase their count.
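Both rebalancing strategies can be sketched as follows (a hypothetical helper for illustration; the target ratio of 1.0 is an assumed preset value, not one fixed by the application):

```python
import random

def rebalance(positives, negatives, target_ratio=1.0, oversample=True, seed=0):
    """Adjust the positive/negative sample ratio toward a preset target_ratio.

    target_ratio is the desired positives-to-negatives ratio (1.0 = balanced).
    oversample=True duplicates positive samples (over-sampling);
    oversample=False samples the negatives down instead (under-sampling).
    """
    rng = random.Random(seed)
    if oversample:
        wanted = int(len(negatives) * target_ratio)
        extra = [rng.choice(positives) for _ in range(wanted - len(positives))]
        return positives + extra, negatives
    wanted = int(len(positives) / target_ratio)
    return positives, rng.sample(negatives, wanted)

# 3 target-labeled vs 12 non-target-labeled training documents.
pos = ["p1", "p2", "p3"]
neg = [f"n{i}" for i in range(12)]
p_over, n_over = rebalance(pos, neg, oversample=True)      # duplicate positives
p_under, n_under = rebalance(pos, neg, oversample=False)   # sample negatives down
```

Over-sampling keeps all 12 negatives and grows the positives to match; under-sampling keeps the 3 positives and shrinks the negatives to match.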
Through the various implementations provided in this embodiment, feature vectors of multiple training documents are obtained with an unsupervised learning algorithm based on the contexts of the words in the training documents, the word vectors, and the document identifiers; then a document classification model is obtained by training with a binary classification algorithm based on the feature vectors and classification labels of the training documents, where each classification label is a target-category label or a non-target-category label. It can thus be seen that, with the contexts of the words and the document identifier as input and the word vectors as output, the feature vector of a training document is extracted with an unsupervised algorithm, taking into account the context of each word and the relationships among contexts within the same document. This improves the generality of the training documents' feature vectors, so that the trained document classification model performs better on unlabeled documents, thereby improving its classification accuracy.
It should be noted that, in the face of massive numbers of documents, manually annotating large numbers of documents with classification labels consumes considerable manpower and material resources and wastes much time. Consequently, the number of documents without classification labels far exceeds the number of labeled documents; that is, the total number of training documents is not large, which likewise affects the training of the document classification model to some extent, so additional new training documents need to be obtained by augmentation from the unlabeled documents.
In practical applications, on the basis of the above embodiment, a classification model whose predicted classification labels are the target-category label or the non-target-category label and which is capable of predicting classification probabilities is obtained in advance as a preset classification model. Testing a batch of unlabeled documents with this preset classification model yields their predicted classification labels and predicted classification probabilities. By setting a probability threshold in advance as a preset probability threshold for assessing the reliability of the predicted classification labels, documents whose predicted labels are highly reliable can be selected from the batch of unlabeled documents based on the preset probability threshold and the predicted classification probabilities. These serve as new training documents for the next round of document classification model training, and retraining achieves re-learning of the document classification model of the above embodiment. Accordingly, a specific implementation of another method for training a document classification model in an embodiment of the present application is described in detail below with reference to FIG. 3.
Referring to FIG. 3, a schematic flowchart of another method for training a document classification model in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the following steps:
Step 301: Based on the contexts of the words in multiple training documents, the vectors of the words, and the identifiers of the training documents, learn with an unsupervised learning algorithm to obtain feature vectors of the training documents.
Step 302: Based on the feature vectors and classification labels of the multiple training documents, train with a binary classification algorithm to obtain a document classification model, where each classification label is a target-category label or a non-target-category label.
It should be noted that, in this embodiment of the present application, steps 301 and 302 are the same as steps 201 and 202; for their specific implementation, reference may be made to the relevant description of the above embodiment, which is not repeated here.
Step 303: Take multiple documents not yet marked with a classification label as unlabeled documents, and, based on the unlabeled documents and a preset classification model, predict to obtain the predicted classification labels and predicted classification probabilities of the unlabeled documents, where the preset classification model is capable of predicting classification probabilities and each predicted classification label is the target-category label or the non-target-category label.
The preset classification model may be the document classification model obtained in step 302; in that case, the binary classification algorithm used in step 302 to obtain the document classification model must be capable of predicting classification probabilities, for example a logistic regression algorithm, and continuing with the subsequent steps 304 and 305 achieves self-learning of that document classification model. Of course, the embodiments of the present application do not require the preset classification model to be the document classification model obtained in step 302; it may be another document classification model, as long as its predicted classification labels are the target-category label or the non-target-category label and it is capable of predicting classification probabilities.
Step 304: Select the unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain multiple new training documents.
It should be noted that, in a specific implementation of step 304, after the unlabeled documents whose predicted classification probability is greater than or equal to the preset probability threshold have been selected, expert review may also be introduced to confirm whether the predicted classification label matches the actual classification label, so as to improve the reliability of the predicted labels of the new training documents.
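The selection of step 304 can be sketched as a simple threshold filter (the threshold value of 0.9 and the tuple layout are illustrative assumptions; the application only requires some preset probability threshold):

```python
def select_new_training_documents(predictions, threshold=0.9):
    """Filter self-training candidates by predicted classification probability.

    predictions: (document_id, predicted_label, predicted_probability) tuples
    produced by the preset classification model on the unlabeled documents.
    Documents whose probability reaches the preset threshold become new
    training documents; expert review could then be applied to the result.
    """
    return [(doc_id, label) for doc_id, label, proba in predictions
            if proba >= threshold]

preds = [
    ("doc0", "target", 0.97),
    ("doc1", "non-target", 0.55),   # too uncertain -> discarded
    ("doc2", "non-target", 0.93),
]
new_training_docs = select_new_training_documents(preds)
```

The retained documents, with their predicted labels treated as classification labels, then feed the retraining of step 305.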
同理,基于上述实施例的说明可知,标记目标类别标签的新训练文档作为正样本,标记非目标类别标签的新训练文档作为负样本,对于多个新训练文档而言,同样可能存在正负样本不均衡问题,也需要基于预设正负样本比例进行调整。因此,在本申请实施例一种可选的实施方式中,例如还可以包括步骤:基于预设正负样本比例调整多个所述新训练文档中所述目标类别标签与所述非目标类别标签的比例。该方式解决多个新训练文档中正负样本不均衡问题,使得文档分类模型训练过程中对正样本训练更加充分,从而提高训练获得的文档分类模型的分类准确率。Similarly, based on the description of the above embodiment, it can be seen that a new training document labeled with a target category label is used as a positive sample, and a new training document labeled with a non-target category label is used as a negative sample. For multiple new training documents, there may also be positive and negative. The problem of sample imbalance also needs to be adjusted based on the preset positive and negative sample ratio. Therefore, in an optional implementation manner of the embodiment of the present application, for example, it may further include the step of adjusting the target category label and the non-target category label in the multiple new training documents based on a preset ratio of positive and negative samples. proportion. This method solves the problem of the imbalance of positive and negative samples in multiple new training documents, and makes the training of the positive samples in the document classification model training process more fully, thereby improving the classification accuracy of the document classification model obtained by training.
Step 305: Retrain the document classification model with the binary classification algorithm, based on the feature vectors and the predicted classification labels of the plurality of new training documents.
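Steps 303 to 305 together form a self-training (pseudo-labeling) round, which can be sketched as follows. The `fit`/`predict_proba` interface and the toy model are assumptions for illustration; the embodiment only requires a classifier that outputs a label and a probability.

```python
def self_training_round(model, labeled, unlabeled, threshold=0.9):
    """One expand-and-retrain round: pseudo-label confident unlabeled
    documents (step 304) and retrain on the enlarged set (step 305)."""
    new_docs = []
    for x in unlabeled:
        label, prob = model.predict_proba(x)
        if prob >= threshold:          # keep only confident predictions
            new_docs.append((x, label))
    X = [x for x, _ in labeled] + [x for x, _ in new_docs]
    y = [l for _, l in labeled] + [l for _, l in new_docs]
    model.fit(X, y)                    # retrain on old + new documents
    return model, new_docs

class ToyModel:
    """Stand-in classifier: confident 'positive' when the feature is large."""
    def fit(self, X, y):
        self.n = len(X)
    def predict_proba(self, x):
        return (1, 0.95) if x >= 5 else (0, 0.6)

m = ToyModel()
m, added = self_training_round(m, [(1, 0), (9, 1)], [2, 7, 8], threshold=0.9)
# documents 7 and 8 are added with pseudo-label 1; document 2 is too uncertain
```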
Through the various implementations provided in this embodiment, the feature vectors of a plurality of training documents are obtained with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents; a document classification model is trained with a binary classification algorithm based on the feature vectors and the classification labels of the training documents; a plurality of documents without classification labels are taken as unlabeled documents, and their predicted classification labels and predicted classification probabilities are obtained from the unlabeled documents and a preset classification model; the unlabeled documents whose predicted classification probability is higher than the preset probability threshold are screened out to obtain a plurality of new training documents; and the document classification model is iteratively retrained with the binary classification algorithm, based on the feature vectors and the predicted classification labels of the new training documents. As can be seen, the contexts of the words in a training document and the identifier of the document serve as input, the vectors of the words serve as output, and the feature vector of the training document is extracted with an unsupervised algorithm. This takes into account the context of each word and the correlations between contexts within the same document, which improves the generality of the feature vectors of the training documents, so that the trained document classification model has a better actual classification effect on documents without classification labels. Moreover, a model re-learning scheme is designed: the preset classification model predicts labels for the unlabeled documents, the unlabeled documents whose predicted classification labels are highly reliable are screened out and added as new training documents, and the document classification model is trained again, thereby improving its classification accuracy.
It should also be noted that, since the document classification model trained in the foregoing embodiment has a better actual classification effect on documents without classification labels, in practical applications this model is used to classify documents to be classified. Therefore, a specific implementation of a method for document classification in an embodiment of the present application is described in detail below with reference to FIG. 4.
Step 401: Based on the contexts of the words in a document to be classified, the vectors of those words, and the identifier of the document to be classified, learn the feature vector of the document to be classified with an unsupervised learning algorithm.
Step 402: Input the feature vector of the document to be classified into the document classification model for document classification.
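Step 402 can be sketched as follows, assuming a logistic model as the binary classifier; the embodiment does not fix the classifier type, so this choice and the toy weights are assumptions.

```python
import math

def predict(doc_vec, weights, bias=0.0):
    """Score a document feature vector with a trained binary classifier.
    Returns (label, probability); label 1 denotes the target category."""
    z = sum(w * x for w, x in zip(weights, doc_vec)) + bias
    prob = 1.0 / (1.0 + math.exp(-z))   # sigmoid of the linear score
    label = 1 if prob >= 0.5 else 0
    return label, prob

# Toy feature vector and trained weights (illustrative values)
label, prob = predict([0.8, -0.2], weights=[2.0, 1.0], bias=0.0)
# the document is assigned the target category label
```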
Through the various implementations provided in this embodiment, the feature vector of a document to be classified is obtained with an unsupervised learning algorithm, based on the contexts of the words in the document, the vectors of those words, and the identifier of the document; the feature vector of the document to be classified is then input into the document classification model for document classification. As can be seen, the contexts of the words in the document to be classified and the identifier of the document serve as input, the vectors of the words serve as output, and the feature vector of the document to be classified is extracted with an unsupervised algorithm. This takes into account the context of each word and the correlations between contexts within the same document, which improves the generality of the feature vector of the document to be classified; moreover, the document classification model achieves high classification accuracy on documents to be classified that carry no classification label, and thus a good actual classification effect.
Exemplary Apparatus
Referring to FIG. 5, a schematic structural diagram of an apparatus for training a document classification model in an embodiment of the present application is shown. In this embodiment, using the document classification model obtained through training in the foregoing embodiment, the apparatus may specifically include, for example:
a first learning and obtaining unit 501, configured to learn the feature vectors of a plurality of training documents with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents;
a training and obtaining unit 502, configured to train a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, where each classification label is a target category label or a non-target category label.
In an optional implementation of the embodiments of the present application, the first learning and obtaining unit 501 includes:
a learning and obtaining subunit, configured to learn the feature vector of each word in each training document with an unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
a fusion and obtaining subunit, configured to fuse the feature vectors of the words in each training document to obtain the feature vector of that training document.
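The fusion step can be sketched as follows. Element-wise averaging is an assumption: the embodiment only states that the per-word feature vectors are fused into one document feature vector, without fixing the fusion operation.

```python
def fuse_word_vectors(word_vecs):
    """Fuse the feature vectors of the words in one document into a
    single document feature vector by element-wise averaging."""
    dim = len(word_vecs[0])
    n = len(word_vecs)
    return [sum(v[i] for v in word_vecs) / n for i in range(dim)]

# Three 2-dimensional word vectors from one document (toy values)
doc_vec = fuse_word_vectors([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# each dimension averages to 2/3
```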
In an optional implementation of the embodiments of the present application, the learning and obtaining subunit includes:
a word segmentation and obtaining module, configured to segment each training document with a word segmentation tool to obtain the words in that training document;
a learning and obtaining module, configured to learn, for each training document, the feature vector of each word in the document with an unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output.
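The segmentation and context-extraction step can be sketched as follows. The embodiment does not name the word segmentation tool (for Chinese text, a tool such as jieba would typically produce the token list); a plain whitespace split stands in for it here, and the window size of 2 is an assumption.

```python
def contexts(tokens, window=2):
    """For each token position, pair the surrounding words (the model
    input) with the centre word (the model output)."""
    pairs = []
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, word))
    return pairs

# A real segmentation tool would tokenize the document text; a plain
# split stands in for it in this English toy example.
tokens = "the model classifies legal documents".split()
pairs = contexts(tokens, window=2)
# pairs[0] is (['model', 'classifies'], 'the')
```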
In an optional implementation of the embodiments of the present application, the learning and obtaining module includes:
a training submodule, configured to train, for each training document, an initial neural network model with an unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output;
an obtaining submodule, configured to obtain the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
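Reading the word feature vectors out of the trained model parameters can be sketched as follows. This assumes a paragraph-vector-style (PV-DM-like) model in which the learned input-embedding matrix holds one row per vocabulary word; the helper and variable names are illustrative, since the embodiment does not specify the network architecture.

```python
def word_vector(word, vocab, embedding_matrix):
    """Look up the feature vector of `word` from the trained parameters:
    `vocab` maps a word to its row index in the embedding matrix."""
    return embedding_matrix[vocab[word]]

# Trained parameters after unsupervised learning (toy values)
vocab = {"contract": 0, "court": 1}
embedding_matrix = [[0.1, 0.4, -0.2],
                    [0.3, -0.1, 0.5]]
vec = word_vector("court", vocab, embedding_matrix)
```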
In an optional implementation of the embodiments of the present application, the dimension of the feature vector is smaller than the total number of the plurality of training documents.
In an optional implementation of the embodiments of the present application, the apparatus further includes:
a prediction and obtaining unit, configured to take a plurality of documents without the classification labels as unlabeled documents and, based on the unlabeled documents and a preset classification model, predict the predicted classification labels and predicted classification probabilities of the unlabeled documents, where the preset classification model is capable of predicting classification probabilities, and each predicted classification label is the target category label or the non-target category label;
a screening and obtaining unit, configured to screen out the unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain a plurality of new training documents;
an iterative training unit, configured to retrain the document classification model with the binary classification algorithm based on the feature vectors and the predicted classification labels of the plurality of new training documents.
In an optional implementation of the embodiments of the present application, the apparatus further includes:
an adjustment unit, configured to adjust, based on a preset positive-to-negative sample ratio, the ratio of the target category label to the non-target category label among the plurality of training documents or the plurality of new training documents.
Through the various implementations provided in this embodiment, first, the feature vectors of a plurality of training documents are obtained with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents; then, a document classification model is trained with a binary classification algorithm based on the feature vectors and the classification labels of the training documents, where each classification label is a target category label or a non-target category label. As can be seen, the contexts of the words in a training document and the identifier of the document serve as input, the vectors of the words serve as output, and the feature vector of the training document is extracted with an unsupervised algorithm. This takes into account the context of each word and the correlations between contexts within the same document, which improves the generality of the feature vectors of the training documents and gives the trained document classification model a better actual classification effect on documents without classification labels, thereby improving the classification accuracy of the document classification model.
The apparatus for training a document classification model includes a processor and a memory. The first learning and obtaining unit, the training and obtaining unit, and the other units described above are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, the context of each word and the correlations between contexts within the same document are taken into account, which improves the generality of the feature vectors of the documents and gives the trained document classification model a better actual classification effect on documents without classification labels, thereby improving the classification accuracy of the document classification model.
The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
Referring to FIG. 6, a schematic structural diagram of an apparatus for document classification in an embodiment of the present application is shown. In this embodiment, using the document classification model obtained through training in the foregoing embodiment, the apparatus may specifically include, for example:
a second learning and obtaining unit 601, configured to learn the feature vector of a document to be classified with an unsupervised learning algorithm, based on the contexts of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;
a document classification unit 602, configured to input the feature vector of the document to be classified into the document classification model for document classification.
Through the various implementations provided in this embodiment, the feature vector of a document to be classified is obtained with an unsupervised learning algorithm, based on the contexts of the words in the document, the vectors of those words, and the identifier of the document; the feature vector of the document to be classified is then input into the document classification model for document classification. As can be seen, the contexts of the words in the document to be classified and the identifier of the document serve as input, the vectors of the words serve as output, and the feature vector of the document to be classified is extracted with an unsupervised algorithm. This takes into account the context of each word and the correlations between contexts within the same document, which improves the generality of the feature vector of the document to be classified; moreover, the document classification model achieves high classification accuracy on documents to be classified that carry no classification label, and thus a good actual classification effect.
The apparatus for document classification includes a processor and a memory. The second learning and obtaining unit, the document classification unit, and the other units described above are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, the context of each word and the correlations between contexts within the same document are taken into account, which improves the generality of the feature vector of the document to be classified; moreover, the document classification model achieves high classification accuracy on documents to be classified that carry no classification label, and thus a good actual classification effect.
The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present application provides a storage medium on which a program is stored; when the program is executed by a processor, the method for training a document classification model or the method for document classification is implemented.
An embodiment of the present application provides a device. The device includes a processor, a memory, and a program stored on the memory and executable on the processor. When executing the program, the processor implements the following steps:
learning the feature vectors of a plurality of training documents with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents;
training a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, where each classification label is a target category label or a non-target category label.
In an optional implementation of the embodiments of the present application, learning the feature vectors of the plurality of training documents with the unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents, includes:
learning the feature vector of each word in each training document with the unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
In an optional implementation of the embodiments of the present application, learning the feature vector of each word in each training document with the unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document, includes:
segmenting each training document with a word segmentation tool to obtain the words in that training document;
learning, for each training document, the feature vector of each word in the document with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output.
In an optional implementation of the embodiments of the present application, learning, for each training document, the feature vector of each word in the document with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output, includes:
training, for each training document, an initial neural network model with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output;
obtaining the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
In an optional implementation of the embodiments of the present application, the dimension of the feature vector is smaller than the total number of the plurality of training documents.
In an optional implementation of the embodiments of the present application, after the document classification model is obtained through training with the binary classification algorithm, the method further includes:
taking a plurality of documents without the classification labels as unlabeled documents and, based on the unlabeled documents and a preset classification model, predicting the predicted classification labels and predicted classification probabilities of the unlabeled documents, where the preset classification model is capable of predicting classification probabilities, and each predicted classification label is the target category label or the non-target category label;
screening out the unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain a plurality of new training documents;
retraining the document classification model with the binary classification algorithm based on the feature vectors and the predicted classification labels of the plurality of new training documents.
In an optional implementation of the embodiments of the present application, the method further includes:
adjusting, based on a preset positive-to-negative sample ratio, the ratio of the target category label to the non-target category label among the plurality of training documents or the plurality of new training documents.
The device herein may be a server, a PC, a PAD, a mobile phone, or the like.
Alternatively, when executing the program, the processor uses the document classification model trained by the method for training a document classification model to implement the following steps:
learning the feature vector of a document to be classified with an unsupervised learning algorithm, based on the contexts of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;
inputting the feature vector of the document to be classified into the document classification model for document classification.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
learning the feature vectors of a plurality of training documents with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents;
training a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, where each classification label is a target category label or a non-target category label.
In an optional implementation of the embodiments of the present application, learning the feature vectors of the plurality of training documents with the unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents, includes:
learning the feature vector of each word in each training document with the unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
In an optional implementation of the embodiments of the present application, learning the feature vector of each word in each training document with the unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document, includes:
segmenting each training document with a word segmentation tool to obtain the words in that training document;
learning, for each training document, the feature vector of each word in the document with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output.
In an optional implementation of the embodiments of the present application, learning, for each training document, the feature vector of each word in the document with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output, includes:
training, for each training document, an initial neural network model with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output;
obtaining the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
In an optional implementation of the embodiments of the present application, the dimension of the feature vector is smaller than the total number of the plurality of training documents.
In an optional implementation of the embodiments of the present application, after the document classification model is obtained through training with the binary classification algorithm, the method further includes:
taking a plurality of documents without the classification labels as unlabeled documents and, based on the unlabeled documents and a preset classification model, predicting the predicted classification labels and predicted classification probabilities of the unlabeled documents, where the preset classification model is capable of predicting classification probabilities, and each predicted classification label is the target category label or the non-target category label;
screening out the unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain a plurality of new training documents;
retraining the document classification model with the binary classification algorithm based on the feature vectors and the predicted classification labels of the plurality of new training documents.
In an optional implementation of the embodiments of the present application, the method further includes:
adjusting, based on a preset positive-to-negative sample ratio, the ratio of the target category label to the non-target category label among the plurality of training documents or the plurality of new training documents.
Alternatively, when executing the program, the processor uses the document classification model trained by the method for training a document classification model to implement the following steps:
learning the feature vector of a document to be classified with an unsupervised learning algorithm, based on the contexts of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;
inputting the feature vector of the document to be classified into the document classification model for document classification.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-persistent memory in computer-readable media, for example random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

  1. A method for training a document classification model, characterized in that it comprises:
    learning, with an unsupervised learning algorithm, the feature vectors of a plurality of training documents based on the contexts of the words in the plurality of training documents, the vectors of the words, and the identifiers of the plurality of training documents;
    training a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, wherein the classification label is a target category label or a non-target category label.
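Claim 1's second stage, training a binary classifier on the document feature vectors, can be sketched with a minimal logistic-regression trainer. The feature vectors are assumed to come from the unsupervised first stage (for example, a Doc2Vec-style model); all names are illustrative and this is a sketch, not the patented implementation:

```python
import math

def train_binary_classifier(features, labels, lr=0.5, epochs=200):
    """Logistic regression by per-sample gradient descent:
    one weight per feature dimension plus a bias."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    """1 = target category label, 0 = non-target category label."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0
```

On linearly separable feature vectors the learned decision boundary falls between the two label groups, so documents near either extreme are classified accordingly.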
  2. The method according to claim 1, characterized in that learning, with the unsupervised learning algorithm, the feature vectors of the plurality of training documents based on the contexts of the words in the plurality of training documents, the vectors of the words, and the identifiers of the plurality of training documents comprises:
    learning, with an unsupervised learning algorithm, the feature vector of each word in each training document based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
    fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
  3. The method according to claim 2, characterized in that learning, with the unsupervised learning algorithm, the feature vector of each word in each training document based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document comprises:
    segmenting each training document with a word segmentation tool to obtain the words in each training document;
    for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and learning the feature vector of each word in the training document with the unsupervised learning algorithm.
  4. The method according to claim 3, characterized in that, for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and learning the feature vector of each word in the training document with the unsupervised learning algorithm comprises:
    for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and training an initial neural network model with the unsupervised learning algorithm;
    obtaining the feature vector of each word in each training document from the trained model parameters of the initial neural network model.
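Claims 3 and 4 describe a setup in the spirit of PV-DM (the "distributed memory" paragraph-vector model): context words plus a document identifier form the input, the current word is the prediction target, and the trained input-layer parameters serve as the word (and document) feature vectors. The following is a toy sketch under those assumptions, using a full softmax output and no negative sampling (both simplifications of real implementations):

```python
import math
import random

def train_pv_dm(docs, dim=8, lr=0.1, epochs=50, window=1, seed=0):
    """Minimal PV-DM sketch: context words + a document-ID vector predict
    the current word via softmax; the trained embeddings are the feature vectors."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    word_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
    doc_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in docs]
    out_w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]

    def step(doc_id, ctx_ids, target):
        # hidden vector = mean of the document vector and context word vectors
        vecs = [doc_emb[doc_id]] + [word_emb[c] for c in ctx_ids]
        h = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
        scores = [sum(out_w[j][k] * h[k] for k in range(dim)) for j in range(V)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        probs = [e / Z for e in exps]
        loss = -math.log(probs[target])
        # softmax + cross-entropy gradients
        dh = [0.0] * dim
        for j in range(V):
            g = probs[j] - (1.0 if j == target else 0.0)
            for k in range(dim):
                dh[k] += g * out_w[j][k]
                out_w[j][k] -= lr * g * h[k]
        for v in vecs:  # gradient is shared through the mean
            for k in range(dim):
                v[k] -= lr * dh[k] / len(vecs)
        return loss

    losses = []
    for _ in range(epochs):
        total = 0.0
        for d, doc in enumerate(docs):
            for t, w in enumerate(doc):
                ctx = [widx[doc[j]]
                       for j in range(max(0, t - window),
                                      min(len(doc), t + window + 1))
                       if j != t]
                total += step(d, ctx, widx[w])
        losses.append(total)
    return word_emb, doc_emb, losses
```

After training, `word_emb` holds the per-word feature vectors recovered from the model parameters, as in claim 4, and `doc_emb` holds one vector per document identifier.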
  5. The method according to any one of claims 1 to 4, characterized in that the dimension of the feature vectors is smaller than the total number of the plurality of training documents.
  6. The method according to any one of claims 1 to 5, characterized by further comprising, after the document classification model is obtained by training with the binary classification algorithm:
    taking a plurality of documents not marked with the classification labels as unlabeled documents, and predicting, based on the unlabeled documents and a preset classification model, the predicted classification labels and the predicted classification probabilities of the plurality of unlabeled documents, wherein the preset classification model is capable of predicting classification probabilities, and the predicted classification label is the target category label or the non-target category label;
    screening the plurality of unlabeled documents whose predicted classification probabilities are higher than a preset probability threshold to obtain a plurality of new training documents;
    training the document classification model again with the binary classification algorithm based on the feature vectors and the predicted classification labels of the plurality of new training documents.
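The self-training loop of claim 6 hinges on the screening step: keep only the unlabeled documents whose predicted classification probability clears the preset threshold, then retrain on them with their predicted labels. A minimal sketch, assuming `predict_proba(doc)` returns the probability of the target category (an illustrative interface, not from the patent):

```python
def select_new_training_docs(unlabeled, predict_proba, threshold=0.9):
    """Pseudo-labeling step: keep only unlabeled documents whose predicted
    class probability exceeds the preset probability threshold."""
    new_docs, new_labels = [], []
    for doc in unlabeled:
        p = predict_proba(doc)
        # confidence is taken for the predicted class, target (1) or not (0)
        label, conf = (1, p) if p >= 0.5 else (0, 1.0 - p)
        if conf > threshold:
            new_docs.append(doc)
            new_labels.append(label)
    return new_docs, new_labels
```

The returned documents and their predicted labels would then be fed back into the binary training step to retrain the document classification model.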
  7. The method according to any one of claims 1 to 6, characterized by further comprising:
    adjusting the ratio of the target category label to the non-target category label among the plurality of training documents or the new training documents according to a preset positive-to-negative sample ratio.
  8. A method for document classification, characterized by using a document classification model trained by the method for training a document classification model according to any one of claims 1 to 7, the method comprising:
    learning, with an unsupervised learning algorithm, the feature vector of a document to be classified based on the contexts of the words in the document to be classified, the vectors of the words, and the identifier of the document to be classified;
    inputting the feature vector of the document to be classified into the document classification model for document classification.
  9. An apparatus for training a document classification model, characterized in that it comprises:
    a first learning unit, configured to learn, with an unsupervised learning algorithm, the feature vectors of a plurality of training documents based on the contexts of the words in the plurality of training documents, the vectors of the words, and the identifiers of the plurality of training documents;
    a training unit, configured to train a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, wherein the classification label is a target category label or a non-target category label.
  10. An apparatus for document classification, characterized by using a document classification model trained by the method for training a document classification model according to any one of claims 1 to 7, the apparatus comprising:
    a second learning unit, configured to learn, with an unsupervised learning algorithm, the feature vector of a document to be classified based on the contexts of the words in the document to be classified, the vectors of the words, and the identifier of the document to be classified;
    a document classification unit, configured to input the feature vector of the document to be classified into the document classification model for document classification.
PCT/CN2020/097869 2019-09-24 2020-06-24 Method for training document classification model, and related apparatus WO2021057133A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910907014.4A CN112632269A (en) 2019-09-24 2019-09-24 Method and related device for training document classification model
CN201910907014.4 2019-09-24

Publications (1)

Publication Number Publication Date
WO2021057133A1 true WO2021057133A1 (en) 2021-04-01

Family

ID=75165529

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097869 WO2021057133A1 (en) 2019-09-24 2020-06-24 Method for training document classification model, and related apparatus

Country Status (2)

Country Link
CN (1) CN112632269A (en)
WO (1) WO2021057133A1 (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202010A (en) * 2016-07-12 2016-12-07 重庆兆光科技股份有限公司 The method and apparatus building Law Text syntax tree based on deep neural network
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
WO2019035765A9 (en) * 2017-08-14 2019-03-21 Dathena Science Pte. Ltd. Methods, machine learning engines and file management platform systems for content and context aware data classification and security anomaly detection
CN109635107A (en) * 2018-11-19 2019-04-16 北京亚鸿世纪科技发展有限公司 The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source
CN109697285A (en) * 2018-12-13 2019-04-30 中南大学 Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11769072B2 (en) * 2016-08-08 2023-09-26 Adobe Inc. Document structure extraction using machine learning
CN106412563A (en) * 2016-09-30 2017-02-15 珠海市魅族科技有限公司 Image display method and apparatus
CN110019777B (en) * 2017-09-05 2022-08-19 腾讯科技(深圳)有限公司 Information classification method and equipment
CN109753567A (en) * 2019-01-31 2019-05-14 安徽大学 A kind of file classification method of combination title and text attention mechanism
CN110084275A (en) * 2019-03-29 2019-08-02 广州思德医疗科技有限公司 A kind of choosing method and device of training sample


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220366301A1 (en) * 2021-05-11 2022-11-17 Sap Se Model-independent confidence value prediction machine learned model
CN113468326A (en) * 2021-06-16 2021-10-01 北京明略软件系统有限公司 Method and device for determining document classification
CN113569953A (en) * 2021-07-29 2021-10-29 中国工商银行股份有限公司 Training method and device of classification model and electronic equipment
CN114817538A (en) * 2022-04-26 2022-07-29 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114817538B (en) * 2022-04-26 2023-08-08 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN115878793A (en) * 2022-05-25 2023-03-31 北京中关村科金技术有限公司 Multi-label document classification method and device, electronic equipment and medium
CN115878793B (en) * 2022-05-25 2023-08-25 北京中关村科金技术有限公司 Multi-label document classification method, device, electronic equipment and medium
CN115292498A (en) * 2022-08-19 2022-11-04 北京华宇九品科技有限公司 Document classification method, system, computer equipment and storage medium
WO2024139106A1 (en) * 2022-12-29 2024-07-04 上海智臻智能网络科技股份有限公司 Document representation model training method and apparatus, document representation method and apparatus, electronic device, and computer readable storage medium
CN115827876A (en) * 2023-01-10 2023-03-21 中国科学院自动化研究所 Method and device for determining unlabeled text and electronic equipment

Also Published As

Publication number Publication date
CN112632269A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
WO2021057133A1 (en) Method for training document classification model, and related apparatus
CN109471938B (en) Text classification method and terminal
US10235446B2 (en) Systems and methods for organizing data sets
Gui et al. Negative transfer detection in transductive transfer learning
Dhinakaran et al. App review analysis via active learning: reducing supervision effort without compromising classification accuracy
CN107004159B (en) Active machine learning
US20230052903A1 (en) System and method for multi-task lifelong learning on personal device with improved user experience
US20170344822A1 (en) Semantic representation of the content of an image
WO2015180622A1 (en) Method and apparatus for determining categorical attribute of queried word in search
Paramesh et al. Classifying the unstructured IT service desk tickets using ensemble of classifiers
US20230214679A1 (en) Extracting and classifying entities from digital content items
CN110598869B (en) Classification method and device based on sequence model and electronic equipment
CN112528010A (en) Knowledge recommendation method and device, computer equipment and readable storage medium
Sun et al. Active learning SVM with regularization path for image classification
Murty et al. Dark web text classification by learning through SVM optimization
Manne et al. Text categorization with K-nearest neighbor approach
CN111061870B (en) Article quality evaluation method and device
Alabdulkarim et al. Exploring Sentiment Analysis on Social Media Texts
Bahrami et al. Automatic image annotation using an evolutionary algorithm (IAGA)
Desale et al. Fake review detection with concept drift in the data: a survey
CN109284376A (en) Cross-cutting news data sentiment analysis method based on domain-adaptive
Akujuobi et al. Mining top-k popular datasets via a deep generative model
Zieba et al. Beta-boosted ensemble for big credit scoring data
Renuse et al. Multi label learning and multi feature extraction for automatic image annotation
Gebeyehu et al. A two step data mining approach for amharic text classification

Legal Events

121 — EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20869791; Country of ref document: EP; Kind code of ref document: A1)

NENP — Non-entry into the national phase (Ref country code: DE)

122 — EP: PCT application non-entry in European phase (Ref document number: 20869791; Country of ref document: EP; Kind code of ref document: A1)