WO2021057133A1 - Method for training document classification model, and related apparatus - Google Patents

Method for training document classification model, and related apparatus

Info

Publication number
WO2021057133A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
training
classification
documents
word
Prior art date
Application number
PCT/CN2020/097869
Other languages
French (fr)
Chinese (zh)
Inventor
任卓
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司 filed Critical 北京国双科技有限公司
Publication of WO2021057133A1 publication Critical patent/WO2021057133A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Definitions

  • This application relates to the field of data processing technology, and in particular to a method and related device for training a document classification model.
  • Domain document classification usually relies on classic methods such as the bag-of-words model and TF-IDF weight calculation to extract the classification features of domain documents.
  • These classification features focus on the word frequency of the words in a domain document and ignore the order and context of the words, so they lack generality; a document classification model trained on them is prone to over-fitting, and the trained model has a poor practical classification effect on domain documents that are not marked with a classification label. That is to say, the current method of extracting classification features from domain documents leads to low classification accuracy in the trained document classification model.
  • In view of this, the present application provides a method and related apparatus for training a document classification model, so that the trained document classification model has a better practical classification effect on documents without classification labels, thereby improving the classification accuracy of the document classification model.
  • An embodiment of the present application provides a method for training a document classification model, which includes:
  • learning the feature vectors of multiple training documents by using an unsupervised learning algorithm, based on the context of the words in the multiple training documents, the vectors of the words, and the identifiers of the multiple training documents; and
  • training a document classification model by using a binary classification algorithm based on the feature vectors and classification labels of the multiple training documents, where each classification label is a target category label or a non-target category label.
  • Optionally, learning the feature vectors of the multiple training documents based on the context of the words in the multiple training documents, the vectors of the words, and the identifiers of the multiple training documents by using an unsupervised learning algorithm includes:
  • learning the feature vector of each of the words in each of the training documents by using an unsupervised learning algorithm, based on the context of each of the words in each of the training documents, the vector of each of the words, and the identifier of that training document; and
  • fusing the feature vectors of the words in each of the training documents to obtain the feature vector of each of the training documents.
  • Optionally, learning the feature vector of each of the words in each of the training documents by using an unsupervised learning algorithm includes:
  • for each training document, taking the context of each of the words and the identifier of the training document as input and the vector of each of the words as output, and learning the feature vector of each of the words in that training document by using an unsupervised learning algorithm.
  • Optionally, taking the context of each of the words and the identifier of the training document as input and the vector of each of the words as output, and learning the feature vector of each of the words by using an unsupervised learning algorithm includes: training an initial neural network model, and obtaining the feature vector of each of the words from the model parameters of the trained model.
  • Optionally, the dimension of the feature vector is smaller than the total number of the multiple training documents.
  • Optionally, the method further includes:
  • taking multiple documents that are not marked with the classification labels as unlabeled documents, and predicting the predicted classification labels and predicted classification probabilities of the multiple unlabeled documents based on the unlabeled documents and a preset classification model, where the preset classification model has a function of predicting classification probability, and each predicted classification label is the target category label or the non-target category label;
  • screening the unlabeled documents whose predicted classification probability is higher than a preset probability threshold to obtain multiple new training documents, and retraining the document classification model by using the binary classification algorithm based on the feature vectors and predicted classification labels of the new training documents.
  • Optionally, it also includes: adjusting the proportion of target category labels and non-target category labels in the multiple training documents or the new training documents based on a preset ratio of positive and negative samples.
  • An embodiment of the present application provides a method for document classification using a document classification model trained by the method for training a document classification model according to any one of the above-mentioned first aspects, and the method includes:
  • learning the feature vector of the document to be classified by using an unsupervised learning algorithm, based on the context of the words in the document to be classified, the vectors of the words, and the identifier of the document to be classified; and
  • inputting the feature vector of the document to be classified into the document classification model for document classification.
  • An embodiment of the present application provides a device for training a document classification model, and the device includes:
  • the first learning and obtaining unit, configured to learn the feature vectors of the multiple training documents by using an unsupervised learning algorithm based on the context of the words in the multiple training documents, the vectors of the words, and the identifiers of the multiple training documents; and
  • the training and obtaining unit, configured to train a document classification model by using a binary classification algorithm based on the feature vectors and classification labels of the multiple training documents, where each classification label is a target category label or a non-target category label.
  • An embodiment of the present application further provides an apparatus for document classification using a document classification model trained by the method for training a document classification model according to any one of the above-mentioned first aspects, and the apparatus includes:
  • the second learning and obtaining unit is configured to use an unsupervised learning algorithm to learn and obtain the feature vector of the document to be classified based on the context of the word in the document to be classified, the vector of the word, and the identification of the document to be classified;
  • the document classification unit is used to input the feature vector of the document to be classified into the document classification model for document classification.
  • In summary, the feature vectors of the multiple training documents are obtained by an unsupervised learning algorithm based on the context of the words in the training documents, the vectors of the words, and the identifiers of the documents;
  • then, based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm is used to train a document classification model, where each classification label is a target category label or a non-target category label.
  • Because the feature vector of each training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document,
  • the feature vectors of the training documents generalize better; the trained document classification model therefore has a better practical classification effect on documents without classification labels, thereby improving the classification accuracy of the document classification model.
  • FIG. 1 shows a schematic diagram of a system framework involved in an application scenario in an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of a method for training a document classification model provided by an embodiment of the present application
  • FIG. 3 shows a schematic flowchart of another method for training a document classification model provided by an embodiment of the present application
  • FIG. 4 shows a schematic flowchart of a method for document classification according to an embodiment of the present application
  • FIG. 5 shows a schematic structural diagram of a device for training a document classification model provided by an embodiment of the present application
  • FIG. 6 shows a schematic structural diagram of a document classification device provided by an embodiment of the present application.
  • The classification of documents has great application value in document query, document cluster management, and document recommendation. Determining a document's classification is an upstream task in these areas: it provides data support for downstream document processing tasks, and inaccurate classification degrades the subsequent document processing.
  • In the embodiments of the present application, an unsupervised learning algorithm first obtains the feature vectors of the multiple training documents; then, based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm trains a document classification model, where each classification label is a target category label or a non-target category label. The context of the words in a training document and the identifier of the document serve as input, and the vector of each word as output.
  • Because the feature vector of each training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document,
  • the feature vectors generalize better; the trained document classification model therefore classifies unlabeled documents more effectively, thereby improving the classification accuracy of the document classification model.
  • For example, after a user reads a certain document, other documents of the same category can be recommended to the user.
  • The solution of the present application can accurately classify the massive number of documents on the Internet in advance, find among them target documents in the same category as the document the user has read, and recommend those target documents to the user, so that the pushed documents match the user's preferences more accurately.
  • The present application can also cluster and manage papers on a paper website. Its application fields are numerous; while accurately classifying documents in these fields, it also avoids the manpower and material cost of manually extracting keywords.
  • The scenario includes a terminal 101 and a processor 102, where the terminal 101 may be a PC or a mobile terminal such as a mobile phone or a tablet.
  • The user selects multiple training documents through the terminal 101 and sends them to the processor 102; the processor 102 obtains the feature vectors of the multiple training documents by using the first step of the embodiment of the present application, and then obtains the document classification model by using the second step of the embodiment of the present application.
  • FIG. 2 shows a schematic flowchart of a method for training a document classification model in an embodiment of the present application.
  • the method may include the following steps, for example:
  • Step 201: Based on the context of the words in the multiple training documents, the vectors of the words, and the identifiers of the multiple training documents, use an unsupervised learning algorithm to learn the feature vectors of the multiple training documents.
  • In the embodiments of the present application, a document refers to a domain document, especially an oil and gas domain document;
  • correspondingly, a training document refers to a training domain document, especially a training document in the oil and gas domain.
  • When classic methods such as the bag-of-words model and TF-IDF weight calculation are used to extract classification features from domain documents, only the word frequency of the words is considered, while the order and context of the words in the domain documents are ignored, so the extracted
  • classification features lack generality. A document classification model trained with such features therefore overfits easily, and its classification effect is poor when it is actually used to predict and classify domain documents without classification labels.
  • In this embodiment, the unsupervised learning algorithm learns based on the context of the words in the training document, the word vectors, and the identifier of the training document.
  • What it directly obtains is the feature vector of each word in the training document;
  • these word feature vectors then need to be fused to obtain the feature vector of the training document. Therefore, in an optional implementation manner of this embodiment, step 201 may include the following steps, for example:
  • Step A: Based on the context of each word in each training document, the vector of each word, and the identifier of the corresponding training document, use an unsupervised learning algorithm to learn the feature vector of each word in each training document.
  • When step A is implemented, first the words in each training document must be obtained, so that the context of each word and the vector of each word are defined; the words are obtained by performing word segmentation on the training document. Then, because unsupervised learning here actually learns the context of each word and the correlation between contexts within the same training document, the context of each word in the training document and the identifier of the training document are taken as input and the vector of each word as output, and unsupervised learning yields the feature vector of each word in the training document. Therefore, in an optional implementation manner of this embodiment, step A may include the following steps, for example:
  • Step A1: Use a word segmentation tool to segment each training document to obtain the words in each training document.
  • When step A1 is implemented for training domain documents, a domain-specific professional dictionary is usually introduced into the word segmentation tool first; for example, training documents in the oil and gas domain are segmented with a word segmentation tool combined with an oil and gas professional dictionary.
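  • As an illustrative sketch only (the patent does not prescribe a particular segmentation algorithm), a minimal forward-maximum-matching segmenter shows how a domain dictionary steers word segmentation; real systems would use a mature tool such as jieba with a loaded user dictionary.

```python
def segment(text, dictionary, max_len=5):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary entry that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for size in range(min(max_len, len(text) - i), 1, -1):
            cand = text[i:i + size]
            if cand in dictionary:
                match = cand
                break
        words.append(match)
        i += len(match)
    return words
```

A richer domain dictionary produces longer, more specific matches, which is why an oil and gas professional dictionary improves segmentation of oil and gas documents.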
  • Step A2: For each training document, take the context of each word and the identifier of the training document as input and the vector of each word as output, and use an unsupervised learning algorithm to learn the feature vector of each word in that training document.
  • When step A2 is implemented, an initial neural network model is actually set in advance, for example a single-hidden-layer initial neural network model, and the initial neural network model includes initialized model parameters.
  • Training the initial neural network model with an unsupervised learning algorithm actually trains these initialized model parameters; after training on each word, the feature vector of each word can be obtained from the model parameters of the trained initial neural network model. Therefore, in an optional implementation manner of this embodiment, step A2 may include, for example:
  • Step A21: For each training document, take the context of each word and the identifier of the training document as input and the vector of each word as output, and train an initial neural network model by using an unsupervised learning algorithm;
  • Step A22: Obtain the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
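  • The input/output arrangement of steps A21 and A22 can be sketched as follows. This pure-Python fragment, an illustrative assumption rather than the patent's own code, builds the (context words + document identifier) → target-word pairs that a PV-DM-style model such as doc2vec consumes; `window` is a hypothetical hyperparameter.

```python
def make_training_pairs(doc_id, tokens, window=2):
    """For each position, pair the surrounding context words plus the
    document identifier (input) with the word at that position (output)."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append({"doc_id": doc_id, "context": left + right, "target": target})
    return pairs
```

Because every pair carries the document identifier alongside the word context, the network learns a vector per document as well as per word.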
  • Step B: Fuse the feature vectors of the words in each training document to obtain the feature vector of each training document.
  • When step B is implemented, a vector fusion formula can be preset as the preset vector fusion formula; the feature vectors of the words in each training document are substituted into the preset vector fusion formula, and the fused feature vector serves as the feature vector of that training document.
  • Other specific implementation manners may also be used to perform step B, as long as the feature vectors of the words are fused.
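  • One simple choice of preset vector fusion formula (an assumption for illustration; the patent leaves the formula open) is the element-wise average of the word feature vectors:

```python
def fuse_by_average(word_vectors):
    """Element-wise mean of the word feature vectors of one document."""
    dim = len(word_vectors[0])
    return [sum(v[j] for v in word_vectors) / len(word_vectors) for j in range(dim)]
```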
  • The dimension of the feature vector can be set in advance. Since it should not be too large, it can be set based on the total number of the multiple training documents; generally, it is set smaller than that total. In particular, when the dimension of the feature vector is much smaller than the total number of training documents, that is, when the difference between the total number of training documents and the dimension of the feature vector exceeds a preset difference, the dimension is greatly reduced and the feature vector of a document is obtained more efficiently. Therefore, in an optional implementation manner of this embodiment, the dimension of the feature vector is smaller than the total number of the multiple training documents.
  • The unsupervised learning algorithm in step 201 may be a common doc2vec algorithm or word2vec algorithm, where the doc2vec algorithm is an extension of the word2vec algorithm.
  • Step 202 Based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm is used to train to obtain a document classification model, where the classification labels are target category labels or non-target category labels.
  • When step 202 is implemented, the feature vectors of the training documents obtained in step 201 are used to train the document classification model. Each training document must be marked with a target category label or a non-target category label as its classification label, and a binary classification algorithm is applied to the feature vectors and classification labels of the multiple training documents
  • to train a binary classification model, which serves as the document classification model.
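  • As a hedged sketch of such a binary classifier (logistic regression, which the description later names in connection with step 302, is one algorithm with a probability output; the toy vectors and hyperparameters below are assumptions), a minimal pure-Python implementation might look like:

```python
import math

def train_logistic(features, labels, lr=0.5, epochs=200):
    """Gradient descent for logistic regression.
    labels: 1 = target category label, 0 = non-target category label."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                w[j] -= lr * (p - y) * x[j]
            b -= lr * (p - y)
    return w, b

def predict_proba(w, b, x):
    """Predicted probability that x belongs to the target category."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy 2-D vectors standing in for fused document feature vectors.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

The probability output is what later makes the self-learning screening of steps 303 to 305 possible.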
  • the target category label may be, for example, various category labels under a professional label system such as "exploration” category label, "development” category label, "drilling” category label, "logging” category label, or "construction” category label.
  • Training documents marked with a target category label are taken as positive samples, and training documents marked with a non-target category label are taken as negative samples. Because the documents marked with the target category label may account for only a small proportion of the multiple training documents while documents with non-target category labels account for a large proportion, the positive and negative samples can be imbalanced, which affects the training effect of the document classification model to a certain extent. Therefore, in this embodiment of the application, a reasonable ratio of positive and negative samples can be preset, and the proportion of target category labels to non-target category labels in the multiple training documents adjusted to match the preset ratio.
  • Optionally, the method may further include the step of adjusting the proportion of target category labels and non-target category labels in the multiple training documents, or in the new training documents, based on the preset ratio of positive and negative samples.
  • Specifically, the ratio of target category labels to non-target category labels in the multiple training documents can be adjusted based on an under-sampling method or an over-sampling method, so as to meet the preset ratio of positive and negative samples.
  • The under-sampling method samples the negative class in the multiple training documents to reduce its size, that is, it samples and reduces the number of training documents marked with non-target category labels;
  • the over-sampling method repeats positive samples in the multiple training documents to increase their number, that is, it duplicates training documents marked with target category labels.
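  • A minimal sketch of this adjustment, assuming a hypothetical `target_ratio` parameter expressing the preset positive-to-negative sample ratio:

```python
import random

def rebalance(positives, negatives, target_ratio=1.0, seed=0):
    """Meet a preset positive:negative ratio by under-sampling negatives
    when they are too many, or over-sampling (duplicating) positives."""
    rng = random.Random(seed)
    wanted_neg = int(len(positives) / target_ratio)
    if len(negatives) > wanted_neg:
        negatives = rng.sample(negatives, wanted_neg)        # under-sampling
    else:
        wanted_pos = int(len(negatives) * target_ratio)
        extra = [rng.choice(positives) for _ in range(wanted_pos - len(positives))]
        positives = positives + extra                        # over-sampling
    return positives, negatives
```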
  • In summary, the feature vectors of the multiple training documents are first obtained by an unsupervised learning algorithm;
  • then, a binary classification algorithm is applied to the feature vectors and classification labels of the training documents to train a document classification model, where each classification label is a target category label or a non-target category label.
  • The context of the words in a training document and the identifier of the document serve as input, and the vector of each word as output.
  • Because the feature vector of each training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document,
  • the feature vectors generalize better; the trained document classification model therefore classifies unlabeled documents more effectively, thereby improving the classification accuracy of the document classification model.
  • In this embodiment, a classification model with the function of predicting classification probability, whose predicted label is the target category label or the non-target category label, is obtained in advance as the preset classification model; applying the preset classification model to a batch of unlabeled documents yields the predicted classification labels and predicted classification probabilities of those documents.
  • A probability threshold is also set in advance as the preset probability threshold for evaluating the reliability of the predicted classification labels. Based on the preset probability threshold and the predicted classification probabilities, the documents whose predicted labels are highly reliable are filtered from the batch of unlabeled documents and used as new training documents for the next round of document classification model training; retraining then implements the re-learning of the document classification model in the above embodiment. The specific implementation of this second method for training a document classification model is described in detail below with reference to FIG. 3.
  • FIG. 3 shows a schematic flowchart of another method for training a document classification model in an embodiment of the present application.
  • the method may include the following steps, for example:
  • Step 301 Based on the context of the words in the multiple training documents, the vectors of the words and the identifications of the multiple training documents, use an unsupervised learning algorithm to learn to obtain the feature vectors of the multiple training documents.
  • Step 302 Based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm is used to train to obtain a document classification model, where the classification labels are target category labels or non-target category labels.
  • Step 301 to step 302 are the same as step 201 to step 202; for their specific implementation, refer to the relevant description in the foregoing embodiment, which is not repeated here.
  • Step 303: Take multiple documents that are not marked with the classification labels as unlabeled documents, and predict the predicted classification labels and predicted classification probabilities of the multiple unlabeled documents based on the unlabeled documents and a preset classification model, where the preset classification model has the function of predicting classification probability, and each predicted classification label is the target category label or the non-target category label.
  • The preset classification model may be the document classification model obtained in step 302; in that case, the binary classification algorithm used in obtaining the document classification model in step 302 must itself have the function of predicting classification probability, for example the logistic regression algorithm. Continuing with the subsequent steps 304 to 305 then realizes the self-learning of the document classification model.
  • The preset classification model is not limited to the document classification model obtained in step 302; it can also be another document classification model, as long as the predicted classification labels it produces are target category labels or non-target category labels and it has the function of predicting classification probability.
  • Step 304: Screen the multiple unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain multiple new training documents.
  • When step 304 is implemented, after the unlabeled documents whose predicted classification probability is higher than or equal to the preset probability threshold are selected from the multiple unlabeled documents, expert review can additionally be included to confirm whether each predicted classification label matches the actual classification label, so as to improve the reliability of the predicted classification labels of the new training documents.
  • A new training document labeled with the target category label is used as a positive sample,
  • and a new training document labeled with a non-target category label is used as a negative sample.
  • Among the multiple new training documents there may also be an imbalance between positive and negative samples,
  • which likewise needs to be adjusted based on the preset ratio of positive and negative samples. Therefore, in an optional implementation manner of this embodiment, the method may further include the step of adjusting the proportion of target category labels and non-target category labels in the multiple new training documents based on a preset ratio of positive and negative samples.
  • This resolves the imbalance of positive and negative samples in the multiple new training documents and allows the positive samples to be trained more fully during document classification model training, thereby improving the classification accuracy of the trained model.
  • Step 305 Based on the feature vectors and predicted classification labels of the multiple new training documents, use a binary classification algorithm to retrain the document classification model.
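  • Steps 303 to 305 amount to a self-training loop. The screening step can be sketched as follows, where `predict_proba` stands in for the preset classification model's probability output and the two-sided use of the threshold (keeping confident negatives as well as confident positives) is an illustrative assumption:

```python
def select_new_training_docs(unlabeled, predict_proba, threshold=0.9):
    """Keep unlabeled documents whose predicted label is reliable enough.
    predict_proba(doc) -> probability that doc belongs to the target category."""
    new_docs = []
    for doc in unlabeled:
        p = predict_proba(doc)
        if p >= threshold:
            new_docs.append((doc, "target"))        # confident positive
        elif p <= 1.0 - threshold:
            new_docs.append((doc, "non-target"))    # confident negative
    return new_docs
```

The selected documents, optionally confirmed by expert review, become the new training set on which the binary classifier is retrained.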
  • In summary, the feature vectors of the multiple training documents are first obtained by an unsupervised learning algorithm;
  • a binary classification algorithm is then applied to the feature vectors and classification labels of the documents to train a document classification model;
  • next, multiple documents without classification labels are treated as unlabeled documents, and their predicted classification labels and predicted classification probabilities are obtained based on the unlabeled documents and the preset classification model; the unlabeled documents whose predicted classification probability is higher than the preset probability threshold are screened to obtain multiple new training documents; and, based on the feature vectors and predicted classification labels of the multiple new training documents, the binary classification algorithm iteratively retrains the document classification model.
  • The context of the words in a training document and the identifier of the document serve as input, and the vector of each word as output.
  • Because the feature vector of each training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document,
  • the feature vectors generalize better and the trained document classification model has a better practical classification effect on unlabeled documents. In addition, the model re-learning scheme predicts unlabeled documents with the preset classification model, screens the unlabeled documents whose predicted classification labels are highly reliable, and expands them into new training documents to train the document classification model again, thereby improving the classification accuracy of the document classification model.
  • Step 401: Based on the context of the words in the document to be classified, the vectors of the words, and the identifier of the document to be classified, use an unsupervised learning algorithm to learn the feature vector of the document to be classified.
  • Step 402: Input the feature vector of the document to be classified into the document classification model for document classification.
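  • Steps 401 and 402 can be sketched together in pure Python. Averaging pre-learned word vectors stands in for the unsupervised inference of step 401 (a simplification of doc2vec-style inference), and the weights `w`, `b` stand for a trained binary classifier; all names are illustrative assumptions.

```python
import math

def classify_document(tokens, word_vectors, w, b, threshold=0.5):
    """Step 401: fuse the (pre-learned) word vectors of the document into a
    document feature vector; step 402: feed it to the binary classifier."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    feature = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    z = sum(wi * xi for wi, xi in zip(feature, w)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return ("target" if p >= threshold else "non-target"), p
```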
  • In summary, an unsupervised learning algorithm is used to obtain the feature vector of the document to be classified, and this feature
  • vector is input into the document classification model for document classification. The context of the words in the document to be classified and the identifier of the document serve as input, and the vector of each word as output.
  • Because the feature vector of the document to be classified is extracted by the unsupervised algorithm, taking into account the context of each word and the correlation between contexts within the same
  • document, the feature vector generalizes better; the document classification model therefore classifies the unlabeled document to be classified with higher accuracy, and the practical classification effect is better.
  • the device may specifically include, for example:
  • the first learning and obtaining unit 501 is configured to learn the feature vectors of the multiple training documents using an unsupervised learning algorithm, based on the context of the words in the multiple training documents, the vectors of those words, and the identifiers of the multiple training documents;
  • the training and obtaining unit 502 is configured to train a document classification model using a binary classification algorithm, based on the feature vectors and classification labels of the multiple training documents, where a classification label is either a target category label or a non-target category label.
  • the first learning obtaining unit 501 includes:
  • the learning and obtaining subunit is configured to learn the feature vector of each word in each training document using an unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
  • the fusion obtaining subunit is configured to fuse the feature vectors of the words in each training document to obtain the feature vector of that training document.
  • the learning acquisition subunit includes:
  • the word segmentation obtaining module is configured to segment each training document with a word segmentation tool to obtain the words in each training document;
  • the learning acquisition module is configured, for each training document, to take the context of each word and the identifier of the training document as input and the vector of each word as output, and to learn the feature vector of each word in the training document using an unsupervised learning algorithm.
  • the learning acquisition module includes:
  • the training sub-module is configured, for each training document, to take the context of each word and the identifier of the training document as input and the vector of each word as output, and to train an initial neural network model using an unsupervised learning algorithm;
  • the obtaining sub-module is configured to obtain the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
  • the dimension of the feature vector is smaller than the total number of training documents.
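The training and obtaining sub-modules above describe a paragraph-vector-style setup: context words plus a document identifier as input, the current word as output, with the document's feature vector read off the trained model parameters. A minimal sketch of that idea (a simplified PV-DM-like pass with a full softmax; all hyperparameters and details are assumptions, not the patent's specification) could look like:

```python
import numpy as np

def train_pv_dm(docs, dim=8, window=2, lr=0.05, epochs=40, seed=0):
    """Predict each word from the average of its context-word vectors and the
    document's own vector (the document id acts like an extra shared 'word')."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    W = rng.normal(0.0, 0.1, (len(vocab), dim))  # input word vectors
    D = rng.normal(0.0, 0.1, (len(docs), dim))   # one vector per document id
    O = rng.normal(0.0, 0.1, (dim, len(vocab)))  # softmax output weights
    for _ in range(epochs):
        for di, doc in enumerate(docs):
            for pos, word in enumerate(doc):
                ctx = [w2i[doc[j]]
                       for j in range(max(0, pos - window),
                                      min(len(doc), pos + window + 1))
                       if j != pos]
                if not ctx:
                    continue
                # Hidden layer: average of context-word vectors and doc vector.
                h = (W[ctx].mean(axis=0) + D[di]) / 2.0
                z = h @ O
                p = np.exp(z - z.max())
                p /= p.sum()                     # softmax over the vocabulary
                p[w2i[word]] -= 1.0              # gradient of cross-entropy
                gh = O @ p                       # gradient w.r.t. hidden layer
                O -= lr * np.outer(h, p)
                D[di] -= lr * gh / 2.0
                for c in ctx:
                    W[c] -= lr * gh / (2.0 * len(ctx))
    return D, W, w2i

# Five tiny, already word-segmented training documents (made-up content).
docs = [
    ["well", "logging", "data"],
    ["well", "core", "data"],
    ["drilling", "daily", "report"],
    ["drilling", "mud", "report"],
    ["seismic", "survey", "summary"],
]
D, W, w2i = train_pv_dm(docs, dim=4)  # dim kept below the number of documents
```

Each row of `D` is a document feature vector read directly from the trained model parameters; the choice `dim=4` follows the note that the feature-vector dimension stays below the total number of training documents.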
  • the prediction obtaining unit is configured to take multiple documents that are not marked with a classification label as unlabeled documents and, based on the unlabeled documents and a preset classification model, predict the predicted classification labels and predicted classification probabilities of the unlabeled documents; the preset classification model is capable of predicting classification probabilities, and a predicted classification label is either the target category label or the non-target category label;
  • the screening and obtaining unit is configured to screen out the unlabeled documents whose predicted classification probability is higher than a preset probability threshold as multiple new training documents;
  • the iterative training unit is configured to train the document classification model again using a binary classification algorithm, based on the feature vectors and predicted classification labels of the multiple new training documents;
  • the adjustment unit is configured to adjust the ratio of target category labels to non-target category labels among the multiple training documents or the new training documents, based on a preset ratio of positive to negative samples.
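The prediction, screening, and iterative training units together form one self-training round. A sketch, with `fit` and `predict_proba` as assumed stand-ins for any probability-producing binary classifier (the patent does not fix a particular one):

```python
import numpy as np

def self_train_round(fit, predict_proba, X_train, y_train, X_unlabeled, threshold=0.9):
    # Predict the unlabeled pool with the preset classifier.
    probs = predict_proba(X_unlabeled)                 # P(target) per document
    labels = (probs >= 0.5).astype(int)
    # Keep only documents whose predicted label is highly reliable.
    confident = np.maximum(probs, 1.0 - probs) >= threshold
    X_new = np.vstack([X_train, X_unlabeled[confident]])
    y_new = np.concatenate([y_train, labels[confident]])
    fit(X_new, y_new)                                  # train the model again
    return X_new, y_new

# Toy stand-ins: a fixed sigmoid "preset model" and a fit() that just records.
def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))

fitted = {}
def fit(X, y):
    fitted["n_samples"] = len(y)

X_train = np.array([[1.0], [-1.0]])
y_train = np.array([1, 0])
X_unlabeled = np.array([[2.0], [0.1], [-2.0]])
X_new, y_new = self_train_round(fit, predict_proba, X_train, y_train, X_unlabeled)
```

Only the two unlabeled documents whose predicted probability is confidently far from 0.5 are promoted to new training documents; the borderline one (probability near 0.57) is left in the pool.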
  • an unsupervised learning algorithm is used to obtain the feature vectors of the multiple training documents; then, based on the feature vectors and classification labels of the multiple training documents, a binary classification algorithm is used to train a document classification model, where a classification label is either a target category label or a non-target category label.
  • the device for document classification model training includes a processor and a memory.
  • the above-mentioned first learning acquisition unit and training acquisition unit are both stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
  • the processor contains the kernel, and the kernel calls the corresponding program unit from the memory.
  • the kernel can be set to one or more.
  • the memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • the device may specifically include, for example:
  • the second learning and obtaining unit 601 is configured to use an unsupervised learning algorithm to learn and obtain the feature vector of the document to be classified based on the context of the word in the document to be classified, the vector of the word and the identification of the document to be classified;
  • the document classification unit 602 is configured to input the feature vector of the document to be classified into the document classification model for document classification.
  • an unsupervised learning algorithm is used to obtain the feature vector of the document to be classified; the feature vector is then input into the document classification model for document classification. It can be seen that the context of each word in the document to be classified and the identifier of the document are used as input, and the vector of the word is used as the output;
  • the feature vector of the document to be classified is thus extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document, which improves the generality of the feature vector; the document classification model therefore classifies documents without classification labels more accurately, and its actual classification effect is better.
  • the document classification device includes a processor and a memory.
  • the second learning acquisition unit and the document classification unit are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
  • the processor contains the kernel, and the kernel calls the corresponding program unit from the memory.
  • One or more kernels can be set.
  • the memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • the embodiment of the present application provides a storage medium on which a program is stored, and when the program is executed by a processor, the method for training the document classification model or the method for document classification is realized.
  • An embodiment of the present application provides a device that includes a processor, a memory, and a program stored on the memory and capable of running on the processor, and the processor implements the following steps when the program is executed:
  • based on the context of the words in the multiple training documents, the vectors of those words, and the identifiers of the multiple training documents, learning the feature vectors of the multiple training documents using an unsupervised learning algorithm;
  • based on the feature vectors and classification labels of the multiple training documents, training a document classification model using a binary classification algorithm, where a classification label is either a target category label or a non-target category label.
  • learning the feature vectors of the multiple training documents using an unsupervised learning algorithm includes:
  • learning the feature vector of each word in each training document using an unsupervised learning algorithm;
  • fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
  • learning the feature vector of each word in each training document using an unsupervised learning algorithm includes:
  • for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and using an unsupervised learning algorithm to learn the feature vector of each word in the training document;
  • the above learning further includes training an initial neural network model with the same inputs and outputs, and obtaining the feature vector of each word from the trained model parameters.
  • the dimension of the feature vector is smaller than the total number of training documents.
  • the method further includes:
  • the preset classification model is capable of predicting classification probabilities, and a predicted classification label is either the target category label or the non-target category label;
  • the document classification model is trained again using a binary classification algorithm.
  • the devices described herein may be servers, PCs, tablets (PADs), mobile phones, and the like.
  • using the document classification model trained by the above method for document classification model training, the following steps are implemented:
  • the feature vector of the document to be classified is input into the document classification model for document classification.
  • This application also provides a computer program product, which when executed on a data processing device, is suitable for executing a program that initializes the following method steps:
  • the vectors of the words and the identifications of the multiple training documents learning to obtain the feature vectors of the multiple training documents by using an unsupervised learning algorithm
  • a binary classification algorithm is used to train a document classification model, and a classification label is either a target category label or a non-target category label.
  • learning the feature vectors of the multiple training documents using an unsupervised learning algorithm includes:
  • learning the feature vector of each word in each training document using an unsupervised learning algorithm;
  • fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
  • learning the feature vector of each word in each training document using an unsupervised learning algorithm includes:
  • for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and using an unsupervised learning algorithm to learn the feature vector of each word in the training document;
  • the above learning further includes training an initial neural network model with the same inputs and outputs, and obtaining the feature vector of each word from the trained model parameters.
  • the dimension of the feature vector is smaller than the total number of training documents.
  • the method further includes:
  • the preset classification model is capable of predicting classification probabilities, and a predicted classification label is either the target category label or the non-target category label;
  • the document classification model is trained again using a binary classification algorithm.
  • using the document classification model trained by the above method for document classification model training, the following steps are implemented:
  • the feature vector of the document to be classified is input into the document classification model for document classification.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
  • this application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a method for training a document classification model, and a related apparatus. The method comprises: on the basis of the context of a word in a document, a vector of the word, and an identifier of the document, obtaining a feature vector of the document by using an unsupervised learning algorithm; and taking documents labeled with classification tags as training documents, and on the basis of the feature vectors and classification tags of a plurality of training documents, obtaining a document classification model by training with a binary classification algorithm, wherein the classification tags are target category tags or non-target category tags. The feature vector of a document is thus extracted on the basis of an unsupervised algorithm, by taking the context of the word in the document and the identifier of the document as input and the vector of the word as output, and by taking into account the correlation between the context of the word and the other contexts in the same document. This improves the generality of the document's feature vector, so that the trained document classification model performs better in practice on documents that are not labeled with classification tags, thereby improving the classification accuracy of the document classification model.

Description

Method and related apparatus for document classification model training

This application claims priority to Chinese patent application No. 201910907014.4, filed with the Chinese Patent Office on September 24, 2019 and entitled "Method and related apparatus for document classification model training", the entire contents of which are incorporated herein by reference.
Technical field

This application relates to the field of data processing technology, and in particular to a method and related apparatus for training a document classification model.

Background

With the rapid development of knowledge engineering and the digitization of the oil and gas industry, accumulated knowledge has produced a massive number of oil and gas domain documents, and making full and efficient use of these documents has gradually become a focus of digital oilfield construction.

Making full and efficient use of oil and gas domain documents requires supporting rapid professional-knowledge queries and meeting application requirements such as knowledge retrieval, knowledge question answering, and information extraction. All of these depend on the classified management of oil and gas domain documents; that is, under a professional labeling system formulated by domain experts, massive numbers of oil and gas domain documents must be marked with reasonable category labels, for example, exploration, development, drilling, logging, construction, and many others.

At present, domain document classification usually uses classic methods such as the bag-of-words model and TF-IDF weighting to extract classification features. These features tend to emphasize the frequency of words in a domain document while ignoring word order and context, and therefore lack generality, so the trained document classification model is prone to overfitting; as a result, it performs poorly in practice on domain documents that are not marked with classification labels. In other words, the current way of extracting classification features from domain documents leads to low classification accuracy in the trained document classification model.
Summary of the invention

In view of the above problems, this application provides a method and related apparatus for training a document classification model, so that the trained document classification model performs better in practice on documents that are not marked with classification labels, thereby improving the classification accuracy of the document classification model.
In a first aspect, an embodiment of this application provides a method for training a document classification model, the method comprising:

learning the feature vectors of multiple training documents using an unsupervised learning algorithm, based on the context of the words in the training documents, the vectors of those words, and the identifiers of the training documents;

training a document classification model using a binary classification algorithm, based on the feature vectors and classification labels of the multiple training documents, where a classification label is either a target category label or a non-target category label.
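The second step of the first aspect does not name a specific binary classification algorithm; as one assumed possibility, plain logistic regression over the documents' feature vectors can be sketched as:

```python
import numpy as np

def train_binary_classifier(doc_vecs, labels, lr=0.1, epochs=500):
    """Logistic regression by gradient descent: one possible binary
    classification algorithm over document feature vectors."""
    X = np.asarray(doc_vecs, dtype=float)
    y = np.asarray(labels, dtype=float)   # 1 = target category, 0 = non-target
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                      # gradient of the cross-entropy loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Made-up 2-D feature vectors for four labeled training documents.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_binary_classifier(doc_vecs, labels)
preds = (1.0 / (1.0 + np.exp(-(np.asarray(doc_vecs) @ w + b))) >= 0.5)
```

In the method itself the feature vectors would come from the unsupervised learning step rather than being hand-written as here.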
Optionally, learning the feature vectors of the multiple training documents using an unsupervised learning algorithm, based on the context of the words in the training documents, the vectors of those words, and the identifiers of the training documents, includes:

learning the feature vector of each word in each training document using an unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;

fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.

Optionally, learning the feature vector of each word in each training document using an unsupervised learning algorithm, based on the context of each word, the vector of each word, and the identifier of the corresponding training document, includes:

segmenting each training document with a word segmentation tool to obtain the words in each training document;

for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and learning the feature vector of each word in the training document using an unsupervised learning algorithm.

Optionally, for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and learning the feature vector of each word in the training document using an unsupervised learning algorithm, includes:

for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and training an initial neural network model using an unsupervised learning algorithm;

obtaining the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
Optionally, the dimension of the feature vector is smaller than the total number of training documents.

Optionally, after training the document classification model using the binary classification algorithm, the method further includes:

taking multiple documents that are not marked with a classification label as unlabeled documents, and predicting the predicted classification labels and predicted classification probabilities of the unlabeled documents based on the unlabeled documents and a preset classification model, where the preset classification model is capable of predicting classification probabilities and a predicted classification label is either the target category label or the non-target category label;

screening out the unlabeled documents whose predicted classification probability is higher than a preset probability threshold as multiple new training documents;

training the document classification model again using a binary classification algorithm, based on the feature vectors and predicted classification labels of the multiple new training documents.
Optionally, the method further includes:

adjusting the ratio of target category labels to non-target category labels among the multiple training documents or the new training documents, based on a preset ratio of positive to negative samples.
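The ratio adjustment above can be realized, for example, by downsampling the majority class; the helper below is a hypothetical illustration using a preset positive-to-negative ratio of 1:1:

```python
import random

def balance_samples(docs, labels, pos_neg_ratio=1.0, seed=0):
    """Downsample the negative (non-target) class so that the number of
    positives over negatives matches the preset ratio."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(neg), round(len(pos) / pos_neg_ratio))
    idx = pos + rng.sample(neg, keep)
    rng.shuffle(idx)
    return [docs[i] for i in idx], [labels[i] for i in idx]

# 3 target documents against 9 non-target documents (made-up identifiers).
docs = [f"doc_{i}" for i in range(12)]
labels = [1, 1, 1] + [0] * 9
bal_docs, bal_labels = balance_samples(docs, labels, pos_neg_ratio=1.0)
```

Oversampling the minority class would serve equally well; the patent only requires that the resulting proportion match the preset positive/negative sample ratio.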
In a second aspect, an embodiment of this application provides a method for document classification using a document classification model trained by the method for training a document classification model according to any one of the first aspect, the method comprising:

learning the feature vector of the document to be classified using an unsupervised learning algorithm, based on the context of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;

inputting the feature vector of the document to be classified into the document classification model for document classification.
In a third aspect, an embodiment of this application provides an apparatus for training a document classification model, the apparatus comprising:

a first learning and obtaining unit, configured to learn the feature vectors of multiple training documents using an unsupervised learning algorithm, based on the context of the words in the training documents, the vectors of those words, and the identifiers of the training documents;

a training and obtaining unit, configured to train a document classification model using a binary classification algorithm, based on the feature vectors and classification labels of the multiple training documents, where a classification label is either a target category label or a non-target category label.

In a fourth aspect, an embodiment of this application provides an apparatus for document classification using a document classification model trained by the method for training a document classification model according to any one of the first aspect, the apparatus comprising:

a second learning and obtaining unit, configured to learn the feature vector of the document to be classified using an unsupervised learning algorithm, based on the context of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;

a document classification unit, configured to input the feature vector of the document to be classified into the document classification model for document classification.
Compared with the prior art, this application has at least the following advantages:

With the technical solutions of the embodiments of this application, first, the feature vectors of multiple training documents are obtained with an unsupervised learning algorithm, based on the context of the words in the training documents, the word vectors, and the identifiers of the training documents; then, a document classification model is trained with a binary classification algorithm based on the feature vectors and classification labels of the training documents, where a classification label is either a target category label or a non-target category label. The context of the words in a training document and the identifier of the document are taken as input, and the vector of each word as output, so the feature vector of the training document is extracted by an unsupervised algorithm that takes into account the context of each word and the correlation between contexts within the same document. This improves the generality of the training documents' feature vectors, so that the trained document classification model performs better in practice on documents that are not marked with classification labels, thereby improving the classification accuracy of the document classification model.

The above description is only an overview of the technical solution of this application. To make the technical means of this application clear enough to be implemented in accordance with the content of the specification, and to make the above and other purposes, features, and advantages of this application more apparent and understandable, specific embodiments of this application are set out below.
附图说明Description of the drawings
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本申请的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only used for the purpose of illustrating the preferred embodiments, and are not considered as a limitation to the application. Also, throughout the drawings, the same reference symbols are used to denote the same components. In the attached picture:
FIG. 1 is a schematic diagram of a system framework involved in an application scenario according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for training a document classification model according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another method for training a document classification model according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a document classification method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for training a document classification model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a document classification apparatus according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
As digitization advances in the oil and gas industry, making full and efficient use of the massive accumulated body of oil and gas domain documents requires assigning each document a reasonable category label under a professional labeling system, for example labels such as exploration, development, drilling, well logging, or construction. At present, classification features of domain documents are generally extracted with classic methods that focus on the word frequencies in the documents, such as the bag-of-words model or TF-IDF weighting. However, the inventors found through research that these methods ignore the order and context of the words in a domain document, so the extracted classification features lack generality; as a result, the trained document classification model is prone to overfitting and performs poorly when actually classifying domain documents that have not been labeled. That is, the current way of extracting classification features from domain documents yields a trained document classification model with low classification accuracy.
Document classification has great application value in fields such as document retrieval, document cluster management, and document recommendation. Determining a document's classification is an upstream task in these fields: it provides data support for downstream document processing tasks, so inaccurate classification further degrades subsequent processing.
To solve this problem, in the embodiments of the present application, first, feature vectors of multiple training documents are obtained with an unsupervised learning algorithm based on the contexts of the words in the training documents, the word vectors, and the identifiers of the training documents; then, a document classification model is obtained by training with a binary classification algorithm based on the feature vectors and classification labels of the training documents, where each classification label is either a target-category label or a non-target-category label. It can thus be seen that the contexts of the words in a training document and the identifier of the document serve as input, the word vectors serve as output, and the feature vector of the training document is extracted with an unsupervised algorithm, taking into account both the context of each word and the relationships among contexts within the same document. This improves the generality of the training documents' feature vectors, so that the trained document classification model performs better on unlabeled documents, thereby improving its classification accuracy. For example, in the field of document recommendation, after a user reads a document, other documents of the same category can be recommended to the user.
The solution of the present invention can classify the massive documents on the Internet accurately in advance, find among them target documents of the same category as the document the user is reading, and recommend those target documents to the user, so that the pushed documents match the user's preferences more accurately. As another example, the present invention can also cluster and manage papers on an academic paper website. The present invention has many fields of application; in these fields it classifies documents accurately while also avoiding the manpower and material costs of manually extracting keywords.
For example, one scenario of the embodiments of the present application may be the scenario shown in FIG. 1, which includes a terminal 101 and a processor 102. The terminal 101 may be a PC or another mobile terminal such as a mobile phone or a tablet. A user determines multiple training documents through the terminal 101 and sends them to the processor 102; the processor 102 obtains the feature vectors of the training documents using the first step of the implementations of the embodiments of the present application, and then obtains the document classification model using the second step.
It can be understood that, although the actions of the embodiments of the present application are described in the above application scenario as being performed by the processor 102, these actions may also be performed by the terminal 101, or partly by the terminal 101 and partly by the processor 102. The present application is not limited with respect to the executing entity, as long as the actions disclosed in the embodiments of the present application are performed.
It can be understood that the above scenario is only an example of a scenario provided by an embodiment of the present application, and the embodiments of the present application are not limited to this scenario.
Specific implementations of the method for training a document classification model and of the related apparatus in the embodiments of the present application are described in detail below by way of embodiments with reference to the accompanying drawings.
Exemplary Method
Referring to FIG. 2, a schematic flowchart of a method for training a document classification model in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the following steps:
Step 201: Based on the contexts of the words in multiple training documents, the vectors of the words, and the identifiers of the training documents, learn with an unsupervised learning algorithm to obtain feature vectors of the training documents.
It should be noted that, in the embodiments of the present application, a document refers to a domain document, in particular an oil and gas domain document; that is, a training document refers to a training domain document, in particular a training document in the oil and gas domain. Classic methods such as the bag-of-words model and TF-IDF weighting extract classification features of domain documents by focusing only on word frequency, ignoring the order and context of the words, so the extracted classification features lack generality; a document classification model trained on such features overfits easily and performs poorly when actually used to predict the categories of unlabeled domain documents. Therefore, in the embodiments of the present application, the context of each word in a document and the relationships among contexts within the same document need to be considered, and an unsupervised learning algorithm is used to learn from the contexts of the words in the training documents, the word vectors, and the document identifiers, so as to obtain training-document feature vectors with general applicability.
It should be noted that running the unsupervised learning algorithm on the contexts of the words, the word vectors, and the document identifiers actually yields directly the feature vector of each word in a training document; the feature vectors of the individual words must be fused to obtain the feature vector of the training document. Therefore, in an optional implementation of the embodiment of the present application, step 201 may include, for example, the following steps:
Step A: Based on the context of each word in each training document, the vector of each word, and the identifier of the corresponding training document, learn with an unsupervised learning algorithm to obtain the feature vector of each word in each training document.
In a specific implementation of step A, first, the individual words of a training document must be obtained before the context and vector of each word can be determined; the words of a training document are obtained by segmenting the document with a word segmentation tool. Then, since the unsupervised learning is in fact intended to learn the context of each word in a training document and the relationships among contexts within the same document, the context of each word and the identifier of the training document are taken as input, the vector of each word is taken as output, and unsupervised learning is performed to obtain the feature vector of each word in the training document. Therefore, in an optional implementation of the embodiment of the present application, step A may include, for example, the following steps:
Step A1: Segment each training document with a word segmentation tool to obtain the individual words of each training document.
In a specific implementation of step A1, for training domain documents it is usually necessary to first introduce a domain-specific professional dictionary to be used together with the word segmentation tool; for example, training documents in the oil and gas domain are segmented with a word segmentation tool combined with a professional oil and gas dictionary.
Step A2: For each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, learn with an unsupervised learning algorithm to obtain the feature vector of each word in the training document.
In a specific implementation of step A2, an initial neural network model is in fact set in advance, for example a single-hidden-layer initial neural network model with initialized model parameters. Taking the context of each word and the identifier of the document as input and the vector of each word as output, training this initial neural network model with an unsupervised learning algorithm actually means training the initialized model parameters of the model; after training is completed for each word, the feature vector of each word can be obtained from the trained model parameters. Therefore, in an optional implementation of the embodiment of the present application, step A2 may include, for example:
Step A21: For each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, train an initial neural network model with an unsupervised learning algorithm;
Step A22: Obtain the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
Step B: Fuse the feature vectors of the individual words in each training document to obtain the feature vector of each training document.
In a specific implementation of step B, a vector fusion formula may be set in advance as a preset vector fusion formula, and the feature vectors of the individual words of each training document are substituted into the preset vector fusion formula to obtain a fused feature vector as the feature vector of that training document. Of course, in the embodiments of the present application, other specific implementations may also be used to perform step B, as long as the feature vectors of the individual words are fused.
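As an illustrative, non-limiting sketch of step B (the application does not prescribe a particular fusion formula or programming language; the element-wise mean used here is one possible preset fusion formula chosen for illustration):

```python
def fuse_word_vectors(word_vectors):
    """Fuse the per-word feature vectors of one training document into a
    single document feature vector.

    The preset fusion formula here is the element-wise mean; any other
    fusion rule (sum, weighted average, ...) could be substituted, as the
    embodiment only requires that the word vectors be fused.
    """
    if not word_vectors:
        raise ValueError("document contains no word vectors")
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]

# Example: three 2-dimensional word feature vectors from one training document.
doc_vector = fuse_word_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The resulting document vector has the same dimensionality as the word vectors, so the dimensionality constraint discussed for the feature vectors carries over unchanged.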
It should also be noted that the dimensionality of the feature vectors can be set in advance. Considering that it should not be too large, it can be set based on the total number of training documents; generally, the dimensionality is set smaller than the total number of training documents. In particular, when the dimensionality is much smaller than the total number of training documents, that is, when the difference between the total number of training documents and the feature vector dimensionality exceeds a certain preset difference, the dimensionality of the feature vectors is greatly reduced and the efficiency of obtaining document feature vectors is improved. Therefore, in an optional implementation of the embodiment of the present application, the dimensionality of the feature vectors is smaller than the total number of training documents.
It should further be noted that the unsupervised learning algorithm in step 201 may be the common doc2vec algorithm or word2vec algorithm, where doc2vec is an extension of word2vec.
Step 202: Based on the feature vectors and classification labels of the multiple training documents, train with a binary classification algorithm to obtain a document classification model, where each classification label is a target-category label or a non-target-category label.
It should be noted that the training-document feature vectors obtained in step 201 are used to train the document classification model; each training document must be marked with a target-category label or a non-target-category label as its classification label. Training on the feature vectors and classification labels of the multiple training documents with a binary classification algorithm yields a binary classification model, which serves as the document classification model.
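Step 202 could be sketched, for illustration only, with a plain logistic regression trained by gradient descent (logistic regression is one binary classification algorithm the embodiments mention later as capable of predicting probabilities; the toy data and hyperparameters are assumptions, not part of the application):

```python
import math

def train_binary_classifier(features, labels, lr=0.5, epochs=200):
    """Train a logistic-regression binary classifier by gradient descent.

    features: document feature vectors from step 201; labels: 1 for the
    target-category label, 0 for the non-target-category label.
    """
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(target category)
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict_proba(w, b, x):
    """Predicted probability that document x belongs to the target category."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy 2-D feature vectors: target-category documents cluster high,
# non-target-category documents cluster low.
X = [[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.5, -2.5]]
y = [1, 1, 0, 0]
w, b = train_binary_classifier(X, y)
```

Because this classifier outputs a probability, it would also satisfy the requirement on the preset classification model used for re-learning in the second embodiment below.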
The target-category label may be, for example, any of the category labels under a professional labeling system, such as an "exploration" label, a "development" label, a "drilling" label, a "well logging" label, or a "construction" label.
It should also be noted that training documents marked with the target-category label serve as positive samples, and training documents marked with the non-target-category label serve as negative samples. Since the documents marked with the target-category label may account for only a small proportion of the training documents, while those marked with the non-target-category label account for a large proportion, the training documents suffer from an imbalance between positive and negative samples, which affects the training of the document classification model to some extent. Therefore, in the embodiments of the present application, a reasonable positive-to-negative sample ratio can be set in advance as a preset ratio, and the proportion of target-category labels to non-target-category labels among the training documents can be adjusted to match it. In an optional implementation of the embodiment of the present application, the method may further include, for example, the step of adjusting the ratio of the target-category labels to the non-target-category labels among the multiple training documents or the new training documents based on the preset positive-to-negative sample ratio.
This approach resolves the imbalance between positive and negative samples among the training documents, so that the positive samples are trained more thoroughly during document classification model training, thereby improving the classification accuracy of the trained document classification model.
In a specific implementation, the ratio of target-category labels to non-target-category labels among the training documents can be adjusted with an under-sampling method or an over-sampling method so as to satisfy the preset positive-to-negative sample ratio. The under-sampling method samples the negative examples among the training documents to reduce their number, that is, it samples the training documents marked with the non-target-category label down to a smaller number; the over-sampling method duplicates positive examples to increase their number, that is, it duplicates training documents marked with the target-category label to increase their count.
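Both rebalancing strategies can be sketched as follows (a hypothetical helper for illustration; the target ratio of 1.0 is an assumed preset value, not one fixed by the application):

```python
import random

def rebalance(positives, negatives, target_ratio=1.0, oversample=True, seed=0):
    """Adjust the positive/negative sample ratio toward a preset target_ratio.

    target_ratio is the desired positives-to-negatives ratio (1.0 = balanced).
    oversample=True duplicates positive samples (over-sampling);
    oversample=False samples the negatives down instead (under-sampling).
    """
    rng = random.Random(seed)
    if oversample:
        wanted = int(len(negatives) * target_ratio)
        extra = [rng.choice(positives) for _ in range(wanted - len(positives))]
        return positives + extra, negatives
    wanted = int(len(positives) / target_ratio)
    return positives, rng.sample(negatives, wanted)

# 3 target-labeled vs 12 non-target-labeled training documents.
pos = ["p1", "p2", "p3"]
neg = [f"n{i}" for i in range(12)]
p_over, n_over = rebalance(pos, neg, oversample=True)      # duplicate positives
p_under, n_under = rebalance(pos, neg, oversample=False)   # sample negatives down
```

Over-sampling keeps all 12 negatives and grows the positives to match; under-sampling keeps the 3 positives and shrinks the negatives to match.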
Through the various implementations provided in this embodiment, feature vectors of multiple training documents are obtained with an unsupervised learning algorithm based on the contexts of the words in the training documents, the word vectors, and the document identifiers; then a document classification model is obtained by training with a binary classification algorithm based on the feature vectors and classification labels of the training documents, where each classification label is a target-category label or a non-target-category label. It can thus be seen that, with the contexts of the words and the document identifier as input and the word vectors as output, the feature vector of a training document is extracted with an unsupervised algorithm, taking into account the context of each word and the relationships among contexts within the same document. This improves the generality of the training documents' feature vectors, so that the trained document classification model performs better on unlabeled documents, thereby improving its classification accuracy.
It should be noted that, in the face of massive numbers of documents, manually annotating large numbers of documents with classification labels consumes considerable manpower and material resources and wastes much time. Consequently, the number of documents without classification labels far exceeds the number of labeled documents; that is, the total number of training documents is not large, which likewise affects the training of the document classification model to some extent, so additional new training documents need to be obtained by augmentation from the unlabeled documents.
In practical applications, on the basis of the above embodiment, a classification model whose predicted classification labels are the target-category label or the non-target-category label and which is capable of predicting classification probabilities is obtained in advance as a preset classification model. Testing a batch of unlabeled documents with this preset classification model yields their predicted classification labels and predicted classification probabilities. By setting a probability threshold in advance as a preset probability threshold for assessing the reliability of the predicted classification labels, documents whose predicted labels are highly reliable can be selected from the batch of unlabeled documents based on the preset probability threshold and the predicted classification probabilities. These serve as new training documents for the next round of document classification model training, and retraining achieves re-learning of the document classification model of the above embodiment. Accordingly, a specific implementation of another method for training a document classification model in an embodiment of the present application is described in detail below with reference to FIG. 3.
Referring to FIG. 3, a schematic flowchart of another method for training a document classification model in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the following steps:
Step 301: Based on the contexts of the words in multiple training documents, the vectors of the words, and the identifiers of the training documents, learn with an unsupervised learning algorithm to obtain feature vectors of the training documents.
Step 302: Based on the feature vectors and classification labels of the multiple training documents, train with a binary classification algorithm to obtain a document classification model, where each classification label is a target-category label or a non-target-category label.
It should be noted that, in this embodiment of the present application, steps 301 and 302 are the same as steps 201 and 202; for their specific implementation, reference may be made to the relevant description of the above embodiment, which is not repeated here.
Step 303: Take multiple documents not yet marked with a classification label as unlabeled documents, and, based on the unlabeled documents and a preset classification model, predict to obtain the predicted classification labels and predicted classification probabilities of the unlabeled documents, where the preset classification model is capable of predicting classification probabilities and each predicted classification label is the target-category label or the non-target-category label.
The preset classification model may be the document classification model obtained in step 302; in that case, the binary classification algorithm used in step 302 to obtain the document classification model must be capable of predicting classification probabilities, for example a logistic regression algorithm, and continuing with the subsequent steps 304 and 305 achieves self-learning of that document classification model. Of course, the embodiments of the present application do not require the preset classification model to be the document classification model obtained in step 302; it may be another document classification model, as long as its predicted classification labels are the target-category label or the non-target-category label and it is capable of predicting classification probabilities.
Step 304: Select the unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain multiple new training documents.
It should be noted that, in a specific implementation of step 304, after the unlabeled documents whose predicted classification probability is greater than or equal to the preset probability threshold have been selected, expert review may also be introduced to confirm whether the predicted classification label matches the actual classification label, so as to improve the reliability of the predicted labels of the new training documents.
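The selection of step 304 can be sketched as a simple threshold filter (the threshold value of 0.9 and the tuple layout are illustrative assumptions; the application only requires some preset probability threshold):

```python
def select_new_training_documents(predictions, threshold=0.9):
    """Filter self-training candidates by predicted classification probability.

    predictions: (document_id, predicted_label, predicted_probability) tuples
    produced by the preset classification model on the unlabeled documents.
    Documents whose probability reaches the preset threshold become new
    training documents; expert review could then be applied to the result.
    """
    return [(doc_id, label) for doc_id, label, proba in predictions
            if proba >= threshold]

preds = [
    ("doc0", "target", 0.97),
    ("doc1", "non-target", 0.55),   # too uncertain -> discarded
    ("doc2", "non-target", 0.93),
]
new_training_docs = select_new_training_documents(preds)
```

The retained documents, with their predicted labels treated as classification labels, then feed the retraining of step 305.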
同理,基于上述实施例的说明可知,标记目标类别标签的新训练文档作为正样本,标记非目标类别标签的新训练文档作为负样本,对于多个新训练文档而言,同样可能存在正负样本不均衡问题,也需要基于预设正负样本比例进行调整。因此,在本申请实施例一种可选的实施方式中,例如还可以包括步骤:基于预设正负样本比例调整多个所述新训练文档中所述目标类别标签与所述非目标类别标签的比例。该方式解决多个新训练文档中正负样本不均衡问题,使得文档分类模型训练过程中对正样本训练更加充分,从而提高训练获得的文档分类模型的分类准确率。Similarly, based on the description of the above embodiment, it can be seen that a new training document labeled with a target category label is used as a positive sample, and a new training document labeled with a non-target category label is used as a negative sample. For multiple new training documents, there may also be positive and negative. The problem of sample imbalance also needs to be adjusted based on the preset positive and negative sample ratio. Therefore, in an optional implementation manner of the embodiment of the present application, for example, it may further include the step of adjusting the target category label and the non-target category label in the multiple new training documents based on a preset ratio of positive and negative samples. proportion. This method solves the problem of the imbalance of positive and negative samples in multiple new training documents, and makes the training of the positive samples in the document classification model training process more fully, thereby improving the classification accuracy of the document classification model obtained by training.
Step 305: Retrain the document classification model with the binary classification algorithm, based on the feature vectors and the predicted classification labels of the plurality of new training documents.
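Steps 303 to 305 together form a self-training (pseudo-labeling) round, which can be sketched as follows. The `fit`/`predict_proba` interface and the toy model are assumptions for illustration; the embodiment only requires a classifier that outputs a label and a probability.

```python
def self_training_round(model, labeled, unlabeled, threshold=0.9):
    """One expand-and-retrain round: pseudo-label confident unlabeled
    documents (step 304) and retrain on the enlarged set (step 305)."""
    new_docs = []
    for x in unlabeled:
        label, prob = model.predict_proba(x)
        if prob >= threshold:          # keep only confident predictions
            new_docs.append((x, label))
    X = [x for x, _ in labeled] + [x for x, _ in new_docs]
    y = [l for _, l in labeled] + [l for _, l in new_docs]
    model.fit(X, y)                    # retrain on old + new documents
    return model, new_docs

class ToyModel:
    """Stand-in classifier: confident 'positive' when the feature is large."""
    def fit(self, X, y):
        self.n = len(X)
    def predict_proba(self, x):
        return (1, 0.95) if x >= 5 else (0, 0.6)

m = ToyModel()
m, added = self_training_round(m, [(1, 0), (9, 1)], [2, 7, 8], threshold=0.9)
# documents 7 and 8 are added with pseudo-label 1; document 2 is too uncertain
```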
Through the various implementations provided in this embodiment, the feature vectors of a plurality of training documents are obtained with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents; a document classification model is trained with a binary classification algorithm based on the feature vectors and the classification labels of the training documents; a plurality of documents without classification labels are taken as unlabeled documents, and their predicted classification labels and predicted classification probabilities are obtained from the unlabeled documents and a preset classification model; the unlabeled documents whose predicted classification probability is higher than the preset probability threshold are screened out to obtain a plurality of new training documents; and the document classification model is iteratively retrained with the binary classification algorithm, based on the feature vectors and the predicted classification labels of the new training documents. As can be seen, the contexts of the words in a training document and the identifier of the document serve as input, the vectors of the words serve as output, and the feature vector of the training document is extracted with an unsupervised algorithm. This takes into account the context of each word and the correlations between contexts within the same document, which improves the generality of the feature vectors of the training documents, so that the trained document classification model has a better actual classification effect on documents without classification labels. Moreover, a model re-learning scheme is designed: the preset classification model predicts labels for the unlabeled documents, the unlabeled documents whose predicted classification labels are highly reliable are screened out and added as new training documents, and the document classification model is trained again, thereby improving its classification accuracy.
It should also be noted that, since the document classification model trained in the foregoing embodiment has a better actual classification effect on documents without classification labels, in practical applications this model is used to classify documents to be classified. Therefore, a specific implementation of a method for document classification in an embodiment of the present application is described in detail below with reference to FIG. 4.
Step 401: Based on the contexts of the words in a document to be classified, the vectors of those words, and the identifier of the document to be classified, learn the feature vector of the document to be classified with an unsupervised learning algorithm.
Step 402: Input the feature vector of the document to be classified into the document classification model for document classification.
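Step 402 can be sketched as follows, assuming a logistic model as the binary classifier; the embodiment does not fix the classifier type, so this choice and the toy weights are assumptions.

```python
import math

def predict(doc_vec, weights, bias=0.0):
    """Score a document feature vector with a trained binary classifier.
    Returns (label, probability); label 1 denotes the target category."""
    z = sum(w * x for w, x in zip(weights, doc_vec)) + bias
    prob = 1.0 / (1.0 + math.exp(-z))   # sigmoid of the linear score
    label = 1 if prob >= 0.5 else 0
    return label, prob

# Toy feature vector and trained weights (illustrative values)
label, prob = predict([0.8, -0.2], weights=[2.0, 1.0], bias=0.0)
# the document is assigned the target category label
```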
Through the various implementations provided in this embodiment, the feature vector of a document to be classified is obtained with an unsupervised learning algorithm, based on the contexts of the words in the document, the vectors of those words, and the identifier of the document; the feature vector of the document to be classified is then input into the document classification model for document classification. As can be seen, the contexts of the words in the document to be classified and the identifier of the document serve as input, the vectors of the words serve as output, and the feature vector of the document to be classified is extracted with an unsupervised algorithm. This takes into account the context of each word and the correlations between contexts within the same document, which improves the generality of the feature vector of the document to be classified; moreover, the document classification model achieves high classification accuracy on documents to be classified that carry no classification label, and thus a good actual classification effect.
Exemplary Apparatus
Referring to FIG. 5, a schematic structural diagram of an apparatus for training a document classification model in an embodiment of the present application is shown. In this embodiment, using the document classification model obtained through training in the foregoing embodiment, the apparatus may specifically include, for example:
a first learning and obtaining unit 501, configured to learn the feature vectors of a plurality of training documents with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents;
a training and obtaining unit 502, configured to train a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, where each classification label is a target category label or a non-target category label.
In an optional implementation of the embodiments of the present application, the first learning and obtaining unit 501 includes:
a learning and obtaining subunit, configured to learn the feature vector of each word in each training document with an unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
a fusion and obtaining subunit, configured to fuse the feature vectors of the words in each training document to obtain the feature vector of that training document.
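The fusion step can be sketched as follows. Element-wise averaging is an assumption: the embodiment only states that the per-word feature vectors are fused into one document feature vector, without fixing the fusion operation.

```python
def fuse_word_vectors(word_vecs):
    """Fuse the feature vectors of the words in one document into a
    single document feature vector by element-wise averaging."""
    dim = len(word_vecs[0])
    n = len(word_vecs)
    return [sum(v[i] for v in word_vecs) / n for i in range(dim)]

# Three 2-dimensional word vectors from one document (toy values)
doc_vec = fuse_word_vectors([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# each dimension averages to 2/3
```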
In an optional implementation of the embodiments of the present application, the learning and obtaining subunit includes:
a word segmentation and obtaining module, configured to segment each training document with a word segmentation tool to obtain the words in that training document;
a learning and obtaining module, configured to learn, for each training document, the feature vector of each word in the document with an unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output.
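The segmentation and context-extraction step can be sketched as follows. The embodiment does not name the word segmentation tool (for Chinese text, a tool such as jieba would typically produce the token list); a plain whitespace split stands in for it here, and the window size of 2 is an assumption.

```python
def contexts(tokens, window=2):
    """For each token position, pair the surrounding words (the model
    input) with the centre word (the model output)."""
    pairs = []
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, word))
    return pairs

# A real segmentation tool would tokenize the document text; a plain
# split stands in for it in this English toy example.
tokens = "the model classifies legal documents".split()
pairs = contexts(tokens, window=2)
# pairs[0] is (['model', 'classifies'], 'the')
```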
In an optional implementation of the embodiments of the present application, the learning and obtaining module includes:
a training submodule, configured to train, for each training document, an initial neural network model with an unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output;
an obtaining submodule, configured to obtain the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
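Reading the word feature vectors out of the trained model parameters can be sketched as follows. This assumes a paragraph-vector-style (PV-DM-like) model in which the learned input-embedding matrix holds one row per vocabulary word; the helper and variable names are illustrative, since the embodiment does not specify the network architecture.

```python
def word_vector(word, vocab, embedding_matrix):
    """Look up the feature vector of `word` from the trained parameters:
    `vocab` maps a word to its row index in the embedding matrix."""
    return embedding_matrix[vocab[word]]

# Trained parameters after unsupervised learning (toy values)
vocab = {"contract": 0, "court": 1}
embedding_matrix = [[0.1, 0.4, -0.2],
                    [0.3, -0.1, 0.5]]
vec = word_vector("court", vocab, embedding_matrix)
```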
In an optional implementation of the embodiments of the present application, the dimension of the feature vector is smaller than the total number of the plurality of training documents.
In an optional implementation of the embodiments of the present application, the apparatus further includes:
a prediction and obtaining unit, configured to take a plurality of documents without the classification labels as unlabeled documents and, based on the unlabeled documents and a preset classification model, predict the predicted classification labels and predicted classification probabilities of the unlabeled documents, where the preset classification model is capable of predicting classification probabilities, and each predicted classification label is the target category label or the non-target category label;
a screening and obtaining unit, configured to screen out the unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain a plurality of new training documents;
an iterative training unit, configured to retrain the document classification model with the binary classification algorithm based on the feature vectors and the predicted classification labels of the plurality of new training documents.
In an optional implementation of the embodiments of the present application, the apparatus further includes:
an adjustment unit, configured to adjust, based on a preset positive-to-negative sample ratio, the ratio of the target category label to the non-target category label among the plurality of training documents or the plurality of new training documents.
Through the various implementations provided in this embodiment, first, the feature vectors of a plurality of training documents are obtained with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents; then, a document classification model is trained with a binary classification algorithm based on the feature vectors and the classification labels of the training documents, where each classification label is a target category label or a non-target category label. As can be seen, the contexts of the words in a training document and the identifier of the document serve as input, the vectors of the words serve as output, and the feature vector of the training document is extracted with an unsupervised algorithm. This takes into account the context of each word and the correlations between contexts within the same document, which improves the generality of the feature vectors of the training documents and gives the trained document classification model a better actual classification effect on documents without classification labels, thereby improving the classification accuracy of the document classification model.
The apparatus for training a document classification model includes a processor and a memory. The first learning and obtaining unit, the training and obtaining unit, and the other units described above are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, the context of each word and the correlations between contexts within the same document are taken into account, which improves the generality of the feature vectors of the documents and gives the trained document classification model a better actual classification effect on documents without classification labels, thereby improving the classification accuracy of the document classification model.
The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
Referring to FIG. 6, a schematic structural diagram of an apparatus for document classification in an embodiment of the present application is shown. In this embodiment, using the document classification model obtained through training in the foregoing embodiment, the apparatus may specifically include, for example:
a second learning and obtaining unit 601, configured to learn the feature vector of a document to be classified with an unsupervised learning algorithm, based on the contexts of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;
a document classification unit 602, configured to input the feature vector of the document to be classified into the document classification model for document classification.
Through the various implementations provided in this embodiment, the feature vector of a document to be classified is obtained with an unsupervised learning algorithm, based on the contexts of the words in the document, the vectors of those words, and the identifier of the document; the feature vector of the document to be classified is then input into the document classification model for document classification. As can be seen, the contexts of the words in the document to be classified and the identifier of the document serve as input, the vectors of the words serve as output, and the feature vector of the document to be classified is extracted with an unsupervised algorithm. This takes into account the context of each word and the correlations between contexts within the same document, which improves the generality of the feature vector of the document to be classified; moreover, the document classification model achieves high classification accuracy on documents to be classified that carry no classification label, and thus a good actual classification effect.
The apparatus for document classification includes a processor and a memory. The second learning and obtaining unit, the document classification unit, and the other units described above are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, the context of each word and the correlations between contexts within the same document are taken into account, which improves the generality of the feature vector of the document to be classified; moreover, the document classification model achieves high classification accuracy on documents to be classified that carry no classification label, and thus a good actual classification effect.
The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present application provides a storage medium on which a program is stored; when the program is executed by a processor, the method for training a document classification model or the method for document classification is implemented.
An embodiment of the present application provides a device. The device includes a processor, a memory, and a program stored on the memory and executable on the processor. When executing the program, the processor implements the following steps:
learning the feature vectors of a plurality of training documents with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents;
training a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, where each classification label is a target category label or a non-target category label.
In an optional implementation of the embodiments of the present application, learning the feature vectors of the plurality of training documents with the unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents, includes:
learning the feature vector of each word in each training document with the unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
In an optional implementation of the embodiments of the present application, learning the feature vector of each word in each training document with the unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document, includes:
segmenting each training document with a word segmentation tool to obtain the words in that training document;
learning, for each training document, the feature vector of each word in the document with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output.
In an optional implementation of the embodiments of the present application, learning, for each training document, the feature vector of each word in the document with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output, includes:
training, for each training document, an initial neural network model with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output;
obtaining the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
In an optional implementation of the embodiments of the present application, the dimension of the feature vector is smaller than the total number of the plurality of training documents.
In an optional implementation of the embodiments of the present application, after the document classification model is obtained through training with the binary classification algorithm, the method further includes:
taking a plurality of documents without the classification labels as unlabeled documents and, based on the unlabeled documents and a preset classification model, predicting the predicted classification labels and predicted classification probabilities of the unlabeled documents, where the preset classification model is capable of predicting classification probabilities, and each predicted classification label is the target category label or the non-target category label;
screening out the unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain a plurality of new training documents;
retraining the document classification model with the binary classification algorithm based on the feature vectors and the predicted classification labels of the plurality of new training documents.
In an optional implementation of the embodiments of the present application, the method further includes:
adjusting, based on a preset positive-to-negative sample ratio, the ratio of the target category label to the non-target category label among the plurality of training documents or the plurality of new training documents.
The device herein may be a server, a PC, a PAD, a mobile phone, or the like.
Alternatively, when executing the program, the processor uses the document classification model trained by the method for training a document classification model to implement the following steps:
learning the feature vector of a document to be classified with an unsupervised learning algorithm, based on the contexts of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;
inputting the feature vector of the document to be classified into the document classification model for document classification.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
learning the feature vectors of a plurality of training documents with an unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents;
training a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, where each classification label is a target category label or a non-target category label.
In an optional implementation of the embodiments of the present application, learning the feature vectors of the plurality of training documents with the unsupervised learning algorithm, based on the contexts of the words in the training documents, the vectors of those words, and the identifiers of the training documents, includes:
learning the feature vector of each word in each training document with the unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
In an optional implementation of the embodiments of the present application, learning the feature vector of each word in each training document with the unsupervised learning algorithm, based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document, includes:
segmenting each training document with a word segmentation tool to obtain the words in that training document;
learning, for each training document, the feature vector of each word in the document with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output.
In an optional implementation of the embodiments of the present application, learning, for each training document, the feature vector of each word in the document with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output, includes:
training, for each training document, an initial neural network model with the unsupervised learning algorithm, taking the context of each word and the identifier of the training document as input and the vector of each word as output;
obtaining the feature vector of each word in each training document based on the model parameters of the trained initial neural network model.
In an optional implementation of the embodiments of the present application, the dimension of the feature vector is smaller than the total number of the plurality of training documents.
In an optional implementation of the embodiments of the present application, after the document classification model is obtained through training with the binary classification algorithm, the method further includes:
taking a plurality of documents without the classification labels as unlabeled documents and, based on the unlabeled documents and a preset classification model, predicting the predicted classification labels and predicted classification probabilities of the unlabeled documents, where the preset classification model is capable of predicting classification probabilities, and each predicted classification label is the target category label or the non-target category label;
screening out the unlabeled documents whose predicted classification probability is higher than the preset probability threshold to obtain a plurality of new training documents;
retraining the document classification model with the binary classification algorithm based on the feature vectors and the predicted classification labels of the plurality of new training documents.
In an optional implementation of the embodiments of the present application, the method further includes:
adjusting, based on a preset positive-to-negative sample ratio, the ratio of the target category label to the non-target category label among the plurality of training documents or the plurality of new training documents.
Alternatively, when executing the program, the processor uses the document classification model trained by the method for training a document classification model to implement the following steps:
learning the feature vector of a document to be classified with an unsupervised learning algorithm, based on the contexts of the words in the document to be classified, the vectors of those words, and the identifier of the document to be classified;
inputting the feature vector of the document to be classified into the document classification model for document classification.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-persistent memory in computer-readable media, for example random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

  1. A method for training a document classification model, characterized in that it comprises:
    learning, with an unsupervised learning algorithm, the feature vectors of a plurality of training documents based on the contexts of the words in the plurality of training documents, the vectors of the words, and the identifiers of the plurality of training documents;
    training a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, wherein the classification label is a target category label or a non-target category label.
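Claim 1's second stage, training a binary classifier on the document feature vectors, can be sketched with a minimal logistic-regression trainer. The feature vectors are assumed to come from the unsupervised first stage (for example, a Doc2Vec-style model); all names are illustrative and this is a sketch, not the patented implementation:

```python
import math

def train_binary_classifier(features, labels, lr=0.5, epochs=200):
    """Logistic regression by per-sample gradient descent:
    one weight per feature dimension plus a bias."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    """1 = target category label, 0 = non-target category label."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0
```

On linearly separable feature vectors the learned decision boundary falls between the two label groups, so documents near either extreme are classified accordingly.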
  2. The method according to claim 1, characterized in that learning, with the unsupervised learning algorithm, the feature vectors of the plurality of training documents based on the contexts of the words in the plurality of training documents, the vectors of the words, and the identifiers of the plurality of training documents comprises:
    learning, with an unsupervised learning algorithm, the feature vector of each word in each training document based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document;
    fusing the feature vectors of the words in each training document to obtain the feature vector of that training document.
  3. The method according to claim 2, characterized in that learning, with the unsupervised learning algorithm, the feature vector of each word in each training document based on the context of each word in the training document, the vector of each word, and the identifier of the corresponding training document comprises:
    segmenting each training document with a word segmentation tool to obtain the words in each training document;
    for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and learning the feature vector of each word in the training document with the unsupervised learning algorithm.
  4. The method according to claim 3, characterized in that, for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and learning the feature vector of each word in the training document with the unsupervised learning algorithm comprises:
    for each training document, taking the context of each word and the identifier of the training document as input and the vector of each word as output, and training an initial neural network model with the unsupervised learning algorithm;
    obtaining the feature vector of each word in each training document from the trained model parameters of the initial neural network model.
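Claims 3 and 4 describe a setup in the spirit of PV-DM (the "distributed memory" paragraph-vector model): context words plus a document identifier form the input, the current word is the prediction target, and the trained input-layer parameters serve as the word (and document) feature vectors. The following is a toy sketch under those assumptions, using a full softmax output and no negative sampling (both simplifications of real implementations):

```python
import math
import random

def train_pv_dm(docs, dim=8, lr=0.1, epochs=50, window=1, seed=0):
    """Minimal PV-DM sketch: context words + a document-ID vector predict
    the current word via softmax; the trained embeddings are the feature vectors."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    word_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
    doc_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in docs]
    out_w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]

    def step(doc_id, ctx_ids, target):
        # hidden vector = mean of the document vector and context word vectors
        vecs = [doc_emb[doc_id]] + [word_emb[c] for c in ctx_ids]
        h = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
        scores = [sum(out_w[j][k] * h[k] for k in range(dim)) for j in range(V)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        probs = [e / Z for e in exps]
        loss = -math.log(probs[target])
        # softmax + cross-entropy gradients
        dh = [0.0] * dim
        for j in range(V):
            g = probs[j] - (1.0 if j == target else 0.0)
            for k in range(dim):
                dh[k] += g * out_w[j][k]
                out_w[j][k] -= lr * g * h[k]
        for v in vecs:  # gradient is shared through the mean
            for k in range(dim):
                v[k] -= lr * dh[k] / len(vecs)
        return loss

    losses = []
    for _ in range(epochs):
        total = 0.0
        for d, doc in enumerate(docs):
            for t, w in enumerate(doc):
                ctx = [widx[doc[j]]
                       for j in range(max(0, t - window),
                                      min(len(doc), t + window + 1))
                       if j != t]
                total += step(d, ctx, widx[w])
        losses.append(total)
    return word_emb, doc_emb, losses
```

After training, `word_emb` holds the per-word feature vectors recovered from the model parameters, as in claim 4, and `doc_emb` holds one vector per document identifier.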
  5. The method according to any one of claims 1 to 4, characterized in that the dimension of the feature vectors is smaller than the total number of the plurality of training documents.
  6. The method according to any one of claims 1 to 5, characterized by further comprising, after the document classification model is obtained by training with the binary classification algorithm:
    taking a plurality of documents not marked with the classification labels as unlabeled documents, and predicting, based on the unlabeled documents and a preset classification model, the predicted classification labels and the predicted classification probabilities of the plurality of unlabeled documents, wherein the preset classification model is capable of predicting classification probabilities, and the predicted classification label is the target category label or the non-target category label;
    screening the plurality of unlabeled documents whose predicted classification probabilities are higher than a preset probability threshold to obtain a plurality of new training documents;
    training the document classification model again with the binary classification algorithm based on the feature vectors and the predicted classification labels of the plurality of new training documents.
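The self-training loop of claim 6 hinges on the screening step: keep only the unlabeled documents whose predicted classification probability clears the preset threshold, then retrain on them with their predicted labels. A minimal sketch, assuming `predict_proba(doc)` returns the probability of the target category (an illustrative interface, not from the patent):

```python
def select_new_training_docs(unlabeled, predict_proba, threshold=0.9):
    """Pseudo-labeling step: keep only unlabeled documents whose predicted
    class probability exceeds the preset probability threshold."""
    new_docs, new_labels = [], []
    for doc in unlabeled:
        p = predict_proba(doc)
        # confidence is taken for the predicted class, target (1) or not (0)
        label, conf = (1, p) if p >= 0.5 else (0, 1.0 - p)
        if conf > threshold:
            new_docs.append(doc)
            new_labels.append(label)
    return new_docs, new_labels
```

The returned documents and their predicted labels would then be fed back into the binary training step to retrain the document classification model.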
  7. The method according to any one of claims 1 to 6, characterized by further comprising:
    adjusting the ratio of the target category label to the non-target category label among the plurality of training documents or the new training documents according to a preset positive-to-negative sample ratio.
  8. A method for document classification, characterized by using a document classification model trained by the method for training a document classification model according to any one of claims 1 to 7, the method comprising:
    learning, with an unsupervised learning algorithm, the feature vector of a document to be classified based on the contexts of the words in the document to be classified, the vectors of the words, and the identifier of the document to be classified;
    inputting the feature vector of the document to be classified into the document classification model for document classification.
  9. An apparatus for training a document classification model, characterized in that it comprises:
    a first learning unit, configured to learn, with an unsupervised learning algorithm, the feature vectors of a plurality of training documents based on the contexts of the words in the plurality of training documents, the vectors of the words, and the identifiers of the plurality of training documents;
    a training unit, configured to train a document classification model with a binary classification algorithm based on the feature vectors and the classification labels of the plurality of training documents, wherein the classification label is a target category label or a non-target category label.
  10. An apparatus for document classification, characterized by using a document classification model trained by the method for training a document classification model according to any one of claims 1 to 7, the apparatus comprising:
    a second learning unit, configured to learn, with an unsupervised learning algorithm, the feature vector of a document to be classified based on the contexts of the words in the document to be classified, the vectors of the words, and the identifier of the document to be classified;
    a document classification unit, configured to input the feature vector of the document to be classified into the document classification model for document classification.
PCT/CN2020/097869 2019-09-24 2020-06-24 Method for training document classification model, and related apparatus WO2021057133A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910907014.4A CN112632269A (en) 2019-09-24 2019-09-24 Method and related device for training document classification model
CN201910907014.4 2019-09-24

Publications (1)

Publication Number Publication Date
WO2021057133A1 true WO2021057133A1 (en) 2021-04-01

Family

ID=75165529

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097869 WO2021057133A1 (en) 2019-09-24 2020-06-24 Method for training document classification model, and related apparatus

Country Status (2)

Country Link
CN (1) CN112632269A (en)
WO (1) WO2021057133A1 (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202010A (en) * 2016-07-12 2016-12-07 重庆兆光科技股份有限公司 The method and apparatus building Law Text syntax tree based on deep neural network
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
WO2019035765A9 (en) * 2017-08-14 2019-03-21 Dathena Science Pte. Ltd. Methods, machine learning engines and file management platform systems for content and context aware data classification and security anomaly detection
CN109635107A (en) * 2018-11-19 2019-04-16 北京亚鸿世纪科技发展有限公司 The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source
CN109697285A (en) * 2018-12-13 2019-04-30 中南大学 Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11769072B2 (en) * 2016-08-08 2023-09-26 Adobe Inc. Document structure extraction using machine learning
CN106412563A (en) * 2016-09-30 2017-02-15 珠海市魅族科技有限公司 Image display method and apparatus
CN110019777B (en) * 2017-09-05 2022-08-19 腾讯科技(深圳)有限公司 Information classification method and equipment
CN109753567A (en) * 2019-01-31 2019-05-14 安徽大学 A kind of file classification method of combination title and text attention mechanism
CN110084275A (en) * 2019-03-29 2019-08-02 广州思德医疗科技有限公司 A kind of choosing method and device of training sample


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220366301A1 (en) * 2021-05-11 2022-11-17 Sap Se Model-independent confidence value prediction machine learned model
CN113468326A (en) * 2021-06-16 2021-10-01 北京明略软件系统有限公司 Method and device for determining document classification
CN113569953A (en) * 2021-07-29 2021-10-29 中国工商银行股份有限公司 Training method and device of classification model and electronic equipment
CN114817538A (en) * 2022-04-26 2022-07-29 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114817538B (en) * 2022-04-26 2023-08-08 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN115878793A (en) * 2022-05-25 2023-03-31 北京中关村科金技术有限公司 Multi-label document classification method and device, electronic equipment and medium
CN115878793B (en) * 2022-05-25 2023-08-25 北京中关村科金技术有限公司 Multi-label document classification method, device, electronic equipment and medium
CN115292498A (en) * 2022-08-19 2022-11-04 北京华宇九品科技有限公司 Document classification method, system, computer equipment and storage medium
WO2024139106A1 (en) * 2022-12-29 2024-07-04 上海智臻智能网络科技股份有限公司 Document representation model training method and apparatus, document representation method and apparatus, electronic device, and computer readable storage medium
CN115827876A (en) * 2023-01-10 2023-03-21 中国科学院自动化研究所 Method and device for determining unlabeled text and electronic equipment

Also Published As

Publication number Publication date
CN112632269A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
WO2021057133A1 (en) Method for training document classification model, and related apparatus
CN109471938B (en) Text classification method and terminal
US10235446B2 (en) Systems and methods for organizing data sets
Gui et al. Negative transfer detection in transductive transfer learning
Dhinakaran et al. App review analysis via active learning: reducing supervision effort without compromising classification accuracy
CN107004159B (en) Active machine learning
US20230052903A1 (en) System and method for multi-task lifelong learning on personal device with improved user experience
US20170344822A1 (en) Semantic representation of the content of an image
WO2015180622A1 (en) Method and apparatus for determining categorical attribute of queried word in search
Paramesh et al. Classifying the unstructured IT service desk tickets using ensemble of classifiers
US20230214679A1 (en) Extracting and classifying entities from digital content items
CN110598869B (en) Classification method and device based on sequence model and electronic equipment
CN112528010A (en) Knowledge recommendation method and device, computer equipment and readable storage medium
Sun et al. Active learning SVM with regularization path for image classification
Murty et al. Dark web text classification by learning through SVM optimization
Manne et al. Text categorization with K-nearest neighbor approach
CN111061870B (en) Article quality evaluation method and device
Alabdulkarim et al. Exploring Sentiment Analysis on Social Media Texts
Bahrami et al. Automatic image annotation using an evolutionary algorithm (IAGA)
Desale et al. Fake review detection with concept drift in the data: a survey
CN109284376A (en) Cross-cutting news data sentiment analysis method based on domain-adaptive
Akujuobi et al. Mining top-k popular datasets via a deep generative model
Zieba et al. Beta-boosted ensemble for big credit scoring data
Renuse et al. Multi label learning and multi feature extraction for automatic image annotation
Gebeyehu et al. A two step data mining approach for amharic text classification

Legal Events

121 — EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20869791; Country of ref document: EP; Kind code of ref document: A1)

NENP — Non-entry into the national phase (Ref country code: DE)

122 — EP: PCT application non-entry in European phase (Ref document number: 20869791; Country of ref document: EP; Kind code of ref document: A1)