CN112084338B - Automatic document classification method, system, computer equipment and storage medium - Google Patents

Automatic document classification method, system, computer equipment and storage medium

Info

Publication number
CN112084338B
Authority
CN
China
Prior art keywords
semantic
vector
text
training
language model
Prior art date
Legal status
Active
Application number
CN202010983960.XA
Other languages
Chinese (zh)
Other versions
CN112084338A (en)
Inventor
侯聪
陈运文
纪达麒
韩伟
白良俊
文敏
Current Assignee
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date
Filing date
Publication date
Application filed by Daguan Data Chengdu Co ltd
Priority to CN202010983960.XA
Publication of CN112084338A
Application granted
Publication of CN112084338B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an automatic document classification method, system, computer device and storage medium. The method first trains a language model from similar text data: pre-training proceeds in two stages, first training a language model on unlabeled text data, then training a semantic encoder on top of the language model using labeled data, namely similar text data. It then archives and classifies documents based on the semantic encoder: using the nearest-neighbor idea, texts are classified on a small dataset with an unsupervised method. Because the general-purpose semantic encoder is trained on a large amount of general-domain data, it encodes semantics effectively, requires no additional training on the very small dataset of a new practical scenario, and avoids the overfitting that would otherwise degrade generalization. Added or deleted documents and modifications to the classification taxonomy take effect simply by updating the affected documents or categories, without retraining the model, so timeliness is good.

Description

Automatic document classification method, system, computer equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, a system, a computer device, and a storage medium for automatically classifying documents.
Background
A document classification system is a text classification application in which a model automatically assigns input text to a category. Document classification is typically applied in specific, relatively narrow professional domains, and therefore faces two problems:
1. Labeled data is scarce. In many application domains, public data is hard to collect owing to high specialization or confidentiality requirements; in deployment, some categories may contain only a handful of documents;
2. In practical use, users dynamically add and delete data, and may even change the classification taxonomy over time.
Common supervised text classifiers, such as FastText, TextCNN, and BERT-based text classification applications, all follow the same paradigm: training data is collected for a given taxonomy, a model is trained, and document classification predictions can be made once training finishes. This scheme works well when fully annotated data is available; however, applying it directly to document classification in professional domains with sparse labeled data has the following drawbacks:
1. With little data, the model overfits easily, reducing its generalization capability and degrading prediction quality;
2. When data or the taxonomy changes dynamically, the model must be retrained frequently, placing heavy demands on hardware.
Disclosure of Invention
To solve these problems, the invention provides an automatic document classification method, system, computer device and storage medium, in which a semantic model pre-trained on a large amount of data, and therefore with strong generalization capability, is applied to small-data scenarios so that overfitting is avoided; meanwhile, the nearest-neighbor idea over cluster centers is used to retrieve the matching category, avoiding the frequent model retraining required in practical applications.
The invention discloses an automatic document classification method, comprising the following steps:
S1, train a language model from similar text data: pre-training proceeds in two stages, first training a language model on unlabeled text data, then training a semantic encoder on top of the language model using labeled data, namely similar text data;
S2, archive and classify based on the semantic encoder: using the nearest-neighbor idea, texts are classified on a small dataset with an unsupervised method built on the semantic encoder.
Further, step S1 comprises the sub-steps of:
s101, training by adopting unlabeled text data: selecting a language model based on a self-attention framework, and training the language model on unlabeled text data to enable the language model to learn the common sense of a target language;
s102, training by adopting similar text data: obtaining similar text data in the general field to form similar text pairs comprising anchor texts and similar texts, and randomly obtaining a dissimilar text in a corpus aiming at each similar text pair to form training data comprising the anchor texts, the similar texts and the dissimilar texts, so that the anchor texts and the similar texts are semantically related and are semantically unrelated; training a plurality of pieces of training data based on the language model, respectively inputting anchor point text, similar text and dissimilar text into the same language model, and respectively obtaining vectors V representing respective semantics a ,V p ,V n And then, calculating a ternary loss function to obtain loss, and retraining the language model to obtain the semantic encoder.
Further, the expression of the ternary loss function is as follows:
loss = max{ ||V_a - V_p||_2 - ||V_a - V_n||_2 + margin, 0 }
wherein loss is the loss value, ||V_a - V_p||_2 denotes the spatial distance between V_a and V_p, ||V_a - V_n||_2 denotes the spatial distance between V_a and V_n, and margin is a constant representing a desired spatial separation; the ternary loss function pulls the anchor text and the similar text closer together while pushing the anchor text and the dissimilar text farther apart.
Further, the semantic encoder thereby learns the ability to semantically encode text: the more similar two texts are, the closer in space their semantic vectors produced by the semantic encoder lie, and conversely, the less similar the texts, the farther apart the vectors lie.
Further, obtaining similar text data in the general domain includes crawling similar-text recommendation information from websites with a web crawler.
Further, step S2 comprises the sub-steps of:
s201, constructing a classification system and uploading a plurality of documents to each classification: uploading the documents to the semantic encoder according to classification, and encoding each document by the semantic encoder to obtain a semantic vector and storing the semantic vector into a vector database according to a classification system; the semantic vector under each category forms a vector set, and the cluster center of the vector set is calculated to be used as the feature vector of the category;
s202, classifying new documents: uploading a document to be classified by a user, and carrying out semantic vector coding by the semantic coder to obtain a semantic vector of the document to be classified; searching a feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into a class corresponding to the feature vector.
Further, the user can modify the classification system or the classified document, and the feature vector of the modified classification changes accordingly.
The invention relates to an automatic document classification system, which comprises a semantic encoder, a vector database and a vector retrieval module;
the semantic encoder is trained on the basis of a language model according to unlabeled text data and similar text data, and is used for storing semantic vectors obtained by encoding classified documents into the vector database according to a classification system and outputting the semantic vectors obtained by encoding the documents to be classified to the vector retrieval module;
in the vector database, semantic vectors under each category form a vector set, and the cluster center of the vector set is used as the characteristic vector of the category;
the vector retrieval module is used for searching the feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into the category corresponding to the feature vector.
Further, a computer device comprises a memory and a processor, the memory storing a computer program; when the processor executes the program, the steps of the above automatic document classification method are implemented.
Further, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above automatic document classification method.
The invention has the beneficial effects that:
1. In the automatic document classification method, the first step runs offline: two-stage pre-training yields a semantic encoder that fully represents the semantic features of text, keeping the bulk of model-training computation offline. The second step runs online: texts are classified on the small dataset with an unsupervised nearest-neighbor method, and because the online part is unsupervised, pressure on hardware is relieved.
2. The general-purpose semantic encoder is trained on a large amount of general-domain data, so it encodes semantics effectively; no additional training is needed on the very small dataset of a new practical scenario, which avoids overfitting and the poor generalization it would cause.
3. Changes take effect simply by updating the affected documents or categories, with no model retraining, giving good timeliness and low hardware requirements.
Drawings
FIG. 1 is a schematic diagram of training data based on a language model in an embodiment of the present invention;
FIG. 2 is a schematic diagram of archive categorization based on semantic encoders in an embodiment of the invention.
Detailed Description
Specific embodiments of the present invention will now be described in order to provide a clearer understanding of the technical features, objects and effects of the present invention. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
Example 1
This embodiment provides an automatic document classification method, comprising the following steps:
S1, train a language model from similar text data: pre-training proceeds in two stages, first training a language model on unlabeled text data, then training a semantic encoder on top of the language model using labeled data, namely similar text data;
S2, archive and classify based on the semantic encoder: using the nearest-neighbor idea, texts are classified on a small dataset with an unsupervised method built on the semantic encoder.
Step S1 is performed offline: two-stage pre-training yields a semantic encoder capable of fully representing the semantic features of text, so the bulk of model-training computation stays offline. Step S2 is performed online: texts are classified on the small dataset with an unsupervised nearest-neighbor method, and because the online part is unsupervised, pressure on hardware is relieved.
Specifically, step S1 includes the following sub-steps:
s101, training by adopting unlabeled text data: selecting a language model based on a self-attention framework, and training the language model on unlabeled text data to enable the language model to learn the common sense of a target language;
s102, training by adopting similar text data: obtaining similar text data in the general field (for example, crawling similar text recommendation information of a website through a crawler) to form similar text pairs comprising anchor texts and similar texts, and randomly fetching a dissimilar text in a corpus for each similar text pair to form training data comprising the anchor texts, the similar texts and the dissimilar texts, so that the anchor texts and the similar texts are semantically related and semantically uncorrelated; training several pieces of training data based on language model, as shown in figure 1, respectively inputting anchor text, similar text and dissimilar text into the same language model (meaning encoder in figure 1), and respectively obtaining vectors V representing respective semantics a ,V p ,V n And then, calculating a ternary loss function to obtain loss, and retraining a language model to obtain the semantic encoder.
For example, take a similar text pair A, B, where A and B are semantically similar, and randomly draw a text C such that C has no semantic-similarity relationship with either A or B. A then serves as the anchor text: during training, the ternary loss function pulls the encoding vector of B (similar in meaning to A) close to the vector of A, and pushes the encoding vector of C (dissimilar in meaning to A) away from A. In this process, text A acts as the "anchor", serving as the reference point. Similarly, text B may also be used as an anchor text.
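The patent does not fix a concrete sampling procedure; the following is a minimal sketch of how such triples might be assembled, with illustrative function names and the assumption that the corpus is much larger than any one pair.

```python
import random

def sample_negative(rng, corpus, exclude):
    """Draw a random corpus text outside `exclude` to serve as the
    dissimilar (negative) member of a triple."""
    negative = rng.choice(corpus)
    while negative in exclude:  # assumes the corpus dwarfs `exclude`
        negative = rng.choice(corpus)
    return negative

def build_triplets(similar_pairs, corpus, seed=42):
    """Assemble (anchor, similar, dissimilar) training triples from
    similar-text pairs plus randomly drawn negatives."""
    rng = random.Random(seed)
    triplets = []
    for a, b in similar_pairs:
        # A as anchor, B as the similar text ...
        triplets.append((a, b, sample_negative(rng, corpus, (a, b))))
        # ... and, as noted above, B may equally serve as the anchor.
        triplets.append((b, a, sample_negative(rng, corpus, (a, b))))
    return triplets
```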
More specifically, the ternary loss function is given by the following expression:
loss = max{ ||V_a - V_p||_2 - ||V_a - V_n||_2 + margin, 0 }
where loss is the loss value; a denotes the anchor text, p denotes a text similar to the anchor (positive), and n denotes a text dissimilar to the anchor (negative); V_a, V_p, V_n are the semantic vectors of the three texts obtained after passing through the encoder. ||·||_2 denotes the 2-norm of a vector, so ||V_a - V_p||_2 is the spatial distance between V_a and V_p, and ||V_a - V_n||_2 is the spatial distance between V_a and V_n; margin is a constant representing a desired spatial separation. Optimizing the ternary loss function means making the distance between V_a and V_n exceed the distance between V_a and V_p, preferably by more than margin.
The ternary loss function pulls the anchor text and similar text closer together while pushing the anchor text and dissimilar text farther apart. The semantic encoder thereby learns the ability to semantically encode text: the more similar two texts are, the closer their semantic vectors lie in space, and conversely, the less similar they are, the farther apart the vectors lie.
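For reference, the ternary loss can be written in a few lines of PyTorch; this is a minimal sketch assuming batched encoder outputs, not the patent's implementation (torch.nn.TripletMarginLoss offers equivalent behavior out of the box).

```python
import torch
import torch.nn.functional as F

def ternary_loss(v_a, v_p, v_n, margin=1.0):
    """loss = max(||V_a - V_p||_2 - ||V_a - V_n||_2 + margin, 0),
    averaged over a batch of (anchor, similar, dissimilar) vectors,
    each of shape (batch, dim)."""
    d_pos = F.pairwise_distance(v_a, v_p, p=2)  # ||V_a - V_p||_2
    d_neg = F.pairwise_distance(v_a, v_n, p=2)  # ||V_a - V_n||_2
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```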
Specifically, as shown in FIG. 2, step S2 includes the following sub-steps:
s201, constructing a classification system and uploading a plurality of documents to each classification: uploading the documents to a semantic encoder according to classification, encoding each document by the semantic encoder to obtain semantic vectors, and storing the semantic vectors into a vector database according to a classification system; the semantic vector under each category forms a vector set, and the cluster center of the vector set is calculated to be used as the feature vector of the category;
s202, classifying new documents: uploading a document to be classified by a user, and carrying out semantic vector coding by a semantic coder to obtain a semantic vector of the document to be classified; searching a feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into a class corresponding to the feature vector.
Alternatively, the user can modify the taxonomy or the archived documents, and the feature vectors of the modified categories change accordingly.
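Sub-steps S201 and S202 reduce to a mean vector per category plus a nearest-neighbor lookup. A minimal sketch, assuming documents are already encoded as numpy vectors and taking the arithmetic mean as the cluster center (the patent says "cluster center" without fixing the algorithm):

```python
import numpy as np

def feature_vector(semantic_vectors):
    """S201: the cluster center of one category's vector set, used as the
    category's feature vector; here the element-wise mean."""
    return np.mean(np.stack(semantic_vectors), axis=0)

def classify(doc_vector, feature_vectors):
    """S202: return the category whose feature vector is spatially closest
    (Euclidean distance) to the new document's semantic vector."""
    return min(feature_vectors,
               key=lambda c: np.linalg.norm(doc_vector - feature_vectors[c]))
```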
In addition, the embodiment provides an automatic document classification system, which comprises a semantic encoder, a vector database and a vector retrieval module, wherein:
the semantic encoder is trained on the basis of a language model according to unlabeled text data and similar text data, and is used for storing semantic vectors obtained by encoding classified documents into a vector database according to a classification system and outputting the semantic vectors obtained by encoding the documents to be classified to a vector retrieval module;
in the vector database, the semantic vector under each category forms a vector set, and the cluster center of the vector set is used as the characteristic vector of the category;
and the vector retrieval module is used for searching the feature vector closest to the semantic vector space of the document to be classified in the vector database and classifying the document to be classified into the category corresponding to the feature vector.
This embodiment also provides a computer device comprising a memory and a processor, the memory storing a computer program; when the processor executes the program, the steps of the above automatic document classification method are implemented.
This embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above automatic document classification method.
Example 2
This example builds on Example 1:
the automatic document classification method of the embodiment comprises the following two stages:
1. system preparation phase
1. Crawl a large amount of text from the web and train a language model based on a self-attention architecture (one possible setup is sketched after this list).
2. Crawl a large volume of related-question and related-document data from various websites, and form similar-text training data by random sampling.
3. Train the semantic encoder on top of the language model using the large body of similar-text training data.
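The patent specifies only "a language model based on a self-attention architecture". As a hedged illustration, masked-language-model pretraining of a BERT-style model with the Hugging Face transformers library is one common way to realize step 1; the corpus file name and hyperparameters below are assumptions, not values from the patent.

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical file of crawled, unlabeled text, one passage per line.
raw = load_dataset("text", data_files={"train": "crawled_corpus.txt"})

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# A freshly initialized self-attention (Transformer) language model.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

Trainer(model=model,
        args=TrainingArguments(output_dir="lm_checkpoint",
                               num_train_epochs=1,
                               per_device_train_batch_size=16),
        train_dataset=tokenized,
        data_collator=collator).train()
```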
2. Implementation stage
In the user's scenario, documents fall into two categories, financial and personnel, but each category has only three sample documents. The implementation proceeds as follows:
1. The system converts the three financial documents into semantic vectors through the semantic encoder, computes the cluster-center vector vector_accounting of the three vectors, and stores it in the vector database. The same operation on the personnel documents yields their representative vector vector_hr. After this step, the classification taxonomy is established.
2. The user enters a new document D and uploads it to the system. D is converted into a semantic vector by the semantic encoder; if vector retrieval finds that the nearest category vector is vector_hr, the system classifies document D as a personnel document.
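This two-class scenario can be walked through the DocumentClassificationSystem sketch above; the toy encoder and document strings below are placeholders for the trained semantic encoder and real documents.

```python
import numpy as np

def toy_encoder(text):
    """Stand-in for the trained semantic encoder: a hash-seeded random
    embedding, deterministic within one run, just to exercise the flow."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

clf = DocumentClassificationSystem(toy_encoder)
# Three samples per class establish the taxonomy (vector_accounting, vector_hr).
clf.archive("finance", ["Q3 budget report", "invoice ledger", "audit summary"])
clf.archive("hr", ["onboarding checklist", "leave policy", "payroll roster"])

# An archived text's own vector pulls its class centroid toward it, so this
# very likely resolves to "finance".
print(clf.classify("audit summary"))
```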
Alternatively, if the user introduces a new document category, such as reports, with two sample documents, the implementation proceeds as follows (see the snippet below):
After the user creates the new category and uploads the samples, the semantic encoder converts the two samples into semantic vectors, computes their cluster center vector_report, and stores it in the vector database. When the user subsequently uploads a document to be classified, the report category is included in the candidate set during classification.
Optionally, if the user adds documents to or deletes documents from an existing category, the implementation proceeds as follows:
1. The user deletes one of the financial documents; the system automatically recomputes the cluster center of the remaining financial documents' semantic vectors to obtain a new vector_accounting and stores it in the vector database. When new documents are subsequently classified, the financial category is represented by the new vector_accounting.
2. The user adds a financial document; the semantic encoder converts the new document into a semantic vector, and the system automatically recomputes the cluster center over the semantic vectors of all financial documents, including the new one, storing the result in the vector database as the new vector_accounting. When new documents are subsequently classified, the financial category is represented by the new vector_accounting.
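The add and delete behavior just described amounts to re-computing the affected cluster center. A sketch extending DocumentClassificationSystem from above, with hypothetical method names; no retraining of the encoder is involved.

```python
import numpy as np

class DynamicDocumentClassificationSystem(DocumentClassificationSystem):
    """Adding or deleting an archived document only refreshes that
    category's cluster center (e.g. the new vector_accounting)."""

    def add_document(self, category, document):
        self.vectors.setdefault(category, []).append(self.encoder(document))
        self._refresh(category)

    def delete_document(self, category, index):
        self.vectors[category].pop(index)
        self._refresh(category)

    def _refresh(self, category):
        vecs = self.vectors[category]
        if vecs:   # recompute the category's feature vector
            self.feature_vectors[category] = np.mean(np.stack(vecs), axis=0)
        else:      # an emptied category drops out of classification
            del self.vectors[category]
            self.feature_vectors.pop(category, None)
```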
The foregoing is merely a preferred embodiment of the invention. It is to be understood that the invention is not limited to the forms disclosed herein and is not to be construed as excluding other embodiments; it may be used in various other combinations, modifications and environments, and may be altered within the scope of the inventive concept described herein, whether through the above teachings or through the skill and knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (6)

1. An automatic document classification method, characterized by comprising the following steps:
S1, training a language model offline from similar text data: pre-training proceeds in two stages, first training a language model on unlabeled text data, then training a semantic encoder on top of the language model using labeled data, namely similar text data;
S2, archiving and classifying online based on the semantic encoder: using the nearest-neighbor idea, classifying texts on a small dataset with an unsupervised method built on the semantic encoder;
step S1 comprises the following sub-steps:
s101, training by adopting unlabeled text data: selecting a language model based on a self-attention framework, and training the language model on unlabeled text data to enable the language model to learn the common sense of a target language;
s102, training by using similar text data: obtaining similar text data in the general field to form similar text pairs comprising anchor texts and similar texts, and randomly obtaining a dissimilar text in a corpus aiming at each similar text pair to form training data comprising the anchor texts, the similar texts and the dissimilar texts, so that the anchor texts and the similar texts are semantically related and are semantically unrelated; training a plurality of pieces of training data based on the language model, respectively inputting anchor point text, similar text and dissimilar text into the same language model, and respectively obtaining vectors representing respective semanticsV a , V p , V n Then, calculating a ternary loss function to obtain loss, and retraining the language model to obtain a semantic encoder;
the expression of the ternary loss function is as follows:
wherein,lossto be lost, ||V a - V p || 2 Representation ofV a AndV p distance in space, |V a - V n || 2 Representation ofV a AndV n the distance in the space is such that,marginis constant and represents a desired spatial distance; the ternary loss function can shorten the distance between the anchor point text and the similar text, and the distance between the anchor point text and the dissimilar text is shortened;
step S2 comprises the following sub-steps:
s201, constructing a classification system and uploading a plurality of documents to each classification: uploading the documents to the semantic encoder according to classification, and encoding each document by the semantic encoder to obtain a semantic vector and storing the semantic vector into a vector database according to a classification system; the semantic vector under each category forms a vector set, and the cluster center of the vector set is calculated to be used as the feature vector of the category;
s202, classifying new documents: uploading a document to be classified by a user, and carrying out semantic vector coding by the semantic coder to obtain a semantic vector of the document to be classified; searching a feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into a class corresponding to the feature vector;
the user can modify the classification system or the classified documents, and the feature vectors of the modified classification change accordingly.
2. The automatic document classification method according to claim 1, wherein the semantic encoder learns the ability to semantically encode text, i.e., the more similar two texts are, the closer in space their semantic vectors produced by the semantic encoder lie, and conversely the farther apart they lie.
3. The automatic document classification method according to claim 1, wherein obtaining similar text data in the general domain comprises: crawling similar-text recommendation information from websites with a web crawler.
4. An automatic document classification system based on the automatic document classification method according to claim 1, comprising a semantic encoder, a vector database and a vector retrieval module;
the semantic encoder is trained on the basis of a language model according to unlabeled text data and similar text data, and is used for storing semantic vectors obtained by encoding classified documents into the vector database according to a classification system and outputting the semantic vectors obtained by encoding the documents to be classified to the vector retrieval module;
in the vector database, semantic vectors under each category form a vector set, and the cluster center of the vector set is used as the characteristic vector of the category;
the vector retrieval module is used for searching the feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into the category corresponding to the feature vector.
5. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1-3 when the computer program is executed.
6. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any of claims 1-3.
CN202010983960.XA 2020-09-18 2020-09-18 Automatic document classification method, system, computer equipment and storage medium Active CN112084338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010983960.XA CN112084338B (en) 2020-09-18 2020-09-18 Automatic document classification method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010983960.XA CN112084338B (en) 2020-09-18 2020-09-18 Automatic document classification method, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112084338A CN112084338A (en) 2020-12-15
CN112084338B (en) 2024-02-06

Family

ID=73736568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010983960.XA Active CN112084338B (en) 2020-09-18 2020-09-18 Automatic document classification method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112084338B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233304B (en) * 2022-11-30 2024-04-05 荣耀终端有限公司 Schedule-based equipment state synchronization system, method and device
CN116910275B (en) * 2023-09-12 2023-12-15 无锡容智技术有限公司 Form generation method and system based on large language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220311A (en) * 2017-05-12 2017-09-29 北京理工大学 A kind of document representation method of utilization locally embedding topic modeling
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109003625A (en) * 2018-07-27 2018-12-14 中国科学院自动化研究所 Speech-emotion recognition method and system based on ternary loss
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110889443A (en) * 2019-11-21 2020-03-17 成都数联铭品科技有限公司 Unsupervised text classification system and unsupervised text classification method
CN111259850A (en) * 2020-01-23 2020-06-09 同济大学 Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11087088B2 (en) * 2018-09-25 2021-08-10 Accenture Global Solutions Limited Automated and optimal encoding of text data features for machine learning models

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220311A (en) * 2017-05-12 2017-09-29 北京理工大学 A kind of document representation method of utilization locally embedding topic modeling
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109003625A (en) * 2018-07-27 2018-12-14 中国科学院自动化研究所 Speech-emotion recognition method and system based on ternary loss
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110889443A (en) * 2019-11-21 2020-03-17 成都数联铭品科技有限公司 Unsupervised text classification system and unsupervised text classification method
CN111259850A (en) * 2020-01-23 2020-06-09 同济大学 Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Comparison and Classification of Documents Based on Layout Similarity; Jianying Hu et al.; Information Retrieval; 227-243 *
Semi-supervised triplet loss based learning of ambient audio embeddings; N. Turpault et al.; ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 760-764 *
Research on visual semantic embedding based on deep learning and word embedding; 杨战波; China Master's Theses Full-text Database, Information Science and Technology; I138-165 *
Microblog topic discovery combining word vectors and keyword extraction; 王立平 et al.; Modern Computer; 3-9 *

Also Published As

Publication number Publication date
CN112084338A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
US11640494B1 (en) Systems and methods for construction, maintenance, and improvement of knowledge representations
AU2016256753B2 (en) Image captioning using weak supervision and semantic natural language vector space
US9792534B2 (en) Semantic natural language vector space
US11238211B2 (en) Automatic hyperlinking of documents
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
Li et al. The automatic text classification method based on bert and feature union
GB2546360A (en) Image captioning with weak supervision
WO2023065211A1 (en) Information acquisition method and apparatus
Li et al. Image sentiment prediction based on textual descriptions with adjective noun pairs
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN112084338B (en) Automatic document classification method, system, computer equipment and storage medium
Zhu et al. Multi-attention based semantic deep hashing for cross-modal retrieval
CN115689672A (en) Chat type commodity shopping guide method and device, equipment and medium thereof
Tian et al. Sequential deep learning for disaster-related video classification
Wang et al. A text classification method based on LSTM and graph attention network
TW201243627A (en) Multi-label text categorization based on fuzzy similarity and k nearest neighbors
US20230162518A1 (en) Systems for Generating Indications of Relationships between Electronic Documents
CN116977701A (en) Video classification model training method, video classification method and device
US20230237093A1 (en) Video recommender system by knowledge based multi-modal graph neural networks
CN114329181A (en) Question recommendation method and device and electronic equipment
CN111666452A (en) Method and device for clustering videos
Zuo et al. Cross-modality earth mover’s distance-driven convolutional neural network for different-modality data
Sangeetha et al. Sentiment Analysis on Movie Reviews: A Comparative Analysis
Kothari et al. Multimodal context extraction for virtual meetings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant