CN115048515A - Document classification method, device, equipment and storage medium - Google Patents
- Publication number: CN115048515A
- Application number: CN202210650047.7A
- Authority: CN (China)
- Legal status: Pending (an assumption, not a legal conclusion; no legal analysis has been performed)
Classifications
- G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
- G06F40/216 — Natural language analysis: parsing using statistical methods
- G06F40/253 — Natural language analysis: grammatical analysis; style critique
- G06F40/279 — Natural language analysis: recognition of textual entities; G06F40/289 — phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Semantic analysis
Abstract
The invention discloses a document classification method, apparatus, device, and storage medium. The method extracts keywords from the document to be classified to form a keyword knowledge base, and obtains word embedding vectors of the document and of the keyword knowledge base through an improved BERT model. The two sets of word embedding vectors are then fused through an attention mechanism, which strengthens the correlation between the document to be classified and the keyword knowledge base, provides more effective feature information to the document classifier, reduces interference from invalid information, and improves the accuracy of long-document classification.
Description
Technical Field
The present invention relates to the field of natural language processing, and in particular to a document classification method, apparatus, device, and storage medium.
Background
In recent years, with the rise of deep learning, many models have been proposed in natural language processing, such as the feature-based word2vec model used with machine learning, the deep-learning-based Bi-LSTM-Attention model, and continuously improved convolutional neural network models. Among them, the BERT model, built on the Transformer architecture, has shown excellent performance both in training and in practical tests.
However, when the prior-art BERT model is applied to document classification, it lacks deep semantic fusion with the document to be classified and is strongly affected by irrelevant factors during training. As a result, document category analysis becomes fuzzy, overly long documents containing large amounts of invalid information are classified poorly, and the boundaries between categories are easily confused.
Disclosure of Invention
In view of this, the present invention provides a document classification method, apparatus, device, and storage medium that, compared with existing document classification using a BERT model, reduce the interference of irrelevant elements and highlight the contribution of effective key information, thereby improving the classification accuracy of documents.
In a first aspect, the present invention provides a document classification method, including the following steps:
extracting keywords of the documents to be classified, and constructing a keyword knowledge base;
acquiring word embedding vectors of the documents to be classified and word embedding vectors of the keyword knowledge base;
fusing the word embedding vectors of the document to be classified and the word embedding vectors of the keyword knowledge base through an attention mechanism to obtain a fusion vector;
and taking the fusion vector as the input of a fully connected layer, and obtaining the classification result of the document to be classified from the output of the fully connected layer.
In a possible implementation manner of the first aspect of the present invention, the constructing the keyword knowledge base specifically includes the following steps:
preprocessing a document to be classified;
extracting a document vector of the preprocessed document to be classified through a first BERT model, wherein the first BERT model is a BERT model that has been pre-trained, the pre-training process comprising two training tasks, MLM and NSP;
extracting word vectors of candidate keywords from the document vector using an N-gram model;
and calculating the similarity between the word vectors of the candidate keywords and the document vector, and determining the keywords of the document to be classified according to the similarity, so as to form the keyword knowledge base.
Preferably, the similarity between a candidate keyword and the document to be classified is determined by computing the cosine of the angle between the keyword's word vector and the document vector.
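As a concrete illustration, the cosine-based ranking of candidate keywords can be sketched in pure Python (a minimal sketch; the function names and toy vectors are illustrative, not part of the claimed method):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def rank_candidates(doc_vec, candidates):
    """Rank candidate keywords by similarity to the document vector.
    candidates: dict mapping keyword -> word vector."""
    scored = [(kw, cosine_similarity(vec, doc_vec)) for kw, vec in candidates.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

The highest-ranked candidates would then be taken as the document's keywords.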
In a possible implementation manner of the first aspect of the present invention, the word embedding vector of the document to be classified and the word embedding vectors of the keyword knowledge base are obtained through a second BERT model, wherein the second BERT model is a BERT model that has been pre-trained and whose pre-training process does not include the NSP training task.
In one possible implementation of the first aspect of the invention, the attention mechanism is a multi-head attention mechanism.
In a second aspect, the present invention provides a document classification apparatus, including:
the keyword generation module is used for extracting keywords of the documents to be classified and constructing a keyword knowledge base;
the word embedding module is used for acquiring word embedding vectors of the documents to be classified and word embedding vectors of the keyword knowledge base;
the fusion module is used for fusing the word embedding vectors of the document to be classified and the word embedding vectors of the keyword knowledge base through an attention mechanism to obtain a fusion vector;
and the fully connected layer is used for obtaining the classification result of the document to be classified from the fusion vector.
In a possible implementation manner of the second aspect of the present invention, the keyword generation module includes:
the document embedding module is used for extracting document vectors of the preprocessed documents to be classified; the document embedding module comprises a first BERT model that has been pre-trained, and the pre-training process of the first BERT model comprises an MLM (masked language modeling) training task and an NSP (next sentence prediction) training task;
the candidate keyword generation module comprises an N-gram model used for extracting word vectors of candidate keywords from the document vectors;
and the keyword selection module is used for calculating the cosine similarity between the word vectors of the candidate keywords and the document vector, and determining the keywords of the document to be classified according to the cosine similarity.
In a possible implementation manner of the second aspect of the present invention, the word embedding module includes a second BERT model that has been pre-trained; the pre-training process of the second BERT model includes an MLM training task but not an NSP training task, and the MLM training task uses dynamic masking.
In a third aspect, the present invention provides a document classification device, including: a memory and at least one processor, the memory having stored therein a computer program; the at least one processor invokes the computer program in the memory to cause the document classification device to perform the document classification method of the first aspect of the invention.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, the computer program, when being executed by a processor, implementing the document classification method according to the first aspect of the present invention.
According to the technical scheme provided by the invention, keywords of the document to be classified are extracted to form a keyword knowledge base, and the information of the keyword knowledge base and of the document to be classified is fused through a multi-head attention mechanism. This strengthens the correlation between the document information embedded by the second BERT model and the keyword knowledge base, provides more effective feature information to the document classifier, greatly reduces interference from invalid information, and improves the accuracy of long-document classification.
In addition, because the keyword knowledge base contains keyword information drawn from the document to be classified itself, no knowledge base needs to be constructed in advance; instead, a complete document classification model is built, and keyword extraction and model training proceed together at run time. This reduces memory consumption, avoids the complex management that a large external knowledge store would require, reduces the generation of erroneous information, and improves the stability and reliability of model operation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic diagram of a document classification apparatus according to embodiment 1 of the present invention;
fig. 2 is a flowchart of a document classification method provided in embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
It should be noted that the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," "third," "fourth," and the like are used to distinguish between similar elements and do not necessarily describe a particular sequential or chronological order. It should be understood that the numbering so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated.
Example 1
Referring to fig. 1, a document classifying device provided in embodiment 1 of the present application includes:
the document preprocessing module is used for preprocessing documents to be classified, such as document data segmentation and word segmentation;
the document embedding module is used for acquiring a document vector of a document to be classified to obtain document level representation;
the candidate keyword generation module, comprising an N-gram model used for extracting word vectors of candidate keywords from the document vector;
the keyword selection module, used for calculating the cosine similarity between the word vectors of the candidate keywords and the document vector, and determining the keywords of the document to be classified according to the cosine similarity;
the keyword knowledge base is used for storing the selected keywords;
the word embedding module is used for acquiring word embedding vectors of the documents to be classified and word embedding vectors of the keyword knowledge base;
the fusion module, used for fusing the word embedding vectors of the document to be classified and the word embedding vectors of the keyword knowledge base through an attention mechanism to obtain a fusion vector;
and the fully connected layer, used for obtaining the classification result of the document to be classified from the fusion vector.
It should be understood that, in some possible embodiments of the present invention, the document classification apparatus may further include an acquisition module for the document to be classified, such as an OCR module that scans an image to obtain the document; in other possible embodiments, the apparatus may further include a document error correction module configured to correct the acquired document to be classified.
Example 2
Referring to fig. 1 and fig. 2, an embodiment 2 of the present invention adopts the document classification device provided in embodiment 1, and provides a document classification method, including:
s1, extracting keywords of the document to be classified, and constructing a keyword knowledge base;
step S2, obtaining word embedding vectors of the documents to be classified and word embedding vectors of the keyword knowledge base;
s3, fusing the word embedding vector of the document to be classified and the word embedding vector of the keyword knowledge base through an attention mechanism to obtain a fusion vector;
and step S4, taking the fusion vector as the input of the full connection layer, and obtaining the classification result of the document to be classified from the output of the full connection layer.
In the related art, document keyword extraction methods fall mainly into statistics-based and machine-learning-based approaches. Statistics-based methods compute word weights from statistical information such as word frequency and extract keywords by ranking the weights. TF-IDF and TextRank both belong to this category: TF-IDF derives word weights from term frequency (TF) and inverse document frequency (IDF), while TextRank, following the idea of PageRank, builds a co-occurrence network from a word co-occurrence window and computes word scores. Statistics-based methods are simple to implement, but they ignore word order, and their extraction accuracy is poor on documents that are overly long and contain large amounts of invalid information. Machine-learning-based methods include supervised approaches such as SVM and naive Bayes, and unsupervised approaches such as K-means and hierarchical clustering. For deep-learning-based keyword extraction, model quality depends on the effectiveness of feature extraction.
Embodiment 2 of the present invention classifies long Chinese documents. In this embodiment, the method for extracting keywords from the document to be classified in step S1 specifically includes the following steps:
step S11, preprocessing the document to be classified;
step S12, extracting the document vector of the preprocessed document to be classified through a first BERT model, wherein the first BERT model is a BERT model that has been pre-trained, the pre-training process comprising two training tasks, MLM and NSP;
step S13, extracting word vectors of candidate keywords from the document vector using an N-gram model;
and step S14, calculating the similarity between the word vector of the candidate keyword and the document vector, and determining the keyword of the document to be classified according to the similarity to form a keyword knowledge base.
In step S11 of embodiment 2, preprocessing of the document to be classified includes document data segmentation and word segmentation. Because documents differ in length, the document data segmentation method of embodiment 2 operates as follows. First, an appropriate fixed length, denoted pad_size, is set, and the document dataset is divided into "long" and "short" documents according to pad_size. For a "short" document, the token-level word vector is padded at the tail to keep its length at pad_size, and the mask is filled with the corresponding "1"s and "0"s; for a "long" document, the token-level word vector is truncated so that its final length is also pad_size, while the mask is set to pad_size "1"s. In this way, documents of different lengths are embedded at the same token size, and effective key information is added to the word vectors. Embodiment 2 uses the jieba Chinese tokenizer to segment the documents to be classified; other possible embodiments may use other segmentation tools, such as LAC (Lexical Analysis of Chinese) or HanLP (a Chinese language processing package).
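The fixed-length padding and truncation scheme described above can be sketched as follows (a simplified illustration at the token-ID level; the function name and the choice of `pad_id` are assumptions for illustration, not part of the original):

```python
def pad_or_truncate(token_ids, pad_size, pad_id=0):
    """Return (ids, mask) of fixed length pad_size.
    "Short" documents are right-padded with pad_id and the mask
    marks padding positions with 0; "long" documents are truncated
    and the mask is all 1s, as described in step S11."""
    if len(token_ids) >= pad_size:
        return token_ids[:pad_size], [1] * pad_size
    n = len(token_ids)
    ids = token_ids + [pad_id] * (pad_size - n)
    mask = [1] * n + [0] * (pad_size - n)
    return ids, mask
```

Every document thus reaches the embedding model at the same token size.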
In step S12 of embodiment 2, a document-level representation is obtained by extracting the document vector of the document to be classified with a pre-trained BERT model. BERT stands for Bidirectional Encoder Representations from Transformers; it is a Transformer-based bidirectional encoder representation and a pre-trained language representation model. Unlike a traditional unidirectional language model, or the shallow concatenation of two separately pre-trained unidirectional models, the pre-training of the first BERT model in embodiment 2 mainly comprises two tasks, MLM and NSP, so as to produce deep bidirectional language representations. MLM (masked language modeling) randomly masks certain positions in the input sequence and has the model predict them. NSP (next sentence prediction) feeds two sentences into the model at once and predicts whether the second sentence is the actual next sentence of the first.
In step S13 of embodiment 2, word vectors of candidate keywords are extracted from the document vector by an N-gram model. An N-gram is a contiguous sequence of N words appearing in a phrase or sentence, and the corresponding language model factorizes the probability of a sentence as shown in formula (1-1):

p(w_1, w_2, ..., w_m) ≈ ∏_{i=1}^{m} p(w_i | w_{i-2}, w_{i-1})    (1-1)

In formula (1-1), p(w_1, w_2, ..., w_m) represents the probability distribution of a sentence under a trigram model: each word is conditioned only on the two words preceding it.
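A minimal illustration of the trigram factorization in formula (1-1), using maximum-likelihood counts from a toy corpus (the helper names and the sentence-boundary tokens `<s>`/`</s>` are illustrative assumptions):

```python
from collections import Counter

def train_trigram(corpus):
    """MLE trigram and bigram-context counts from tokenized sentences,
    with <s> padding at the start and </s> at the end."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    return tri, bi

def sentence_prob(sent, tri, bi):
    """p(w_1..w_m) ~ product of p(w_i | w_{i-2}, w_{i-1}), formula (1-1)."""
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    p = 1.0
    for i in range(2, len(toks)):
        ctx = (toks[i - 2], toks[i - 1])
        if bi[ctx] == 0:          # unseen context -> zero probability (no smoothing)
            return 0.0
        p *= tri[(toks[i - 2], toks[i - 1], toks[i])] / bi[ctx]
    return p
```

A production model would add smoothing for unseen n-grams; this sketch only mirrors the bare formula.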
In step S14 of embodiment 2, two cosine-similarity-based algorithms are used to select keywords from the candidate keywords, which improves keyword diversity; the cosine similarity is computed as shown in formula (1-2). In this embodiment there are top_2n candidate keywords (the 2n candidates ranked highest by the N-gram model), from which the top n keywords are determined. The first algorithm is a max-sum similarity algorithm: the word vectors of the top_2n candidates are compared with the document, all combinations of n words are taken from the top_2n candidates, and the combination whose members are least similar to one another is selected as the keyword set, avoiding repetition among the extracted keywords and enhancing their diversity. The second algorithm uses Maximal Marginal Relevance (MMR), also based on cosine similarity, to create keywords or key phrases, as shown in formula (1-3).
cos(x, y) = ( Σ_{i=1}^{n} x_i · y_i ) / ( sqrt(Σ_{i=1}^{n} x_i²) · sqrt(Σ_{i=1}^{n} y_i²) )    (1-2)

In formula (1-2), x_i and y_i denote the i-th components of a candidate keyword's word vector and of the document vector of the document to be classified, respectively, and n is the dimensionality of the vectors.

MMR = argmax_{d_i ∈ R\S} [ λ · sim(Q, d_i) − (1 − λ) · max_{d_j ∈ S} sim(d_i, d_j) ]    (1-3)

In formula (1-3), Q represents the entire document, R is the initial candidate set ranked by relevance, S is the set of items already selected, and d_i is a candidate still in R\S; sim(Q, d_i) is the similarity between a candidate and the entire document, sim(d_i, d_j) is the similarity between a candidate and an already extracted item, and λ trades off relevance against redundancy.
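The greedy selection of formula (1-3) can be sketched in pure Python over candidate keyword vectors (a simplified sketch; `lam` plays the role of λ, and breaking ties by insertion order is an implementation detail, not part of the patent):

```python
import math

def _cos(x, y):
    """Cosine similarity, formula (1-2)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def mmr_select(doc_vec, candidates, top_n, lam=0.7):
    """Greedy MMR: at each step pick the candidate maximizing
    lam * relevance-to-document - (1 - lam) * redundancy-to-selected."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        best_kw, best_score = None, None
        for kw, vec in remaining.items():
            relevance = _cos(vec, doc_vec)
            redundancy = max((_cos(vec, candidates[s]) for s in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if best_score is None or score > best_score:
                best_kw, best_score = kw, score
        selected.append(best_kw)
        del remaining[best_kw]
    return selected
```

With a small λ, a near-duplicate of an already chosen keyword is penalized and a more diverse candidate wins.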
In embodiment 2, step S2 obtains the word embedding vectors of the document to be classified and of the keyword knowledge base through the second BERT model. The second BERT model has also been pre-trained, but its pre-training does not include the NSP task. Unlike the first BERT model of step S1, the second BERT model modifies key hyperparameters: the NSP training task is removed, and training uses a larger batch size and learning rate, e.g., a batch size increased from 256 to 8000. The second BERT model also uses dynamic masking in the MLM pre-training task, i.e., a new mask pattern is generated each time a sequence is fed to the model. As large amounts of data are continuously input, the model gradually adapts to different masking strategies and learns different language representations, so the word embedding vectors it produces carry rich semantics.
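Dynamic masking, as described above, regenerates the MLM mask each time a sequence is presented. A minimal token-level sketch (the 15% rate is the conventional BERT default, assumed here; real MLM also replaces some selected tokens with random or unchanged tokens, which this sketch omits):

```python
import random

def dynamic_mask(tokens, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Draw a fresh mask pattern for one pass over the sequence.
    Returns (masked_tokens, labels): labels hold the original token at
    masked positions (to be predicted) and None elsewhere (ignored in loss)."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

Calling this per epoch (rather than once at preprocessing time) is what distinguishes dynamic from static masking.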
In embodiment 2, step S3 fuses the word embedding vector of the document to be classified with the word embedding vectors of the keywords acquired in step S2. The fusion comprises two parts: first, the correlation between the document's word embedding vector and the keywords' word embedding vectors is captured through a multi-head attention mechanism to generate a semantic tag; second, the semantic tag is concatenated with the document's word embedding vector to form the fusion vector. Specifically, H_c ∈ R^{n×d_h} denotes the word embedding vector of the document to be classified output by the second BERT model, where n is the number of input words in the sentence and d_h is the output dimension of the second BERT model; H_K denotes the word embedding vectors of the keywords output by the second BERT model, where d_K is the dimension of a keyword's word embedding vector and d_K = d_h = 768.
Compared with the standard scaled dot-product attention mechanism, the multi-head attention mechanism adopted in embodiment 2 of the present invention more easily captures the correlation between the word embedding vector of the document to be classified and the word embedding vectors of the keyword knowledge base, and better captures latent information, so that the model attends more closely to the content of the keywords. The standard scaled dot-product attention is:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (1-4)

The multi-head attention mechanism is shown in formula (1-5):

MultiHead(Q, K, V) = Concat(head_1, ..., head_k) W^O,  head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (1-5)

where W_i^Q, W_i^K, W_i^V, and W^O represent parameters learned during model training, and k represents the number of attention heads. In embodiment 2, the attention calculation is shown in formula (1-6):
H_l = MultiHead(Q, K, V) + Q    (1-6)
where Q = H_c represents the document to be classified and K = V = H_K represents the keyword knowledge base; the semantic tag output by the attention mechanism is H_l. Finally, the word embedding vector H_c of the document to be classified and the semantic tag H_l are concatenated to obtain the output H of the fusion module, as shown in formula (1-7):

H = [H_c ; H_l]    (1-7)

H is input to the fully connected layer of the fourth module, and the document classification result is obtained from the output of the fully connected layer.
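For illustration, the attention-based fusion of formulas (1-4), (1-6), and (1-7) can be sketched with a single head in pure Python (a deliberately simplified, parameter-free sketch: the learned projections W_i^Q, W_i^K, W_i^V, W^O of the multi-head mechanism are omitted):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(row) for row in zip(*M)]

def attention(Q, K, V):
    """Scaled dot-product attention, formula (1-4), one head, no projections."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                   # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def fuse(H_c, H_K):
    """H_l = Attention(H_c, H_K, H_K) + H_c (cf. formula 1-6, residual),
    then H = [H_c ; H_l] along the feature axis (formula 1-7)."""
    att = attention(H_c, H_K, H_K)
    H_l = [[a + q for a, q in zip(ra, rq)] for ra, rq in zip(att, H_c)]
    return [rc + rl for rc, rl in zip(H_c, H_l)]
```

In the patented scheme the document embedding H_c serves as the query and the keyword embeddings H_K as keys and values, so each document token is re-expressed in terms of the keywords it attends to.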
To better demonstrate the document classification effect of embodiment 2 of the present invention, prior-art document classification models are selected as comparative examples and run on the same classification task over the same documents to be classified.
In comparative example 1, a standard Bert model was used.
Comparative example 2 adopts a Bert+DPCNN model. DPCNN stands for Deep Pyramid Convolutional Neural Networks for Text Categorization; it is a word-level document feature extraction model proposed by Tencent AI Lab.
Comparative example 3 employs a Bert+Keyword model, which first generates a domain-enhanced keyword dictionary using domain labels from a large corpus, and then adds a keyword-attentive Transformer layer on top of BERT to highlight the importance of the keywords in each query pair.
Comparative example 4 adopts the ERNIE model, Baidu's continual-learning semantic understanding framework based on knowledge enhancement. It combines big-data pre-training with rich multi-source knowledge and, through continual learning, keeps absorbing lexical, structural, and semantic knowledge from massive document data, so that the model's effectiveness evolves continuously.
In embodiment 2 of the invention, an early-stopping strategy is adopted during model training; after repeated parameter tuning and optimization, the best trained model is saved and then tested and evaluated. The specific tuning measures are: based on training duration, resource occupation, and stability of model convergence, after multiple experiments the batch size is set to 128, the initial learning rate lr to 0.00003, dropout to 0.1, the number of multi-head attention heads to 6, and the hidden dimension of the attention layer to 768.
The documents to be classified in the comparative experiments come from papers collected from the web. First, the collected documents are organized into folders by category; then a data mining package provided by Python is used to create a regularized extraction task, mining document content that conforms to the dataset format; finally, a script matches document content with categories to generate data annotations that conform to the model input. The dataset consists of three parts. The first part comprises the labels and their categories (covering both an information security level category and a content nature category), where the two classification standards represent categories by numbers in place of characters, e.g., the number 1 represents the "public" category. The second part is a summary of all categories classified by information security level, comprising 16 top-level categories such as public, secret, and restricted. The third part is a summary of all categories divided by content nature, comprising 10 broad categories such as science, economics, sports, entertainment, and politics. In total 10000 test cases are used, 1000 in each of the 10 broad classes.
The document classification effect of the control experiment is shown in table 1:
TABLE 1. Document classification comparative experiment results
Here P denotes the classification precision, R the recall, and F1 is computed as F1 = 2PR / (P + R). It can be seen that the document classification method provided by the embodiment of the present invention achieves higher precision, recall, and F1 than the baseline models used in the comparative examples. In particular, the ERNIE model of comparative example 4 strengthens semantics by constructing a knowledge graph to fuse external knowledge with linguistic semantic information, yet its classification performance in the experiment does not match that of the classification model constructed in this embodiment, showing that the model of this embodiment captures more of the critical boundary information needed for classification.
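The three evaluation metrics can be computed from true positive, false positive, and false negative counts as follows (a standard sketch; the function name is illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """P = tp/(tp+fp), R = tp/(tp+fn), F1 = 2PR/(P+R),
    with zero-division guarded as 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For multi-class evaluation over the 10 categories, these per-class values would typically be macro-averaged.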
The classification results of the ten major classes in the embodiment of the present invention are shown in table 2:
TABLE 2 Classification results for different categories of documents according to embodiment 2 of the present invention
The technical scheme of the embodiment of the present invention achieves a particularly good classification effect on long texts in the education and sports categories.
Example 3
Embodiment 3 of the present invention provides a document classification device, which includes a memory and a processor. The memory stores a computer program, and the processor calls the computer program in the memory to complete the document classification task according to the document classification method described in embodiment 2. Specifically, the document classification device in embodiment 3 is a smartphone. In other possible embodiments, the document classification device may also be another intelligent terminal device such as a personal computer, a desktop computer, a tablet computer, or a smart watch. The processor may be a chip built into the intelligent terminal, a cloud server, or another external server. It should be understood that any device capable of performing the document classification task according to the document classification method provided by the present invention falls within the scope of the present invention.
Example 4
Embodiment 4 of the present invention provides a computer-readable storage medium on which a computer program is stored. The computer program, when executed by a processor, performs a document classification task according to the document classification method described in embodiment 2. Specifically, the computer-readable medium in embodiment 4 is a portable hard disk. In other possible embodiments, the computer-readable medium may also be a storage device in another form, such as an optical disc, a computer memory card or a mobile phone memory card. It should be understood that a storage medium falls within the scope of the present invention as long as the program on it can complete the document classification task according to the document classification method provided by the present invention.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A method of classifying a document, comprising the steps of:
extracting keywords of the documents to be classified, and constructing a keyword knowledge base;
acquiring word embedding vectors of the documents to be classified and word embedding vectors of the keyword knowledge base;
fusing the word embedding vectors of the documents to be classified and the word embedding vectors of the keyword knowledge base through an attention mechanism to obtain a fusion vector;
and taking the fusion vector as the input of a fully connected layer, and obtaining the classification result of the document to be classified from the output of the fully connected layer.
2. The method for classifying documents according to claim 1, wherein the step of constructing a keyword knowledge base specifically comprises the steps of:
preprocessing a document to be classified;
extracting a document vector of the preprocessed document to be classified through a first Bert model, wherein the first Bert model is a pre-trained Bert model, and its pre-training process comprises two training tasks, MLM (masked language modeling) and NSP (next sentence prediction);
extracting word vectors of candidate keywords from the document vectors by using an N-gram model;
and calculating the similarity between the word vector of the candidate keyword and the document vector, and determining the keywords of the document to be classified according to the similarity to form a keyword knowledge base.
3. The method of claim 2, wherein the similarity between each candidate keyword and the document to be classified is determined by calculating the cosine value between the word vector of the candidate keyword and the document vector.
4. The document classification method according to claim 1, wherein a second Bert model is used to obtain the word embedding vectors of the documents to be classified and the word embedding vectors of the keyword knowledge base, the second Bert model is a pre-trained Bert model, and the pre-training process of the second Bert model does not include an NSP training task.
5. The document classification method according to claim 1, wherein the attention mechanism is a multi-head attention mechanism.
6. A document classification device, comprising:
the keyword generation module is used for extracting keywords of the documents to be classified and constructing a keyword knowledge base;
the word embedding module is used for acquiring word embedding vectors of the documents to be classified and word embedding vectors of the keyword knowledge base;
the fusion module is used for fusing the word embedding vectors of the documents to be classified and the word embedding vectors of the keyword knowledge base through an attention mechanism to obtain a fusion vector;
and the fully connected layer is used for acquiring the classification result of the document to be classified from the fusion vector.
7. The apparatus of claim 6, wherein the keyword generation module comprises:
the document embedding module is used for extracting a document vector of a preprocessed document to be classified, the document embedding module comprises a first Bert model which is pre-trained, and the pre-training process of the first Bert model comprises an MLM (masked language modeling) training task and an NSP (next sentence prediction) training task;
the candidate keyword generation module comprises an N-gram model, and the N-gram model is used for extracting word vectors of the candidate keywords from the document vectors;
and the keyword selection module is used for calculating cosine similarity between word vectors of the candidate keywords and document vectors and determining the keywords of the documents to be classified according to the cosine similarity.
8. The apparatus according to claim 6, wherein the word embedding module includes a second Bert model that has been pre-trained, the pre-training process of the second Bert model includes an MLM training task but not an NSP training task, and the MLM training task uses a dynamic mask method.
9. A document classification apparatus, characterized in that the document classification apparatus comprises: a memory and at least one processor, the memory having a computer program stored therein; the at least one processor invokes the computer program in the memory to cause the document classification apparatus to perform the document classification method of any one of claims 1-5.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the document classification method according to any one of claims 1 to 5.
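To illustrate the candidate-keyword extraction of claim 2, the sketch below enumerates N-gram candidates over a tokenised document. It is illustrative only: the helper name is hypothetical, and the claimed method applies the N-gram model to Bert vectors rather than raw tokens.

```python
def ngram_candidates(tokens, n_max=3):
    # Enumerate all 1..n_max-grams of the token sequence as candidate
    # keywords, grouped by increasing n.
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]
```

Each candidate would then be embedded and scored against the document vector, as in claim 3.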
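The cosine-value similarity of claim 3 can be sketched as follows (a generic cosine-similarity routine, not code from the embodiment):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors:
    # dot(u, v) / (|u| * |v|), in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

Candidate keywords would be ranked by `cosine(word_vector, document_vector)` in descending order, and the top-ranked candidates retained for the keyword knowledge base.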
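The attention fusion of claims 1 and 5 can be sketched with single-head scaled dot-product attention (claim 5 uses a multi-head mechanism; a single head is shown for brevity, and the function name and concatenation-style fusion are assumptions for illustration):

```python
import numpy as np

def attention_fuse(doc_vecs, kw_vecs):
    # Document word embeddings (queries) attend over keyword-knowledge-base
    # embeddings (keys/values); the attended result is concatenated onto the
    # document embeddings to form the fusion vector.
    d = doc_vecs.shape[-1]
    scores = doc_vecs @ kw_vecs.T / np.sqrt(d)           # (n_doc, n_kw)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # row-wise softmax
    attended = w @ kw_vecs                               # (n_doc, d)
    return np.concatenate([doc_vecs, attended], axis=-1)
```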
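The fully connected classification layer of claims 1 and 6 can be sketched as a linear transform followed by a softmax (the weights `W` and bias `b` are hypothetical stand-ins for trained parameters):

```python
import numpy as np

def classify(fusion_vec, W, b):
    # Fully connected layer followed by softmax; returns the predicted
    # class index and the class probability distribution.
    logits = fusion_vec @ W + b
    e = np.exp(logits - logits.max())    # subtract max for stability
    probs = e / e.sum()
    return int(np.argmax(probs)), probs
```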
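The dynamic mask method of claim 8 can be sketched at the token level: a fresh mask pattern is drawn on every call, so each MLM training epoch sees different positions masked, unlike a static mask fixed once during preprocessing. The helper below is an illustrative simplification (real MLM masking also substitutes random or unchanged tokens for a fraction of the masked positions).

```python
import random

def dynamic_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    # Independently mask each token with probability mask_rate;
    # re-sampling on each call makes the masking "dynamic".
    return [mask_token if random.random() < mask_rate else t
            for t in tokens]
```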
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210650047.7A CN115048515A (en) | 2022-06-09 | 2022-06-09 | Document classification method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115048515A true CN115048515A (en) | 2022-09-13 |
Family
ID=83161647
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115048515A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115422125A (en) * | 2022-09-29 | 2022-12-02 | 浙江星汉信息技术股份有限公司 | Electronic document automatic filing method and system based on intelligent algorithm |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107491541A (en) * | 2017-08-24 | 2017-12-19 | 北京丁牛科技有限公司 | File classification method and device |
CN109918500A (en) * | 2019-01-17 | 2019-06-21 | 平安科技(深圳)有限公司 | File classification method and relevant device based on convolutional neural networks |
CN110781312A (en) * | 2019-09-19 | 2020-02-11 | 平安科技(深圳)有限公司 | Text classification method and device based on semantic representation model and computer equipment |
CN110929034A (en) * | 2019-11-26 | 2020-03-27 | 北京工商大学 | Commodity comment fine-grained emotion classification method based on improved LSTM |
CN111881291A (en) * | 2020-06-19 | 2020-11-03 | 山东师范大学 | Text emotion classification method and system |
CN113449085A (en) * | 2021-09-02 | 2021-09-28 | 华南师范大学 | Multi-mode emotion classification method and device and electronic equipment |
WO2021189974A1 (en) * | 2020-10-21 | 2021-09-30 | 平安科技(深圳)有限公司 | Model training method and apparatus, text classification method and apparatus, computer device and medium |
CN113569002A (en) * | 2021-02-01 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Text search method, device, equipment and storage medium |
CN113761184A (en) * | 2020-09-29 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Text data classification method, equipment and storage medium |
CN113901821A (en) * | 2021-10-19 | 2022-01-07 | 平安科技(深圳)有限公司 | Entity naming identification method, device, equipment and storage medium |
CN113934837A (en) * | 2021-09-14 | 2022-01-14 | 达而观数据(成都)有限公司 | Key phrase generation method and device based on pre-training model and storage medium |
CN113987188A (en) * | 2021-11-10 | 2022-01-28 | 重庆邮电大学 | Short text classification method and device and electronic equipment |
CN114266252A (en) * | 2021-11-18 | 2022-04-01 | 青岛海尔科技有限公司 | Named entity recognition method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||