CN116541527B - Document classification method based on model integration and data expansion - Google Patents


Info

Publication number
CN116541527B
Authority
CN
China
Prior art keywords
data
text
model
characters
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310817468.9A
Other languages
Chinese (zh)
Other versions
CN116541527A (en)
Inventor
汪海涛
代志强
李炳辉
刘会龙
许禹诺
陈斌发
李晖
王登政
郭大勇
林凯
闫晓栋
费长顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sgitg Accenture Information Technology Co ltd
State Grid Beijing Electric Power Co Ltd
Original Assignee
Beijing Sgitg Accenture Information Technology Co ltd
State Grid Beijing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sgitg Accenture Information Technology Co ltd, State Grid Beijing Electric Power Co Ltd filed Critical Beijing Sgitg Accenture Information Technology Co ltd
Priority to CN202310817468.9A
Publication of CN116541527A
Application granted
Publication of CN116541527B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of intelligent document classification, and in particular relates to a document classification method based on model integration and data expansion. Documents from a plurality of departments are collected, and each is labeled with the corresponding department name; the data set is analyzed to remove dirty data, sample statistics are computed, and small-sample and large-batch labels are determined; department-associated hot-word lists are created, small samples are expanded, and large batches of samples are pruned; BiLSTM, ALBERT, and XLNet models are built separately; and intelligent document classification is realized through logic judgment. By analyzing the data and establishing department high-frequency word lists, text features are screened with high-frequency words, enabling rapid data-set expansion and sample balancing. The method overcomes the overlong-text problem and the high time consumption of single-model deployment; based on model integration and logic screening, accuracy and inference time are traded against each other to comprehensively realize an intelligent document classification technique with reasonable operability, high accuracy, and short inference time.

Description

Document classification method based on model integration and data expansion
Technical Field
The invention belongs to the field of intelligent document classification, and particularly relates to a document classification method based on model integration and data expansion.
Background
An enterprise circulates a huge number of documents of all kinds every day, and these documents must be routed precisely to dozens of working departments. Staff spend a great deal of time reading and sorting them so that each document reaches the relevant department, and this heavy workload and long turnaround time constantly hurt operational timeliness. Artificial intelligence can address this need: by analyzing the semantic information of a document's content, documents can be classified and routed to the relevant departments automatically, meeting the requirement for intelligent assisted classification and circulation of documents.
Document files include letters, notices, work contact sheets, and the like, and the format, length, and content of documents differ greatly across departments. When modeling with traditional small models such as RNN, TextCNN, or BiLSTM, the network structure is simple, the parameter count is small, computation is fast, and accuracy is high, making them suitable for classifying short documents; but when documents of thousands or even tens of thousands of characters are used for modeling and prediction, the difficulty of extracting semantic features from overlong text makes misrouting highly likely. Large models such as BERT and ALBERT overcome the weakness of the classical RNN family, which recognizes short documents well but long documents poorly, yet they can only process documents of up to 512 characters and therefore cannot fully meet the requirement. XLNet can model ultra-long text, but its semantic recognition degrades for texts exceeding 2048 characters; moreover, because of their large parameter counts, BERT, ALBERT, and XLNet are computationally expensive at prediction time, so inference queues build up when ultra-long texts arrive continuously. None of these three modeling approaches, used alone, can achieve fast and accurate intelligent document classification.
Moreover, a company may have many departments, and the number of documents associated with each varies widely; some documents are associated with two or even three departments, and the document count of one department may differ from another's by several times or even tens of times. Such sample-count disparities strongly affect model training. The common data-enhancement techniques for sample balancing are text back-translation and random word-order shuffling, but domain documents cannot tolerate near-synonym replacement of their content, so back-translation and word-order shuffling are unusable here; sample balancing and data enhancement must instead be achieved quickly by expanding the small-sample data of domain documents. It is therefore necessary to research and design an intelligent document classification technique with reasonable operability, improved accuracy, and reduced time consumption that also solves the problems of small document samples and sample balancing across many departments.
Disclosure of Invention
The invention provides a document classification method based on model integration and data expansion, aiming to solve the problems that existing classification techniques have low accuracy and high time consumption, cannot achieve fast and accurate intelligent document classification, and cannot achieve sample balancing.
In order to achieve the above purpose, the invention proposes the following technical scheme:
a document classification method based on model integration and data expansion comprises the following steps:
step 1, collecting a plurality of department files, establishing a data set, and taking the department names as labels of the corresponding files;
step 2, deleting dirty data in the data set, and screening out small sample tags with the sample number smaller than 50 and large sample tags with the sample number larger than 300 in the data set;
step 3, establishing high-frequency word lists of different departments, expanding small sample label data according to the high-frequency word lists of different departments, and deleting large sample label data; obtaining a processed data set;
step 4, modeling respectively with BiLSTM, ALBERT, and XLNet according to the processed data set to obtain a BiLSTM model, an ALBERT model, and an XLNet model which can be used for classification.
Preferably, in step 1, text length and text content are used as features for screening texts, the department names are used as labels to mark the collected documents, and a data set is formed in which each document corresponds to one or more labels.
Preferably, in step 2, deleting dirty data from the data set specifically comprises deleting department files containing fewer than 50 or more than 10000 characters.
Preferably, step 2 further comprises, after the dirty data is deleted, counting the number of samples contained under each label and calculating the mean of the samples contained under all labels.
Preferably, in step 3, high-frequency word lists of different departments are constructed by using TF-IDF modeling.
Preferably, in step 3, expanding the small-sample label data specifically comprises:
splitting the text into sentences, judging whether each sentence contains a high-frequency word, retaining the sentence as a text feature if it does, and splicing the sentences containing high-frequency words to generate a new sample.
Preferably, in step 3, expanding the small-sample label data further comprises:
directly segmenting and splicing the new samples, so that the small-sample data is amplified to 6 times the original amount.
Preferably, deleting the large-sample label data specifically comprises:
counting the high-frequency words contained in each single text under the large-sample labels, and deleting texts containing fewer than 10 high-frequency words, as the first deletion;
and, taking the sample mean as the target, deleting samples whose text length exceeds 4096 characters, as the second deletion.
Preferably, the number of high frequency words contained in a single text is counted by TF-IDF.
Preferably, the method further comprises step 5: collecting target text data and calling the corresponding model for classification according to the target text data; specifically:
the BiLSTM model processes texts shorter than 256 characters; the ALBERT model processes texts longer than 256 and shorter than 512 characters; and the XLNet model processes texts longer than 512 and shorter than 2048 characters.
The invention has the advantages that:
Establishing high-frequency word lists through data analysis and screening text features with high-frequency words enables small-sample data expansion and large-batch sample deletion, realizing rapid data-set expansion and sample balancing. Meanwhile, a multi-model integration approach is adopted for modeling, and the screened texts are processed in a targeted manner, which improves accuracy and prediction speed, solves the modeling difficulty caused by text-length differences, and comprehensively realizes an intelligent document classification technique with reasonable operability, high accuracy, and short inference time.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a schematic flow diagram of a document classification method based on model integration and data expansion;
FIG. 2 is a schematic diagram of the BiLSTM model's F1-score index over training iterations;
FIG. 3 is a schematic diagram of the BERT model's F1-score index over training iterations;
FIG. 4 is a schematic diagram of the XLNet model's F1-score index over training iterations.
Detailed Description
The invention will be described in detail below with reference to the drawings in connection with embodiments. It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The following detailed description is exemplary and is intended to provide further details of the invention. Unless defined otherwise, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the invention.
Example 1:
referring to fig. 1, the invention provides a document classification method based on model integration and data expansion, which specifically comprises the following steps:
Step 1, acquiring files from a plurality of departments and labeling each with its department name to obtain a data set;
Step 11, collecting and sorting documents from the different departments of an enterprise. The collected documents must cover all the file characteristics of each department; text length and text content are used as features when screening and collecting documents, and the numbers of texts with different characteristics are kept balanced;
Step 12, labeling the collected documents with their department names to form a data set in which each document corresponds to one or more labels.
Step 2, cleaning the dirty data in the data set, computing sample statistics, and determining the small-sample and large-batch labels;
Step 21, analyzing the data set and deleting dirty data: counting the number of characters in each sample and deleting documents containing fewer than 50 or more than 10000 characters;
Step 22, counting the number of samples contained under each label, calculating the mean over all labels (the target value for sample balancing), and screening out small-sample labels with fewer than 50 samples and large-sample labels with more than 300 samples.
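For illustration only (this sketch is not part of the original disclosure, and the record layout and function names are assumptions), steps 21 and 22 can be expressed in Python as follows; the thresholds of 50/10000 characters and 50/300 samples are those stated above:

    from collections import Counter

    # Sketch of steps 21-22; each record is assumed to be (text, [labels]).
    def clean_and_screen(records, min_chars=50, max_chars=10000,
                         small_max=50, large_min=300):
        # Step 21: delete dirty data -- documents that are too short or too long.
        cleaned = [(t, labs) for t, labs in records
                   if min_chars <= len(t) <= max_chars]
        # Step 22: count samples per label and compute the balancing target.
        counts = Counter(lab for _, labs in cleaned for lab in labs)
        mean_count = sum(counts.values()) / len(counts)  # target value for sample balancing
        small = {lab for lab, n in counts.items() if n < small_max}  # labels to expand
        large = {lab for lab, n in counts.items() if n > large_min}  # labels to prune
        return cleaned, counts, mean_count, small, large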
Step 3, creating department-associated hot-word lists, expanding the small samples, and pruning the large batches of samples;
Step 31, analyzing and comparing text contents. Industrial document content is strongly domain-specific, and some document keywords are highly salient. For example, documents of a power grid company's equipment department contain: electric pole, transformer, insulator, equipment fault, equipment abnormality, flange; documents of the development planning department contain: substation planning, power plant construction, wind farm construction, central station, and quality monitoring station. Document texts of different departments thus contain different domain-specific high-frequency words. TF-IDF modeling is used to extract the high-frequency words with characteristic directivity for each category and to construct hot-word lists for the different departments of the industry, and these departmental hot words serve as the condition for sample expansion and deletion;
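As an illustrative sketch only, the hot-word lists could be built with scikit-learn's TfidfVectorizer as below; the top_k cutoff and the per-department pseudo-document construction are assumptions, since the patent only states that TF-IDF modeling is used, and Chinese text would normally be word-segmented (e.g. with jieba) before vectorizing:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # dept_texts: dict mapping department name -> list of its document texts.
    def build_hotword_lists(dept_texts, top_k=100):
        depts = list(dept_texts)
        # One pseudo-document per department, so IDF down-weights words shared across departments.
        corpus = [" ".join(texts) for texts in dept_texts.values()]
        vec = TfidfVectorizer()
        tfidf = vec.fit_transform(corpus)
        vocab = np.array(vec.get_feature_names_out())
        hotwords = {}
        for i, dept in enumerate(depts):
            row = tfidf.getrow(i).toarray().ravel()
            hotwords[dept] = set(vocab[np.argsort(row)[::-1][:top_k]])  # top-k directional words
        return hotwords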
Step 32, small-sample data expansion. First step: split the text into sentences and judge whether each sentence contains a high-frequency word; if it does, retain the sentence as a text feature, and finally splice the sentences containing high-frequency words to generate a new sample. This preserves sentence meaning and the department's text features, achieving a first round of expansion;
Second step: direct segmentation and splicing of the small samples achieves a second round of expansion. For example, the first 128 and last 128 characters of the original text are retained (texts whose total length is less than 128 are segmented in units of 32), the middle part is cut into four segments, and random combinations of the head, the tail, and the four middle segments generate four samples of the same label with different contents. At this point the small-sample data has been amplified to 6 times the original amount;
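Both expansion rounds can be sketched as follows (illustrative only; the sentence delimiters, the recombination policy, and the function names are assumptions consistent with the example above):

    import random
    import re

    # First round: keep only sentences containing a departmental high-frequency word.
    def expand_by_hotwords(text, hotwords):
        sentences = re.split(r"[。！？!?\n]", text)
        kept = [s for s in sentences if any(w in s for w in hotwords)]
        return "。".join(kept) if kept else None

    # Second round: retain head and tail, cut the middle into four segments,
    # and recombine at random to create same-label samples with differing content.
    def expand_by_splicing(text, n_new=4):
        head, tail, mid = text[:128], text[-128:], text[128:-128]
        step = max(1, len(mid) // 4) if mid else 1
        segs = [mid[i:i + step] for i in range(0, len(mid), step)][:4]
        samples = []
        for _ in range(n_new):
            chosen = random.sample(segs, k=min(2, len(segs))) if segs else []
            samples.append(head + "".join(chosen) + tail)
        return samples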
Step 33, pruning the large batches of samples. TF-IDF is used to count the high-frequency words contained in each single text under a large-sample label; texts containing fewer than 10 high-frequency words are deleted as the first step, and, taking the sample mean as the target, samples whose text length exceeds 4096 characters are deleted as the second step. Together these steps make the number of samples contained in each category of the final data set approximately equal.
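A sketch of this pruning step, reusing the hot-word list from step 31 (illustrative; the function name and the target_count parameter, taken to be the sample mean from step 22, are assumptions):

    # Prune one large-sample label toward the balancing target.
    def prune_large_label(samples, hotwords, target_count,
                          min_hotwords=10, max_chars=4096):
        # First deletion: drop texts carrying too few departmental high-frequency words.
        kept = [t for t in samples
                if sum(t.count(w) for w in hotwords) >= min_hotwords]
        # Second deletion: drop overlong texts while the label is still above the target mean.
        kept.sort(key=len)  # longest texts sit at the end
        while len(kept) > target_count and len(kept[-1]) > max_chars:
            kept.pop()
        return kept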
Step 4, modeling with BiLSTM, ALBERT, and XLNet respectively;
Step 41, modeling with BiLSTM: the prediction accuracy for texts shorter than 256 characters is about 80%. F1-score is used as the model evaluation index, and the F1-score curve over training iterations is shown in FIG. 2.
Step 42, modeling with BERT: the prediction accuracy for texts longer than 256 and shorter than 512 characters is about 76%. F1-score is used as the model evaluation index, and the F1-score curve over training iterations is shown in FIG. 3.
Step 43, modeling with XLNet: the prediction accuracy for texts longer than 512 and shorter than 2048 characters is about 82%. F1-score is used as the model evaluation index, and the F1-score curve over training iterations is shown in FIG. 4.
Step 44, comparing the three independently built models: the F1-score evaluation curves for texts of different lengths level off after 20 to 30 training iterations. Each model was then called on a CPU server to independently predict the categories of the same 100 documents (distributed across text lengths below 256 characters, between 256 and 512 characters, and between 512 and 2048 characters).
With BiLSTM, predicting a document shorter than 256 characters takes about 0.1 second per sample, while a single text longer than 256 characters takes about 1 second; as text length grows, time consumption rises and prediction accuracy falls. Modeled independently, the total time for the 100 samples is about 100 seconds, and the prediction accuracy is about 48%.
With ALBERT, inference on a document shorter than 512 characters takes about 1.5 seconds per sample; for texts longer than 512 characters, the length limit degrades feature extraction and lowers accuracy. The total inference time for the 100 samples is about 150 seconds, and the prediction accuracy is about 56%: 8 percentage points higher than BiLSTM modeling, at 1.5 times the inference time.
With XLNet, inference on a document shorter than 1024 characters takes about 2 seconds per sample; for single texts longer than 1024 characters, inference time grows with text length up to about 12 seconds, and the prediction accuracy rises to about 70%. The total inference time for the 100 samples is about 1200 seconds. XLNet's modeling accuracy is the highest of the three, but its inference takes roughly 8 times as long as ALBERT's and 12 times as long as BiLSTM's.
Step 5, after the models are built, intelligent document classification is realized through logic judgment and integrated model deployment;
Logic judgment selects the appropriate model for predicting documents of different lengths: when the incoming text is shorter than 256 characters, BiLSTM is called; when it is longer than 256 and shorter than 512 characters, the ALBERT model is called, overcoming BiLSTM's poor handling of long text by trading time for accuracy; and when it is longer than 512 and shorter than 2048 characters, the XLNet model is called, overcoming both BiLSTM's poor handling of long text and ALBERT's 512-character limit, again trading time for accuracy.
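This logic judgment amounts to a simple length-based dispatch, sketched below (the model objects and their predict() interface are assumptions; the thresholds are those stated above):

    # Route a document to the model suited to its length (step 5 logic judgment).
    def classify(text, bilstm_model, albert_model, xlnet_model):
        n = len(text)
        if n < 256:
            return bilstm_model.predict(text)   # fast path for short documents
        elif n < 512:
            return albert_model.predict(text)   # trades time for accuracy
        elif n < 2048:
            return xlnet_model.predict(text)    # handles ultra-long text
        # Texts of 2048+ characters fall outside the stated ranges; truncating for
        # XLNet is one plausible fallback (an assumption, not from the patent).
        return xlnet_model.predict(text[:2048])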
Each of the three models deployed alone is either slow at inference or, when inference is fast, low in accuracy, and none fully meets the requirements of online document classification. BiLSTM, ALBERT, and XLNet are therefore integrated to realize document classification, with the logic screening applied before a text is passed to a model. Routing the same 100 documents through this logic-based screening, the total inference time is about 420 seconds and the accuracy is about 78%: trading time and accuracy against each other shortens overall time while raising overall accuracy. The accuracy is higher than that of BiLSTM, ALBERT, or XLNet modeled independently, the inference time is about 1/3 of that of XLNet alone, and with GPU acceleration the inference time drops to about 1/4 of the CPU time.
The invention has the advantages that:
based on artificial intelligence, intelligent text classification is realized, and manual turning classification circulation official documents are replaced.
The problems of overlong text and time consumption in single model deployment are broken through, and the intelligent document classification technology with relatively reasonable operability, high accuracy and short reasoning time is comprehensively realized by means of accuracy and reasoning time interchange based on model integration and logic screening.
The high-frequency word list is established through data analysis, the text features are screened by utilizing the high-frequency words, so that the small sample data expansion and the large sample deletion are realized, and the problems of rapid expansion of a data set and sample equalization are realized.
It will be appreciated by those skilled in the art that the present invention can be carried out in other embodiments without departing from the spirit or essential characteristics thereof. Accordingly, the above disclosed embodiments are illustrative in all respects, and not exclusive. All changes that come within the scope of the invention or equivalents thereto are intended to be embraced therein.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (7)

1. A document classification method based on model integration and data expansion is characterized by comprising the following steps:
step 1, collecting a plurality of department files, establishing a data set, and taking the department names as labels of the corresponding files;
step 2, deleting dirty data in the data set, and screening out small sample tags with the sample number smaller than 50 and large sample tags with the sample number larger than 300 in the data set;
step 3, establishing high-frequency word lists of different departments, expanding small sample label data according to the high-frequency word lists of different departments, and deleting large sample label data; obtaining a processed data set;
step 4, modeling respectively with BiLSTM, ALBERT, and XLNet according to the processed data set to obtain a BiLSTM model, an ALBERT model, and an XLNet model which can be used for classification;
in step 3, expanding the small-sample label data specifically comprises:
splitting the text into sentences, judging whether each sentence contains a high-frequency word, retaining the sentence as a text feature if it does, and splicing the sentences containing high-frequency words to generate a new sample;
in step 3, expanding the small-sample label data further comprises:
directly segmenting and splicing the new samples, so that the small-sample data is amplified to 6 times the original amount;
step 5, collecting target text data and calling the corresponding model for classification according to the target text data; specifically:
the BiLSTM model is used for processing texts shorter than 256 characters; the ALBERT model is used for processing texts longer than 256 and shorter than 512 characters; and the XLNet model is used for processing texts longer than 512 and shorter than 2048 characters.
2. The document classification method based on model integration and data expansion according to claim 1, wherein in step 1, text length and text content are used as features to screen texts, the department names are used as labels to mark the collected documents, and a data set is formed in which each document corresponds to one or more labels.
3. The document classification method based on model integration and data expansion according to claim 1, wherein in step 2, deleting dirty data from the data set specifically comprises deleting department files containing fewer than 50 or more than 10000 characters.
4. The document classification method based on model integration and data expansion according to claim 1, further comprising, in step 2, counting the number of samples contained under each label and calculating the mean of the samples contained under all labels after the dirty data is deleted.
5. The document classification method based on model integration and data expansion of claim 1, wherein in step 3, high frequency word lists of different departments are constructed by using TF-IDF modeling.
6. The document classification method based on model integration and data expansion according to claim 1, wherein deleting the large-sample label data specifically comprises:
counting the high-frequency words contained in each single text under the large-sample labels, and deleting texts containing fewer than 10 high-frequency words to realize the first deletion;
and, taking the sample mean as the target, deleting samples whose text length exceeds 4096 characters to realize the second deletion.
7. The document classification method based on model integration and data expansion of claim 6, wherein the number of high frequency words contained in a single text is counted by TF-IDF.
CN202310817468.9A 2023-07-05 2023-07-05 Document classification method based on model integration and data expansion Active CN116541527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310817468.9A CN116541527B (en) 2023-07-05 2023-07-05 Document classification method based on model integration and data expansion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310817468.9A CN116541527B (en) 2023-07-05 2023-07-05 Document classification method based on model integration and data expansion

Publications (2)

Publication Number Publication Date
CN116541527A CN116541527A (en) 2023-08-04
CN116541527B true CN116541527B (en) 2023-09-29

Family

ID=87454535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310817468.9A Active CN116541527B (en) 2023-07-05 2023-07-05 Document classification method based on model integration and data expansion

Country Status (1)

Country Link
CN (1) CN116541527B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354818A (en) * 2016-08-30 2017-01-25 电子科技大学 Dynamic user attribute extraction method based on social media
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A Chinese word segmentation method based on word-linked characters
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
CN114428854A (en) * 2021-12-20 2022-05-03 成都信息工程大学 Variable-length text classification method based on length normalization and active learning
CN115858774A (en) * 2022-06-08 2023-03-28 北京中关村科金技术有限公司 Data enhancement method and device for text classification, electronic equipment and medium
CN116150010A (en) * 2023-02-23 2023-05-23 南京慕测信息科技有限公司 Test case classification method based on ship feature labels

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8805861B2 (en) * 2008-12-09 2014-08-12 Google Inc. Methods and systems to train models to extract and integrate information from data sources


Also Published As

Publication number Publication date
CN116541527A (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN107273295B (en) Software problem report classification method based on text chaos
CN110910175B (en) Image generation method for travel ticket product
CN113468317B (en) Resume screening method, system, equipment and storage medium
KR20200127557A (en) A program recording midium for an automatic sentiment information labeling method to news articles for providing sentiment information
CN112685374B (en) Log classification method and device and electronic equipment
CN114491034B (en) Text classification method and intelligent device
CN115600109A (en) Sample set optimization method and device, equipment, medium and product thereof
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN112579730A (en) High-expansibility multi-label text classification method and device
KR20200127587A (en) A program for an automatic sentiment information labeling to news articles for providing sentiment information
KR20200127553A (en) An automatic sentiment information labeling method to news articles for providing sentiment information
CN116541527B (en) Document classification method based on model integration and data expansion
CN115618264A (en) Method, apparatus, device and medium for topic classification of data assets
KR20200127555A (en) A program for an automatic sentiment information labeling to news articles for providing sentiment information
KR20200127636A (en) A program recording midium for an automatic sentiment information labeling to news articles for providing sentiment information
CN111400375A (en) Business opportunity mining method and device based on financial service data
CN110895564A (en) Potential customer data processing method and device
KR20200127552A (en) An automatic sentiment information labeling method to news articles for providing sentiment information and an apparatus using it
KR20200127670A (en) An apparatus for an automatic sentiment information labeling method to news articles for providing sentiment information
KR20200127590A (en) An apparatus for automatic sentiment information labeling to news articles
KR20200127654A (en) A operating method for an automatic sentiment information labeling apparatus to news articles
KR20200127589A (en) An apparatus for automatic sentiment information labeling to news articles
KR20210001693A (en) A rcording media for recording program for providing a corporate insolvencies information based on automatic sentiment information labelings
KR102609132B1 (en) Method and apparatus for automatically constructing sentiment dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant