CN116541527B - Document classification method based on model integration and data expansion - Google Patents


Info

Publication number
CN116541527B
Authority
CN
China
Prior art keywords
data
text
model
characters
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310817468.9A
Other languages
Chinese (zh)
Other versions
CN116541527A (en)
Inventor
汪海涛
代志强
李炳辉
刘会龙
许禹诺
陈斌发
李晖
王登政
郭大勇
林凯
闫晓栋
费长顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sgitg Accenture Information Technology Co ltd
State Grid Beijing Electric Power Co Ltd
Original Assignee
Beijing Sgitg Accenture Information Technology Co ltd
State Grid Beijing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sgitg Accenture Information Technology Co ltd, State Grid Beijing Electric Power Co Ltd filed Critical Beijing Sgitg Accenture Information Technology Co ltd
Priority to CN202310817468.9A
Publication of CN116541527A
Application granted
Publication of CN116541527B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of intelligent document classification, and in particular relates to a document classification method based on model integration and data expansion. Documents from a plurality of departments are collected, and each is labeled with the corresponding department name; the data set is analyzed to remove dirty data, sample statistics are computed, and small-sample and large-batch labels are determined; department-associated hot-word lists are created, small samples are expanded, and large batches of samples are pruned; BiLSTM, ALBERT, and XLNet models are built separately; and intelligent document classification is realized through logic judgment. By analyzing the data and establishing department high-frequency word lists, text features are screened with high-frequency words, enabling rapid data-set expansion and sample balancing. The method overcomes the overlong-text problem and the high time consumption of single-model deployment; based on model integration and logic screening, accuracy and inference time are traded against each other to comprehensively realize an intelligent document classification technique with reasonable operability, high accuracy, and short inference time.

Description

Document classification method based on model integration and data expansion
Technical Field
The invention belongs to the field of intelligent document classification, and particularly relates to a document classification method based on model integration and data expansion.
Background
An enterprise circulates a huge number of documents of all kinds every day, and these documents must be routed precisely to dozens of working departments. Staff spend a great deal of time reading and sorting them so that each document reaches the relevant department, and this heavy workload and long turnaround time constantly hurt operational timeliness. Artificial intelligence can address this need: by analyzing the semantic information of a document's content, documents can be classified and routed to the relevant departments automatically, meeting the requirement for intelligent assisted classification and circulation of documents.
Document files include letters, notices, work contact sheets, and the like, and the format, length, and content of documents differ greatly across departments. When modeling with traditional small models such as RNN, TextCNN, or BiLSTM, the network structure is simple, the parameter count is small, computation is fast, and accuracy is high, making them suitable for classifying short documents; but when documents of thousands or even tens of thousands of characters are used for modeling and prediction, the difficulty of extracting semantic features from overlong text makes misrouting highly likely. Large models such as BERT and ALBERT overcome the weakness of the classical RNN family, which recognizes short documents well but long documents poorly, yet they can only process documents of up to 512 characters and therefore cannot fully meet the requirement. XLNet can model ultra-long text, but its semantic recognition degrades for texts exceeding 2048 characters; moreover, because of their large parameter counts, BERT, ALBERT, and XLNet are computationally expensive at prediction time, so inference queues build up when ultra-long texts arrive continuously. None of these three modeling approaches, used alone, can achieve fast and accurate intelligent document classification.
Moreover, a company may have many departments, and the number of documents associated with each varies widely; some documents are associated with two or even three departments, and the document count of one department may differ from another's by several times or even tens of times. Such sample-count disparities strongly affect model training. The common data-enhancement techniques for sample balancing are text back-translation and random word-order shuffling, but domain documents cannot tolerate near-synonym replacement of their content, so back-translation and word-order shuffling are unusable here; sample balancing and data enhancement must instead be achieved quickly by expanding the small-sample data of domain documents. It is therefore necessary to research and design an intelligent document classification technique with reasonable operability, improved accuracy, and reduced time consumption that also solves the problems of small document samples and sample balancing across many departments.
Disclosure of Invention
The invention provides a document classification method based on model integration and data expansion, aiming to solve the problems that existing classification techniques have low accuracy and high time consumption, cannot achieve fast and accurate intelligent document classification, and cannot achieve sample balancing.
In order to achieve the above purpose, the invention proposes the following technical scheme:
a document classification method based on model integration and data expansion comprises the following steps:
step 1, collecting a plurality of department files, establishing a data set, and taking the department names as labels of the corresponding files;
step 2, deleting dirty data in the data set, and screening out small sample tags with the sample number smaller than 50 and large sample tags with the sample number larger than 300 in the data set;
step 3, establishing high-frequency word lists of different departments, expanding small sample label data according to the high-frequency word lists of different departments, and deleting large sample label data; obtaining a processed data set;
step 4, modeling respectively with BiLSTM, ALBERT, and XLNet according to the processed data set to obtain a BiLSTM model, an ALBERT model, and an XLNet model which can be used for classification.
Preferably, in step 1, text length and text content are used as features for screening texts, the department names are used as labels to mark the collected documents, and a data set is formed in which each document corresponds to one or more labels.
Preferably, in step 2, deleting dirty data from the data set specifically comprises deleting department files containing fewer than 50 or more than 10000 characters.
Preferably, step 2 further comprises, after the dirty data is deleted, counting the number of samples contained under each label and calculating the mean of the samples contained under all labels.
Preferably, in step 3, high-frequency word lists of different departments are constructed by using TF-IDF modeling.
Preferably, in step 3, expanding the small-sample label data specifically comprises:
splitting the text into sentences, judging whether each sentence contains a high-frequency word, retaining the sentence as a text feature if it does, and splicing the sentences containing high-frequency words to generate a new sample.
Preferably, in step 3, expanding the small-sample label data further comprises:
directly segmenting and splicing the new samples, so that the small-sample data is amplified to 6 times the original amount.
Preferably, deleting the large-sample label data specifically comprises:
counting the high-frequency words contained in each single text under the large-sample labels, and deleting texts containing fewer than 10 high-frequency words, as the first deletion;
and, taking the sample mean as the target, deleting samples whose text length exceeds 4096 characters, as the second deletion.
Preferably, the number of high frequency words contained in a single text is counted by TF-IDF.
Preferably, the method further comprises step 5: collecting target text data and calling the corresponding model for classification according to the target text data; specifically:
the BiLSTM model processes texts shorter than 256 characters; the ALBERT model processes texts longer than 256 and shorter than 512 characters; and the XLNet model processes texts longer than 512 and shorter than 2048 characters.
The invention has the advantages that:
Establishing high-frequency word lists through data analysis and screening text features with high-frequency words enables small-sample data expansion and large-batch sample deletion, realizing rapid data-set expansion and sample balancing. Meanwhile, a multi-model integration approach is adopted for modeling, and the screened texts are processed in a targeted manner, which improves accuracy and prediction speed, solves the modeling difficulty caused by text-length differences, and comprehensively realizes an intelligent document classification technique with reasonable operability, high accuracy, and short inference time.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a schematic flow diagram of a document classification method based on model integration and data expansion;
FIG. 2 is a schematic diagram of the BiLSTM model's F1-score index over training iterations;
FIG. 3 is a schematic diagram of the BERT model's F1-score index over training iterations;
FIG. 4 is a schematic diagram of the XLNet model's F1-score index over training iterations.
Detailed Description
The invention will be described in detail below with reference to the drawings in connection with embodiments. It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The following detailed description is exemplary and is intended to provide further details of the invention. Unless defined otherwise, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the invention.
Example 1:
referring to fig. 1, the invention provides a document classification method based on model integration and data expansion, which specifically comprises the following steps:
Step 1, acquiring files from a plurality of departments and labeling each with its department name to obtain a data set;
Step 11, collecting and sorting documents from the different departments of an enterprise. The collected documents must cover all the file characteristics of each department; text length and text content are used as features when screening and collecting documents, and the numbers of texts with different characteristics are kept balanced;
Step 12, labeling the collected documents with their department names to form a data set in which each document corresponds to one or more labels.
Step 2, cleaning the dirty data in the data set, computing sample statistics, and determining the small-sample and large-batch labels;
Step 21, analyzing the data set and deleting dirty data: counting the number of characters in each sample and deleting documents containing fewer than 50 or more than 10000 characters;
Step 22, counting the number of samples contained under each label, calculating the mean over all labels (the target value for sample balancing), and screening out small-sample labels with fewer than 50 samples and large-sample labels with more than 300 samples.
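For illustration only (this sketch is not part of the original disclosure, and the record layout and function names are assumptions), steps 21 and 22 can be expressed in Python as follows; the thresholds of 50/10000 characters and 50/300 samples are those stated above:

    from collections import Counter

    # Sketch of steps 21-22; each record is assumed to be (text, [labels]).
    def clean_and_screen(records, min_chars=50, max_chars=10000,
                         small_max=50, large_min=300):
        # Step 21: delete dirty data -- documents that are too short or too long.
        cleaned = [(t, labs) for t, labs in records
                   if min_chars <= len(t) <= max_chars]
        # Step 22: count samples per label and compute the balancing target.
        counts = Counter(lab for _, labs in cleaned for lab in labs)
        mean_count = sum(counts.values()) / len(counts)  # target value for sample balancing
        small = {lab for lab, n in counts.items() if n < small_max}  # labels to expand
        large = {lab for lab, n in counts.items() if n > large_min}  # labels to prune
        return cleaned, counts, mean_count, small, large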
Step 3, creating department-associated hot-word lists, expanding the small samples, and pruning the large batches of samples;
Step 31, analyzing and comparing text contents. Industrial document content is strongly domain-specific, and some document keywords are highly salient. For example, documents of a power grid company's equipment department contain: electric pole, transformer, insulator, equipment fault, equipment abnormality, flange; documents of the development planning department contain: substation planning, power plant construction, wind farm construction, central station, and quality monitoring station. Document texts of different departments thus contain different domain-specific high-frequency words. TF-IDF modeling is used to extract the high-frequency words with characteristic directivity for each category and to construct hot-word lists for the different departments of the industry, and these departmental hot words serve as the condition for sample expansion and deletion;
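As an illustrative sketch only, the hot-word lists could be built with scikit-learn's TfidfVectorizer as below; the top_k cutoff and the per-department pseudo-document construction are assumptions, since the patent only states that TF-IDF modeling is used, and Chinese text would normally be word-segmented (e.g. with jieba) before vectorizing:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # dept_texts: dict mapping department name -> list of its document texts.
    def build_hotword_lists(dept_texts, top_k=100):
        depts = list(dept_texts)
        # One pseudo-document per department, so IDF down-weights words shared across departments.
        corpus = [" ".join(texts) for texts in dept_texts.values()]
        vec = TfidfVectorizer()
        tfidf = vec.fit_transform(corpus)
        vocab = np.array(vec.get_feature_names_out())
        hotwords = {}
        for i, dept in enumerate(depts):
            row = tfidf.getrow(i).toarray().ravel()
            hotwords[dept] = set(vocab[np.argsort(row)[::-1][:top_k]])  # top-k directional words
        return hotwords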
Step 32, small-sample data expansion. First step: split the text into sentences and judge whether each sentence contains a high-frequency word; if it does, retain the sentence as a text feature, and finally splice the sentences containing high-frequency words to generate a new sample. This preserves sentence meaning and the department's text features, achieving a first round of expansion;
Second step: direct segmentation and splicing of the small samples achieves a second round of expansion. For example, the first 128 and last 128 characters of the original text are retained (texts whose total length is less than 128 are segmented in units of 32), the middle part is cut into four segments, and random combinations of the head, the tail, and the four middle segments generate four samples of the same label with different contents. At this point the small-sample data has been amplified to 6 times the original amount;
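Both expansion rounds can be sketched as follows (illustrative only; the sentence delimiters, the recombination policy, and the function names are assumptions consistent with the example above):

    import random
    import re

    # First round: keep only sentences containing a departmental high-frequency word.
    def expand_by_hotwords(text, hotwords):
        sentences = re.split(r"[。！？!?\n]", text)
        kept = [s for s in sentences if any(w in s for w in hotwords)]
        return "。".join(kept) if kept else None

    # Second round: retain head and tail, cut the middle into four segments,
    # and recombine at random to create same-label samples with differing content.
    def expand_by_splicing(text, n_new=4):
        head, tail, mid = text[:128], text[-128:], text[128:-128]
        step = max(1, len(mid) // 4) if mid else 1
        segs = [mid[i:i + step] for i in range(0, len(mid), step)][:4]
        samples = []
        for _ in range(n_new):
            chosen = random.sample(segs, k=min(2, len(segs))) if segs else []
            samples.append(head + "".join(chosen) + tail)
        return samples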
Step 33, pruning the large batches of samples. TF-IDF is used to count the high-frequency words contained in each single text under a large-sample label; texts containing fewer than 10 high-frequency words are deleted as the first step, and, taking the sample mean as the target, samples whose text length exceeds 4096 characters are deleted as the second step. Together these steps make the number of samples contained in each category of the final data set approximately equal.
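A sketch of this pruning step, reusing the hot-word list from step 31 (illustrative; the function name and the target_count parameter, taken to be the sample mean from step 22, are assumptions):

    # Prune one large-sample label toward the balancing target.
    def prune_large_label(samples, hotwords, target_count,
                          min_hotwords=10, max_chars=4096):
        # First deletion: drop texts carrying too few departmental high-frequency words.
        kept = [t for t in samples
                if sum(t.count(w) for w in hotwords) >= min_hotwords]
        # Second deletion: drop overlong texts while the label is still above the target mean.
        kept.sort(key=len)  # longest texts sit at the end
        while len(kept) > target_count and len(kept[-1]) > max_chars:
            kept.pop()
        return kept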
Step 4, modeling with BiLSTM, ALBERT, and XLNet respectively;
Step 41, modeling with BiLSTM: the prediction accuracy for texts shorter than 256 characters is about 80%. F1-score is used as the model evaluation index, and the F1-score curve over training iterations is shown in FIG. 2.
Step 42, modeling with BERT: the prediction accuracy for texts longer than 256 and shorter than 512 characters is about 76%. F1-score is used as the model evaluation index, and the F1-score curve over training iterations is shown in FIG. 3.
Step 43, modeling with XLNet: the prediction accuracy for texts longer than 512 and shorter than 2048 characters is about 82%. F1-score is used as the model evaluation index, and the F1-score curve over training iterations is shown in FIG. 4.
Step 44, comparing the three independently built models: the F1-score evaluation curves for texts of different lengths level off after 20 to 30 training iterations. Each model was then called on a CPU server to independently predict the categories of the same 100 documents (distributed across text lengths below 256 characters, between 256 and 512 characters, and between 512 and 2048 characters).
With BiLSTM, predicting a document shorter than 256 characters takes about 0.1 second per sample, while a single text longer than 256 characters takes about 1 second; as text length grows, time consumption rises and prediction accuracy falls. Modeled independently, the total time for the 100 samples is about 100 seconds, and the prediction accuracy is about 48%.
With ALBERT, inference on a document shorter than 512 characters takes about 1.5 seconds per sample; for texts longer than 512 characters, the length limit degrades feature extraction and lowers accuracy. The total inference time for the 100 samples is about 150 seconds, and the prediction accuracy is about 56%: 8 percentage points higher than BiLSTM modeling, at 1.5 times the inference time.
With XLNet, inference on a document shorter than 1024 characters takes about 2 seconds per sample; for single texts longer than 1024 characters, inference time grows with text length up to about 12 seconds, and the prediction accuracy rises to about 70%. The total inference time for the 100 samples is about 1200 seconds. XLNet's modeling accuracy is the highest of the three, but its inference takes roughly 8 times as long as ALBERT's and 12 times as long as BiLSTM's.
Step 5, after the models are built, intelligent document classification is realized through logic judgment and integrated model deployment;
Logic judgment selects the appropriate model for predicting documents of different lengths: when the incoming text is shorter than 256 characters, BiLSTM is called; when it is longer than 256 and shorter than 512 characters, the ALBERT model is called, overcoming BiLSTM's poor handling of long text by trading time for accuracy; and when it is longer than 512 and shorter than 2048 characters, the XLNet model is called, overcoming both BiLSTM's poor handling of long text and ALBERT's 512-character limit, again trading time for accuracy.
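This logic judgment amounts to a simple length-based dispatch, sketched below (the model objects and their predict() interface are assumptions; the thresholds are those stated above):

    # Route a document to the model suited to its length (step 5 logic judgment).
    def classify(text, bilstm_model, albert_model, xlnet_model):
        n = len(text)
        if n < 256:
            return bilstm_model.predict(text)   # fast path for short documents
        elif n < 512:
            return albert_model.predict(text)   # trades time for accuracy
        elif n < 2048:
            return xlnet_model.predict(text)    # handles ultra-long text
        # Texts of 2048+ characters fall outside the stated ranges; truncating for
        # XLNet is one plausible fallback (an assumption, not from the patent).
        return xlnet_model.predict(text[:2048])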
Each of the three models deployed alone is either slow at inference or, when inference is fast, low in accuracy, and none fully meets the requirements of online document classification. BiLSTM, ALBERT, and XLNet are therefore integrated to realize document classification, with the logic screening applied before a text is passed to a model. Routing the same 100 documents through this logic-based screening, the total inference time is about 420 seconds and the accuracy is about 78%: trading time and accuracy against each other shortens overall time while raising overall accuracy. The accuracy is higher than that of BiLSTM, ALBERT, or XLNet modeled independently, the inference time is about 1/3 of that of XLNet alone, and with GPU acceleration the inference time drops to about 1/4 of the CPU time.
The invention has the advantages that:
based on artificial intelligence, intelligent text classification is realized, and manual turning classification circulation official documents are replaced.
The problems of overlong text and time consumption in single model deployment are broken through, and the intelligent document classification technology with relatively reasonable operability, high accuracy and short reasoning time is comprehensively realized by means of accuracy and reasoning time interchange based on model integration and logic screening.
The high-frequency word list is established through data analysis, the text features are screened by utilizing the high-frequency words, so that the small sample data expansion and the large sample deletion are realized, and the problems of rapid expansion of a data set and sample equalization are realized.
It will be appreciated by those skilled in the art that the present invention can be carried out in other embodiments without departing from the spirit or essential characteristics thereof. Accordingly, the above disclosed embodiments are illustrative in all respects, and not exclusive. All changes that come within the scope of the invention or equivalents thereto are intended to be embraced therein.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (7)

1. A document classification method based on model integration and data expansion is characterized by comprising the following steps:
step 1, collecting a plurality of department files, establishing a data set, and taking the department names as labels of the corresponding files;
step 2, deleting dirty data in the data set, and screening out small sample tags with the sample number smaller than 50 and large sample tags with the sample number larger than 300 in the data set;
step 3, establishing high-frequency word lists of different departments, expanding small sample label data according to the high-frequency word lists of different departments, and deleting large sample label data; obtaining a processed data set;
step 4, modeling respectively with BiLSTM, ALBERT, and XLNet according to the processed data set to obtain a BiLSTM model, an ALBERT model, and an XLNet model which can be used for classification;
in step 3, expanding the small-sample label data specifically comprises:
splitting the text into sentences, judging whether each sentence contains a high-frequency word, retaining the sentence as a text feature if it does, and splicing the sentences containing high-frequency words to generate a new sample;
in step 3, expanding the small-sample label data further comprises:
directly segmenting and splicing the new samples, so that the small-sample data is amplified to 6 times the original amount;
step 5, collecting target text data and calling the corresponding model for classification according to the target text data; specifically:
the BiLSTM model is used for processing texts shorter than 256 characters; the ALBERT model is used for processing texts longer than 256 and shorter than 512 characters; and the XLNet model is used for processing texts longer than 512 and shorter than 2048 characters.
2. The document classification method based on model integration and data expansion according to claim 1, wherein in step 1, text length and text content are used as features to screen texts, the department names are used as labels to mark the collected documents, and a data set is formed in which each document corresponds to one or more labels.
3. The document classification method based on model integration and data expansion according to claim 1, wherein in step 2, deleting dirty data from the data set specifically comprises deleting department files containing fewer than 50 or more than 10000 characters.
4. The document classification method based on model integration and data expansion according to claim 1, further comprising, in step 2, counting the number of samples contained under each label and calculating the mean of the samples contained under all labels after the dirty data is deleted.
5. The document classification method based on model integration and data expansion of claim 1, wherein in step 3, high frequency word lists of different departments are constructed by using TF-IDF modeling.
6. The document classification method based on model integration and data expansion according to claim 1, wherein deleting the large-sample label data specifically comprises:
counting the high-frequency words contained in each single text under the large-sample labels, and deleting texts containing fewer than 10 high-frequency words to realize the first deletion;
and, taking the sample mean as the target, deleting samples whose text length exceeds 4096 characters to realize the second deletion.
7. The document classification method based on model integration and data expansion of claim 6, wherein the number of high frequency words contained in a single text is counted by TF-IDF.
CN202310817468.9A 2023-07-05 2023-07-05 Document classification method based on model integration and data expansion Active CN116541527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310817468.9A CN116541527B (en) 2023-07-05 2023-07-05 Document classification method based on model integration and data expansion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310817468.9A CN116541527B (en) 2023-07-05 2023-07-05 Document classification method based on model integration and data expansion

Publications (2)

Publication Number Publication Date
CN116541527A CN116541527A (en) 2023-08-04
CN116541527B true CN116541527B (en) 2023-09-29

Family

ID=87454535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310817468.9A Active CN116541527B (en) 2023-07-05 2023-07-05 Document classification method based on model integration and data expansion

Country Status (1)

Country Link
CN (1) CN116541527B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354818A (en) * 2016-08-30 2017-01-25 电子科技大学 Dynamic user attribute extraction method based on social media
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A Chinese word segmentation method based on word-linked characters
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
CN114428854A (en) * 2021-12-20 2022-05-03 成都信息工程大学 Variable-length text classification method based on length normalization and active learning
CN115858774A (en) * 2022-06-08 2023-03-28 北京中关村科金技术有限公司 Data enhancement method and device for text classification, electronic equipment and medium
CN116150010A (en) * 2023-02-23 2023-05-23 南京慕测信息科技有限公司 Test case classification method based on ship feature labels

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8805861B2 (en) * 2008-12-09 2014-08-12 Google Inc. Methods and systems to train models to extract and integrate information from data sources


Also Published As

Publication number Publication date
CN116541527A (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN107273295B (en) Software problem report classification method based on text chaos
CN110910175B (en) Image generation method for travel ticket product
CN113468317B (en) Resume screening method, system, equipment and storage medium
KR20200127557A (en) A program recording midium for an automatic sentiment information labeling method to news articles for providing sentiment information
CN112685374B (en) Log classification method and device and electronic equipment
CN114491034B (en) Text classification method and intelligent device
CN115600109A (en) Sample set optimization method and device, equipment, medium and product thereof
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN112579730A (en) High-expansibility multi-label text classification method and device
KR20200127587A (en) A program for an automatic sentiment information labeling to news articles for providing sentiment information
KR20200127553A (en) An automatic sentiment information labeling method to news articles for providing sentiment information
CN116541527B (en) Document classification method based on model integration and data expansion
CN115618264A (en) Method, apparatus, device and medium for topic classification of data assets
KR20200127555A (en) A program for an automatic sentiment information labeling to news articles for providing sentiment information
KR20200127636A (en) A program recording midium for an automatic sentiment information labeling to news articles for providing sentiment information
CN111400375A (en) Business opportunity mining method and device based on financial service data
CN110895564A (en) Potential customer data processing method and device
KR20200127552A (en) An automatic sentiment information labeling method to news articles for providing sentiment information and an apparatus using it
KR20200127670A (en) An apparatus for an automatic sentiment information labeling method to news articles for providing sentiment information
KR20200127590A (en) An apparatus for automatic sentiment information labeling to news articles
KR20200127654A (en) A operating method for an automatic sentiment information labeling apparatus to news articles
KR20200127589A (en) An apparatus for automatic sentiment information labeling to news articles
KR20210001693A (en) A rcording media for recording program for providing a corporate insolvencies information based on automatic sentiment information labelings
KR102609132B1 (en) Method and apparatus for automatically constructing sentiment dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant