CN117251524A - Short text classification method based on multi-strategy fusion - Google Patents
Short text classification method based on multi-strategy fusion
- Publication number
- CN117251524A (application CN202310446513.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- label
- training
- word
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a short text classification method based on multi-strategy fusion. It belongs to the field of natural language processing and mainly involves deep neural networks, data enhancement and text classification. The method comprises the following steps: noise data is removed through data preprocessing; keyword classification based on part-of-speech tagging is carried out; text classification based on data enhancement is carried out; and finally, corresponding thresholds are set through multi-strategy fusion to obtain the labels of network short text data. The short text classification method based on multi-strategy fusion improves the classification of short text data and increases the accuracy and efficiency with which service personnel find relevant short text data.
Description
Technical Field
The invention relates to a short text classification method based on multi-strategy fusion. It belongs to the technical field of natural language processing and mainly involves deep neural networks, data enhancement and text classification.
Background
With the development of network and communication technologies, network short text has become an important medium for people to exchange information. However, a great deal of junk short text data also exists on the network. Ensuring a good user experience — avoiding the nuisance of junk short text while still mining valuable information from a data-mining perspective — is a practical problem that needs to be solved at present.
Current short text classification mainly relies on neural network models, which usually need a large amount of training data to achieve good performance. More training data helps the model reduce over-fitting and improves robustness, but obtaining large amounts of labeled data generally requires considerable labeling effort and time. Data enhancement is a useful synthetic-data-generation technique that is widely used in computer vision and speech recognition, but because of the complexity of language it is more challenging to apply to natural language processing (NLP) tasks. One successful data-enhancement method in NLP is back-translation, i.e., translating a monolingual sentence from the target language to the source language with a translation model to generate a semantically equivalent parallel sentence. Other useful methods include reordering certain nodes of annotated data based on dependency syntax to generate new data, and generating synonymous questions from a knowledge base. In addition, synonym substitution, random deletion/swap/insertion, and generation with a VAE or a pre-trained language model can be used for some NLP tasks, but their enhancement effect is limited.
Short text data classification automatically analyzes the content of a text with an algorithm in order to identify the topic category of different texts; by batch-processing large amounts of text data, the corresponding category can be judged quickly and the text data classified accurately (given n input texts, each text is assigned to one or more of m categories). Short text classification is typically divided into supervised text classification, such as BERT-based classification, and unsupervised topic clustering, such as LDA-based topic clustering algorithms.
The technical schemes of the prior art are described in detail as follows. 1. LDA-based topic clustering: LDA (Latent Dirichlet Allocation) topic-model clustering uses feature extraction and a word co-occurrence model, and studies LDA short-text clustering in depth through topic-feature-word weighting. Specifically, to address the problem that the traditional LDA topic model does not fully consider topic words and exploit feature extraction, an LDA short-text clustering algorithm based on topic-word co-occurrence and knowledge-guided feature extraction is proposed: a bag-of-words model based on topic-word co-occurrence is built inside the LDA topic model to generate a topic knowledge set; the generated topic knowledge set is injected into the LDA model for feature extraction, and semantic knowledge is extracted iteratively to achieve joint clustering of topic and semantic information. Further, to address the problem that some works ignore the correlation between topic words and ordinary words, a clustering algorithm based on topic feature words is studied: the correlation between topic words and words is fully considered, a more complete topic word bag is built, and the related problems are defined. 2. BERT-based short text classification: the assembled long text is preprocessed through word segmentation, stop-word removal and similar operations; TF-IDF is computed for the words of all texts, and the TF-IDF value of each word is used as its weight in the word vector; the dimensionality of the word vectors is reduced with a gradient dimension-reduction method; finally, the word vectors are trained with a traditional machine learning method to obtain a classification model for short texts.
Disclosure of Invention
Disadvantages of the prior art are as follows. Disadvantage 1: short text data is short and feature-sparse; a keyword strategy has a high recall rate but easily brings in a large amount of noise, while text classification can obtain data labels relatively accurately but has poorer recall and easily misses important information. Disadvantage 2: in practical applications the importance of nouns and verbs is not highlighted; adverbs, adjectives, pronouns, conjunctions and their sub-categories are not effective words in the language and only play a modifying role, so computing word importance with the traditional TF-IDF algorithm easily introduces a large amount of noise. Disadvantage 3: short text data is class-imbalanced and scarce; conventional data-enhancement methods that use a weak model to reverse-label data inevitably introduce more noise, and they require large amounts of labeled data and knowledge, otherwise the reverse-labeled data is likely to be very noisy; synonym replacement typically relies on additional knowledge, such as a manually designed dictionary like WordNet, but such an approach may have low coverage or be directly unusable for low-resource languages.
The technical problems solved by the invention are as follows. To address problem 1, the invention provides a complete short text classification method based on data enhancement, which improves the accuracy of short text data classification. To address problem 2, in keyword extraction from short text data, in order to obtain the keywords most relevant to the labels, the TF-IDF calculation method is improved: a part-of-speech tagging strategy is fused to focus on the nouns in the data, and verbs are combined with them to form proper compound terms, so that the extracted keywords better match the extraction logic; at the same time, useless information such as common non-sensitive words is removed through human-machine cooperation to obtain more accurate single keywords, and reverse keywords are obtained through data analysis and curation to improve keyword screening accuracy. To address problem 3 — class imbalance in text classification data and similar issues — a text classification algorithm solution fused with data enhancement is proposed: a BERT pre-training language model with domain common sense is used, and the BERT model is fine-tuned with domain common sense and knowledge triples to complete domain-data adaptation; data enhancement is carried out in combination with entity enhancement to increase data diversity; data expansion is completed by means of an entity-extraction algorithm and a self-constructed graph mapping table; finally, text classification of the enhanced data is completed based on the pre-trained model and a BERT+softmax model, thereby improving text classification accuracy.
The invention adopts the following technical scheme: a short text classification method based on multi-strategy fusion comprises the following steps:
step SS1: finishing data marking by combing the marked short text data, and generating short text communication data;
step SS2: preprocessing the short text communication data obtained in the step SS1 to generate standardized data after text preprocessing;
step SS3: classifying TF-IDF keywords based on part of speech tagging on the standardized data after text preprocessing, calculating the relativity of words and all sentences in a library, and acquiring a keyword list in a mode of sorting and man-machine collaborative removal of very sensitive common words; carrying out data analysis on the standardized data after text preprocessing, extracting and outputting a combined keyword and a reverse keyword list, and configuring corresponding weights to prepare for subsequent content intention recognition and data screening;
step SS4: aiming at the standardized data after text preprocessing, adopting a BERT pre-training language model of the domain common sense to fuse the domain common sense and a knowledge triplet fine tuning BERT model to complete the domain data adaptation; carrying out data enhancement by adopting a data enhancement algorithm based on entity enhancement, increasing data diversity, and completing data quantity enhancement by adopting a fusion knowledge graph data enhancement algorithm by means of entity extraction and self-constructed graph mapping tables; finally, merging entity enhancement data and data to be processed to finish text classification based on a pre-training model and a bert+softmax model;
step SS5: outputting the data category labels and probabilities according to the classification results of step SS3 and step SS4.
As a preferred embodiment, the step SS2 specifically includes: matching special characters and sensitive words, including deformed characters and emoticon packs, with regular expressions; for short text communication data that exhibits interception-avoidance behavior, replacing keywords through regular-expression matching; for internet slang, colloquial expressions and traditional Chinese characters, building related word banks and then matching and replacing with regular expressions; uniformly converting traditional Chinese characters and upper-case English into simplified Chinese and lower-case English, performing stop-word processing, and removing invalid characters and URL information.
As a preferred embodiment, the step SS3 specifically includes: calculating the TF of each word and the IDF across words from the standardized data after text preprocessing, calculating the TF-IDF values, sorting to obtain keywords, removing non-sensitive common words through human-machine cooperation, and obtaining the keyword list, where TF denotes word frequency, calculated as TF = (number of times a word appears in a document) / (total number of words in the document), and IDF denotes inverse document frequency, calculated as IDF = log(total number of documents in the corpus / (number of documents containing the word + 1)); after the TF word frequency and the IDF inverse document frequency are obtained, the two numbers are multiplied to obtain the TF-IDF value of a word; and analyzing historical category-label data from the standardized data after text preprocessing, extracting category-label key features, updating them through repeated experimental iterations to obtain combined keywords, and finally outputting the keyword list and the combined-keyword list.
As a preferred embodiment, the step SS3 specifically further includes: removing adverbs, adjectives, pronouns, conjunctions and noise reduction processing of sub-category part-of-speech words of the adverbs, the adjectives, the conjunctions and the sub-category part-of-speech words of the adjectives are carried out on the standardized data after text preprocessing, and the computational formula of TF is improved as follows:
the TF-IDF value of a shorter length word is increased by decreasing the word frequency of a longer length word.
As a preferred embodiment, the step SS3 specifically further includes: corresponding TF-IDF values are calculated for all words in the document, common word data filtering is carried out on the keywords sequenced by the TF-IDF values in a man-machine cooperation mode, a single keyword word list is output, and corresponding weights h are configured according to the importance degree of the keywords Single sheet The method comprises the steps of carrying out a first treatment on the surface of the Extracting short words with TF-IDF values exceeding a set threshold value from each category, calculating combined keywords, searching the extracted nouns in an original database, obtaining noun and verb combined keywords, and using (the number of the combined words) -1 Reducing probability values of a plurality of combined words, and calculating TF-IDF values of the combined keywords by combining the following formulas:
finally, carrying out multi-round analysis on the data by combining with data analysis, extracting key characteristics of category labels, carrying out repeated experimental iteration update, outputting a combined key word and a reverse key word list, and configuring corresponding weight h Group of 、h Reverse-rotation The data screening strategy is as follows:
Score keyword(s) =h Single sheet *Score Single sheet +h Group of *Score Group of -h Reverse-rotation *Score Reverse-rotation
Preparation for subsequent content intent recognition, data screening, where h Group of Configuring weights for combined keywords, h Reverse-rotation Configuring weights for reverse keyword tables, h Single sheet Configuration weights for individual keywords, score Single sheet 、Score Group of 、Score Reverse-rotation Score results of the individual keywords, the combined keywords, and the reverse keyword table, respectively Keyword(s) Comprehensive scoring results for TF-IDF keyword classification.
As a preferred embodiment, the step SS4 specifically includes: the data composition of the pre-trained model comprises domain data, a domain common-sense library and knowledge triples. Using the BERT pre-training language model with domain common sense and fine-tuning it with the domain common sense and knowledge triples to complete domain-data adaptation specifically includes: fusing the domain common-sense information and the knowledge triples into the pre-training data so that the pre-trained model carries corresponding expert knowledge; performing MASK at the word level instead of the original character level, integrating more prior knowledge and reducing damage to the word structure specific to Chinese; instead of first training on 128-length text and then fine-tuning at length 64 in two stages as in the original method, training directly on a sequence length of 64 so that the model adapts to short texts; adjusting the gradient-accumulation hyper-parameter so that the model accumulates batches to 64 before computing the loss once, to simulate the subsequent real training scenario; adopting a larger batch size and a longer training step count; removing the original next-sentence-prediction loss function; and inputting the adapted data into the pre-trained model and training for 800,000 steps with a learning rate of 2e-6 to obtain the domain BERT pre-trained model.
As a preferred embodiment, the data enhancement performed by the entity-based data enhancement algorithm in step SS4 specifically includes:
linearizing the labeled sentences, training a language model on the linearized data and generating new labeled data, where during linearization each label is inserted before its corresponding word so that the label is treated as a modifier of that word; O labels, which occur frequently in sequence-labeling tasks, are deleted when the labeled data is linearized; after the labeled data is linearized, special marks [BOS] and [EOS] are added at the beginning and end of each sentence respectively to mark sentence boundaries and help the language model complete training and data generation;
after the linearization of the labeled data is completed, a language model is used to learn the distribution of words and labels; a random mask mechanism is applied to the non-label tokens of the input data, increasing data diversity without changing the semantics and structure of the original data; token embeddings that better match the real data logic are generated with the fine-tuned pre-trained model; and a language model is built in combination with a BiLSTM recurrent neural network to fuse context information and generate data;
training the language model by maximizing the probability of the next-token prediction: given a sentence, the token sequence (w_1, w_2, ..., w_N) is first randomly masked to produce R_t = Random_mask(w_t); then (R_1, R_2, ..., R_N) is fed into the BERT pre-trained model to obtain the token embeddings (x_1, x_2, ..., x_N), where N is the sequence length; these are fed into the BiLSTM to generate a hidden state at every position t, h_t = LSTM(x_1, x_2, ..., x_{t-1}); dropout is applied to each hidden state h_t to produce d'_t = dropout(h_t); finally, a linear layer plus a softmax layer is used to predict the next token in the sequence;
after the language model has been trained, it is used to generate synthetic training data for the labeling task: in the generation process a [BOS] token is fed into the language model, the probability of every token in the vocabulary is computed according to the formula, and sentences are generated autoregressively after [BOS] is given, where the token generated at the previous step is used as input to generate the next token, so that all subsequent tokens are obtained one by one; the generated text sequences are in the linearized format, and data post-processing is completed in a rule-based way: cleaning sentences without labels; cleaning sentences whose words are all [UNK]; cleaning sentences with an incorrect label order; and cleaning label-conflicting data and duplicate data.
As a preferred embodiment, completing the data-volume enhancement in step SS4 with the knowledge-graph-fused data enhancement algorithm, by means of entity extraction and the self-constructed graph mapping table, specifically includes:
completing the extraction of specific-label entities based on BERT+BiLSTM+self-attention+CRF, and completing batch replacement of data through hard-matching dictionary mapping in combination with an existing knowledge base, increasing the amount of data;
in the specific-label entity extraction, text features are extracted with the BERT pre-trained model to obtain a word-granularity vector matrix; context information is learned with the BiLSTM model; several self-attention mechanisms are adopted, projecting the original input into different spaces to obtain contextual semantic information from multiple dimensions, with several sets of attention parameters shared during training so that key text features are emphasized and contextual semantic information is learned; finally, the CRF processes the output sequence of the BiLSTM and, combining the state-transition matrix in the CRF with the constraints between adjacent labels, obtains the globally optimal sequence to complete entity extraction;
the extracted entities are batch-replaced through hard-matching dictionary mapping to complete knowledge-graph-based data enhancement;
finally, the text classification task is completed with the resulting BERT+softmax model to obtain score_classification = [S_5, S_6, S_7, S_8].
As a preferred embodiment, the step SS5 specifically includes:
model iteration updating is independently carried out on keywords, data intention recognition and similarity in a training stage, more accurate theme labels are required to be obtained in a testing stage, and a strategy score is required to be obtained in a testing stage Keyword(s) =[S 1 ,S 2 ,S 3 ,S 4 ]、score Classification =[S 5 ,S 6 ,S 7 ,S 8 ]Setting corresponding weights according to the contribution degree, and obtaining the final h through multiple experimental tuning Keyword(s) 、h Classification The method comprises the steps of carrying out a first treatment on the surface of the Through S 1 *h Keyword(s) +S 5 *h Classification Obtaining each class of tag scores using max (S 1 *h Keyword(s) +S 5 *h Classification ) The maximum tag score is obtained by comparing max [ max (S 1 *h Keyword(s) +S 5 *h Classification ),Threshold]If the class label score is greater than the threshold, the data belongs to the corresponding label, otherwise belongs to other classes, and the final score label is as follows:
score total =[max[max(S 1 *h keyword(s) +S 5
*h Classification ),Threshold],……]
=[label 1 ,,label 2 ,……,label n ]
Take [ label 1 ,,label 2 ,……,label n ]And if the mode is the other label, taking the secondary mode as the selected label, otherwise taking the other label as the label of the scheme, setting a mode threshold at the same time, outputting the label when the label is larger than the mode threshold, and otherwise, still outputting the other label.
The invention has the following beneficial effects. Applying deep learning, the scheme provides a data-enhancement-based short text classification solution and a text classification algorithm fused with data enhancement for problems such as class imbalance in text classification data: a BERT pre-training language model with domain common sense is used and fine-tuned with domain common sense and knowledge triples to complete domain-data adaptation; data enhancement is carried out in combination with entity enhancement to increase data diversity; batch data replacement is completed through hard-matching dictionary mapping by means of a graph mapping table (self-constructed, with several interchangeable entity dictionaries such as playing mahjong, pushing mahjong and frying golden flower configured in the class-D dictionary), increasing the data volume; finally, the entity-enhanced data and the data to be processed are merged, and text classification is completed based on the pre-trained model and the BERT+softmax model, further improving text classification accuracy. Compared with the prior art, the scheme has the following advantages. Advantage 1: the BERT pre-trained model is used, the adaptation of the pre-trained language model to short text data is improved by fusing domain data, common-sense information and knowledge triples, and domain-adapted models are provided for text classification, data enhancement and element extraction. Advantage 2: single keywords, combined keywords and reverse keywords are extracted with multi-strategy fusion, guaranteeing the accuracy of data screening and topic extraction from different dimensions. Advantage 3: data expansion is completed in a targeted way with the two modes of entity enhancement and graph enhancement, improving the performance of the classification model. Advantage 4: a multi-strategy-fusion short text data classification method is established, data category labels are obtained from different dimensions, and the optimal topic category is obtained by setting corresponding thresholds.
Drawings
FIG. 1 is a flow chart of a short text classification method based on multi-policy fusion of the present invention.
Fig. 2 is a flow chart of the TF-IDF keyword classification algorithm based on part-of-speech tagging of the present invention.
FIG. 3 is a flow chart of the fused data enhanced text classification algorithm of the present invention.
FIG. 4 is a schematic illustration of entity-based enhancement tagging in accordance with a preferred embodiment of the present invention.
FIG. 5 is a schematic diagram of the topology of a language model of a preferred embodiment of the present invention.
Fig. 6 is a topological schematic diagram based on knowledge-graph data enhancement of a preferred embodiment of the invention.
Fig. 7 is a schematic diagram of text classification in accordance with a preferred embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
To address the above technical problems, the invention provides a short text classification method based on multi-strategy fusion; the overall framework of the solution is shown in figure 1. Noise data is removed through data preprocessing, keyword classification is carried out based on part-of-speech tagging, a data-enhancement-based text classification algorithm is applied, and finally the labels of the short text communication data are obtained by setting corresponding thresholds through multi-strategy fusion.
1. Data preparation: data annotation is completed by manually curating the labeled short text data (text data by default).
2. Data preprocessing: with the development of technology and the popularity of internet language, the content of short text communication data has become varied, mainly comprising characters, pictures, emoticons, URLs and the like, where the characters include not only written words but also everyday and internet expressions. Short text communication data differs from the ordinary text of daily life: its content is more colloquial and weakly logical, and these characteristics have a great influence on subsequent filtering, so the content of short text communication data needs to be preprocessed to suit the classification requirement.
Preprocessing the text data — mainly cleaning and normalizing it — is an important step in building a deep learning model: (1) special words: junk short text communication data can contain many special characters of various kinds, including deformed characters, emoticon packs and the like, for example "℃", "m/s" and "≡A"; regular expression matching is required for such special words and sensitive words, for example "≡A" is replaced with "<"; (2) anti-interception behavior: to avoid junk-short-text filtering, many spammers apply special treatment to certain vocabulary, such as "Weixin", "C0m" and offers of exam answers; for such anti-interception behavior, keywords need to be replaced through regular-expression matching; (3) internet slang, colloquial expressions and traditional characters: short text communication data, as a carrier of everyday communication between people, is rich and varied, contains internet and colloquial expressions, and often contains traditional Chinese characters and slang such as "giving force", "children boots" and the like; for such internet and colloquial words, related word banks can be built and then matched and replaced with regular expressions; (4) data normalization: traditional characters and upper-case English are uniformly converted into simplified Chinese and lower-case English, stop-word processing is performed, and invalid characters and URL information are removed.
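By way of illustration, a minimal sketch of this preprocessing pipeline is given below. The replacement tables and the stop-word set are hypothetical placeholders; the actual sensitive-word, anti-interception, slang and traditional-character dictionaries used by the method are not given in this text.

```python
import re

# Illustrative, hypothetical replacement tables (steps (1)-(3) above); the real
# dictionaries are assumed to be supplied by the method's word banks.
SPECIAL_CHAR_MAP = {"≡A": "<"}                               # special symbols -> plain text
ANTI_INTERCEPT_PATTERNS = [(r"[Ww]ei\s*xin", "weixin"),      # obfuscated spellings -> canonical form
                           (r"[Cc]0m", "com")]
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")

def preprocess(text: str, stopwords: set) -> str:
    # (1) special characters and emoticon-like symbols
    for src, dst in SPECIAL_CHAR_MAP.items():
        text = text.replace(src, dst)
    # (2) anti-interception variants replaced via regular expressions
    for pattern, repl in ANTI_INTERCEPT_PATTERNS:
        text = re.sub(pattern, repl, text)
    # (4) normalisation: remove URLs, lower-case English
    #     (traditional -> simplified conversion would need an extra library, e.g. opencc)
    text = URL_PATTERN.sub("", text).lower()
    # remove invalid characters, keeping CJK, alphanumerics and basic punctuation
    text = re.sub(r"[^\u4e00-\u9fa5a-z0-9，。！？,.!? ]", "", text)
    # stop-word removal (character level here; word level after segmentation is also possible)
    return "".join(ch for ch in text if ch not in stopwords)
```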
TF-IDF keyword classification based on part-of-speech tagging (fusing parts of speech into keyword recognition): the invention uses an improved TF-IDF algorithm to calculate the correlation between words and all sentences in the library, and obtains the keyword list by sorting and by removing non-sensitive common words through human-machine cooperation. Because matching single keywords brings in a large amount of noise data, category-label key features are extracted through manual data analysis and updated through repeated experimental iterations, and combined keywords (when two combined keywords occur together, the corresponding category weight is increased and the data attribution is judged) and a reverse keyword list (when such a word occurs, the weights of the other categories are increased) are output, with corresponding weights configured in preparation for subsequent content-intention recognition and data screening. The specific keyword-extraction flow is shown in fig. 2.
TF-IDF is a common technique for information retrieval and data mining. Where TF represents word frequency, the calculation method is tf= (number of times a word appears in an article/total number of words in the article), IDF represents inverse document frequency, and the calculation method is idf=log (total number of documents in corpus/(number of documents containing a word+1)). After having TF (word frequency) and IDF (inverse document frequency), the two numbers are multiplied to obtain the TF-IDF value of a word. The larger the TF-IDF of a word in an article, the more important the word will generally be in that article.
However, the general TF-IDF algorithm has some problems. The data usually contains nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories, but only verbs and nouns play a leading role in the data (its central idea); adverbs, adjectives, pronouns, conjunctions and their sub-categories are not effective words in the language and only play a modifying role. Therefore, when TF is computed, stop words, single characters and adverb/adjective/pronoun/conjunction-type words are first removed from the data, reducing data noise, so that only nouns and verbs remain in the noise-reduced data. The TF-IDF values of all nouns are counted first. The traditional TF-IDF algorithm also ignores word length: in computing TF, words are treated identically no matter how many characters they contain. This is not well suited to the present invention, because the data is short and the focus is on text with shorter segmented words, so longer words bring no benefit to keyword extraction here. Therefore, when computing the TF-IDF of words, the TF formula is improved so that the TF-IDF value of a shorter word is increased by decreasing the word frequency assigned to a longer word.
The corresponding TF-IDF value is computed for every word in the document, common-word data is filtered from the TF-IDF-ranked keywords through human-machine cooperation, a single-keyword word list is output, and a corresponding weight h_single is configured according to the importance of each keyword. However, single-keyword matching cannot meet the data-screening requirement. To improve data-screening efficiency and reduce the level of noise data, the invention extracts, for each category, the short words with larger TF-IDF values (the number extracted can be customized manually, with human-machine cooperation) and computes combined keywords: the extracted nouns are searched in the original database to obtain noun-verb combined keywords (two or more combined words), and a factor of (number of combined words)^-1 is used to reduce the probability value of combinations with many words (focusing on combinations with fewer words); the TF-IDF value of a combined keyword is computed accordingly.

Finally, in combination with data analysis, the data is analyzed over multiple rounds, category-label key features are extracted and updated through repeated experimental iterations, and combined keywords (when two combined keywords occur together, the corresponding category weight is increased and the data attribution is judged) and a reverse keyword list (when such a word occurs, the weights of the other categories are increased) are output, with corresponding weights h_comb and h_rev configured in preparation for subsequent content-intention recognition and data screening. The data-screening strategy is:

Score_keyword = h_single * Score_single + h_comb * Score_comb - h_rev * Score_rev
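As an illustration of this screening strategy, the following sketch assumes each keyword table maps a keyword (or tuple of combined keywords) to its TF-IDF-derived score, and that a combined keyword only fires when all of its member words occur in the text; these data structures are assumptions, not part of the original description.

```python
def keyword_score(text, single_kw, combo_kw, reverse_kw,
                  h_single, h_comb, h_rev):
    """Score_keyword = h_single*Score_single + h_comb*Score_comb - h_rev*Score_rev."""
    score_single = sum(s for w, s in single_kw.items() if w in text)
    # a combined keyword contributes only when every member word co-occurs in the text;
    # its stored score is assumed to already include the (number of combined words)**-1 factor
    score_comb = sum(s for words, s in combo_kw.items() if all(w in text for w in words))
    score_rev = sum(s for w, s in reverse_kw.items() if w in text)
    return h_single * score_single + h_comb * score_comb - h_rev * score_rev
```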
Text classification algorithm fused with data enhancement: for problems such as class imbalance in text classification data, the invention proposes a text classification algorithm solution fused with data enhancement. A BERT pre-training language model with domain common sense is used and fine-tuned with domain common sense and knowledge triples to complete domain-data adaptation; data enhancement is carried out in combination with entity enhancement to increase data diversity, and data-volume enhancement is completed by means of entity extraction and a self-constructed graph mapping table (the class-D dictionary contains several interchangeable entity dictionaries such as playing mahjong, pushing mahjong and frying golden flower; batch data replacement is completed through hard-matching dictionary mapping, increasing the data volume); finally, the entity-enhanced data and the data to be processed are merged, and text classification is completed based on the pre-trained model and the BERT+softmax model, further improving text classification accuracy. The specific technical scheme is described in detail below, and the flow chart is shown in fig. 3.
(1) BERT pre-training language model fused with domain common sense: the corpus required by a pre-training language model is an unlabeled Chinese corpus. This scheme uses the whole-word-masking (WWM, Whole Word Masking) pre-trained BERT model released by the HIT–iFLYTEK joint laboratory as the base model. The corpora used to train that BERT model include Chinese Wikipedia, Baidu Baike, news reports, Chinese question-answering data and so on, but these are general corpora in the public domain, and their suitability for the domain user document library is not good enough. Therefore this scheme cleans, preprocesses, and segments into sentences and words the corpus in the short text communication library to form the input format required for training the BERT language model, and then continues training the base BERT model to enhance its suitability for domain data.
Pre-training model data composition: domain data, domain common sense library, knowledge triples.
The invention makes the following improvements to the model: 1) the domain common-sense information and the knowledge triples are fused into the pre-training data so that the pre-trained model carries corresponding expert knowledge; 2) MASK is performed at the word level instead of the original character level, integrating more prior knowledge and reducing damage to the word structure specific to Chinese; 3) instead of first training on 128-length text and then fine-tuning at length 64 in two stages as in the original method, training is done directly on a sequence length of 64 so that the model adapts to short texts; 4) the gradient-accumulation hyper-parameter is adjusted so that the model accumulates batches to 64 before computing the loss once, to simulate the subsequent real training scenario; 5) because the data is short, a larger batch size and a longer training step count are adopted; 6) the original next-sentence-prediction loss function is removed; the processed data is input into the pre-trained model and trained for 800,000 steps with a learning rate of 2e-6 to obtain the domain BERT pre-trained model.
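A sketch of this continued pre-training using the Hugging Face transformers library is given below, assuming the HFL whole-word-masking checkpoint as the base model and a `domain_dataset` of short texts (already mixed with domain common sense and knowledge triples, tokenised and truncated to 64 tokens). The per-device batch size and masking probability are assumptions; only the effective batch of 64, the learning rate and the step count come from the text.

```python
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForWholeWordMask, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-bert-wwm-ext")
model = BertForMaskedLM.from_pretrained("hfl/chinese-bert-wwm-ext")   # MLM only, no NSP head

args = TrainingArguments(
    output_dir="domain-bert",
    per_device_train_batch_size=8,       # assumed device batch size
    gradient_accumulation_steps=8,       # accumulate to an effective batch of 64
    learning_rate=2e-6,
    max_steps=800_000,                   # "train for 800,000 steps"
)
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model, args=args,
                  train_dataset=domain_dataset,     # assumed: 64-token domain corpus
                  data_collator=collator)
trainer.train()
```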
(2) Data enhancement: data enhancement is a common approach to solving the bottleneck of training data, and is widely used (e.g., displacement, flipping, rotation, scaling, etc.) especially in CVs. In terms of data enhancement of word-level NLP tasks, the existing methods have poor effects, such as word deletion, word addition, word replacement and the like, which can cause sentence imperfection and semantic change, and poor effects on tasks sensitive to NER context. Cross-language task settings, themselves, also contain low-resource problems, i.e., no or little training data for the target language. The commonly used cross-language method is a translation train, i.e. the training data in the source language is translated into the target language, and the training data is obtained by label mapping. Most of the existing works are based on word alignment for label mapping, but the existing works cannot well deal with word order adjustment and alignment failure caused by translation. The invention completes data enhancement based on entity data enhancement and knowledge-graph data enhancement based on the above problems.
The entity-based augmentation is motivated as follows: one successful data-enhancement method in NLP is back-translation, i.e., translating a monolingual sentence from the target language to the source language with a translation model to generate a semantically consistent parallel sentence, but the generated data has poor readability and loses semantics; synonym substitution and random deletion/swap/insertion typically rely on additional knowledge, such as a manually designed dictionary like WordNet, but such an approach may have low coverage or be directly unusable for low-resource languages; generating data with a VAE or a pre-trained language model is poorly controllable and inevitably introduces more noise.
The invention therefore provides an entity-enhancement-based method that uses a generative approach to complete data enhancement without changing the original semantics and structure: the labeled sentences are first linearized, then a language model is trained on the linearized data and new labeled data is generated, as shown in figure 4 (note: every word is paired with its tag by inserting the tag before (or after) the word, and the O tag is removed).
The text-generation method of the present invention unifies sentence generation and tagging with a language model. During linearization, the tag is inserted before the corresponding word, so the tag is treated as a modifier of that word (a word and its tag pair, e.g. "B-PER king", are generated together at training time, and a high-probability tag-word pair will be selected by the language model when data is actually generated). O labels, which occur frequently in sequence-labeling tasks, are deleted when the labeled data is linearized. Likewise, a tag could instead be inserted after the corresponding word. After linearization of the annotated data, special marks [BOS] and [EOS] are added at the beginning and end of each sentence respectively; these special marks delimit sentence boundaries and help the language model complete training and data generation.
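A sketch of this linearization step, under the tag-before-word convention described above, might look as follows.

```python
def linearize(tokens, tags):
    """Insert each non-O tag before its word, drop O tags, and wrap the sentence
    with the [BOS]/[EOS] boundary marks described above."""
    out = ["[BOS]"]
    for token, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(token)
    out.append("[EOS]")
    return out

# e.g. a sentence tagged ["B-PER", "O", "O", "S-LOC"] becomes
# ["[BOS]", "B-PER", <person>, <word>, <word>, "S-LOC", <place>, "[EOS]"]
```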
After the linearization of the labeled data is completed, a language model is used to learn the distribution of words and tags. First, a random mask mechanism is applied to the non-tag tokens of the input data, increasing data diversity without changing the semantics and structure of the original data; token embeddings that better match the real data logic are generated with the fine-tuned pre-trained model; and a language model is built in combination with a BiLSTM recurrent neural network to fuse context information and generate data. The model framework is shown in figure 5.
The language model is trained by maximizing the probability of the next-token prediction. Given a sentence, the token sequence (w_1, w_2, ..., w_N) is first randomly masked to produce R_t = Random_mask(w_t); then (R_1, R_2, ..., R_N) is fed into the BERT pre-trained model to obtain the token embeddings (x_1, x_2, ..., x_N), where N is the sequence length; these are fed into the BiLSTM to generate a hidden state at every position t, h_t = LSTM(x_1, x_2, ..., x_{t-1}); dropout is applied to each hidden state h_t to produce d'_t = dropout(h_t); finally, a linear layer plus a softmax layer is used to predict the next token in the sequence.
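The architecture just described can be sketched roughly as below; this is an illustrative sketch, not the authors' code. Note that the text names a BiLSTM while the formula h_t = LSTM(x_1, ..., x_{t-1}) describes a left-to-right pass, so a unidirectional LSTM is used here to keep next-token prediction causal; the hidden size and dropout rate are assumptions.

```python
import torch.nn as nn
from transformers import BertModel

class GenerationLM(nn.Module):
    """BERT token embeddings -> LSTM hidden states h_t -> dropout -> linear + softmax
    over the vocabulary, trained to predict the next (word or tag) token."""
    def __init__(self, bert_name, vocab_size, hidden_size=256, p_drop=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)     # fine-tuned domain BERT assumed
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, attention_mask):
        x = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(x)                      # hidden state h_t at every position
        logits = self.out(self.dropout(h))       # d'_t = dropout(h_t), then linear layer
        return logits                            # softmax is applied inside the loss function
```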
After the language model has been trained, it can be used to generate synthetic training data for the labeling task. In the generation process, only the [BOS] token is fed into the language model, and the probability of every token in the vocabulary is computed according to the formula. Given [BOS], sentences are generated by successive autoregression, where the token generated at the previous step is used as input to generate the next token, so that all subsequent tokens are obtained one by one. Notably, the language model is more likely to select tokens with higher probability during generation, but the invention increases the sampling randomness during generation so that the language model can generate similar but different augmented data given the same context.
Assuming the invention trains the language model by inserting tags before the corresponding words, when predicting the next token of a given input "Wang Dada want to go," the probability of "S-LOC" is much higher than other choices, because the language model has seen many similar examples in the training data, such as "I want to go to S-LOC," "he wants to go to S-LOC," etc. Then, when having "Wang Dada want to go to S-LOC", proceed to predict the next token. In the training data, all "S-LOC" are followed by place nouns, so "Beijing", "Shanghai", "hong Kong" etc. are all possible choices, and the predictive probabilities of these words are very close. With increased randomness, the model may choose any one of them.
The generated text sequences are in the linearized format, and the invention completes data post-processing in a rule-based way: cleaning sentences without labels; cleaning sentences whose words are all [UNK]; cleaning sentences with an incorrect label order (e.g. the E and B tags of the same entity out of order); and cleaning label-conflicting data and duplicate data.
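A rule-based filter implementing these cleanup rules might look as follows; the label-order and label-conflict checks are simplified away in this sketch, which only drops label-free sentences, all-[UNK] sentences and exact duplicates.

```python
def postprocess(generated):
    """generated: list of linearized token lists produced by the language model."""
    label_prefixes = ("B-", "I-", "E-", "S-")
    seen, kept = set(), []
    for sent in generated:
        words = [t for t in sent
                 if not t.startswith(label_prefixes) and t not in ("[BOS]", "[EOS]")]
        has_label = any(t.startswith(label_prefixes) for t in sent)
        all_unk = len(words) > 0 and all(t == "[UNK]" for t in words)
        key = tuple(sent)
        if has_label and not all_unk and key not in seen:   # keep only clean, novel sentences
            seen.add(key)
            kept.append(sent)
    return kept
```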
Knowledge-graph-based data enhancement mainly comprises two processes: extraction of specific-label entities is completed based on BERT+BiLSTM+self-attention+CRF, and batch replacement of data is completed through hard-matching dictionary mapping in combination with an existing knowledge base (by default a predefined, self-built graph mapping table containing several interchangeable entity dictionaries), increasing the data volume.
In the specific-label entity-extraction scheme, text features are first extracted with the BERT pre-trained model to obtain a word-granularity vector matrix; context information is learned with the BiLSTM model; several self-attention mechanisms are adopted, projecting the original input into different spaces to obtain contextual semantic information from multiple dimensions, with several sets of attention parameters shared during training, so that key text features are emphasized and contextual semantic information is learned; finally, the CRF processes the output sequence of the BiLSTM and, combining the state-transition matrix in the CRF with the constraints between adjacent labels, obtains the globally optimal sequence to complete entity extraction. The extracted entities are then batch-replaced through hard-matching dictionary mapping to complete knowledge-graph-based data enhancement, as shown in fig. 6.
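A sketch of the dictionary-mapping replacement step is given below, assuming the entity-extraction model returns (entity span, dictionary class) pairs; the dictionary fragment shown is the class-D gambling example mentioned in the text, and the function name and data layout are illustrative assumptions.

```python
# Illustrative fragment of the self-constructed graph mapping table; the class-D
# dictionary of interchangeable gambling entities is the example given in the text.
ENTITY_DICT = {"D": ["playing mahjong", "pushing mahjong", "frying golden flower"]}

def kg_augment(text, entities):
    """entities: (span, dict_class) pairs returned by the BERT+BiLSTM+self-attention+CRF
    extractor. Each span is hard-matched against its class dictionary and every
    interchangeable entity yields one new training sample."""
    augmented = []
    for span, dict_class in entities:
        for alternative in ENTITY_DICT.get(dict_class, []):
            if alternative != span and span in text:
                augmented.append(text.replace(span, alternative))
    return augmented
```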
3) Text classification: finally, the text classification task is completed with the resulting BERT+softmax model to obtain score_classification = [S_5, S_6, S_7, S_8], as shown in fig. 7.
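As an illustration, the classification scores [S_5, S_6, S_7, S_8] could be produced with a four-class BERT classification head as sketched below; the checkpoint path, the four-class assumption and the 64-token truncation are taken from context and should be read as assumptions.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("domain-bert")             # domain-adapted BERT assumed
classifier = BertForSequenceClassification.from_pretrained("domain-bert", num_labels=4)

def classify(text):
    inputs = tokenizer(text, truncation=True, max_length=64, return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return torch.softmax(logits, dim=-1)[0].tolist()   # score_classification = [S5, S6, S7, S8]
```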
Result post-processing (multi-strategy fusion): the keyword strategy and the data intention recognition and similarity strategy are iteratively updated as independent models in the training stage. In the testing stage, to obtain a more accurate topic label, corresponding weights are set for the strategy scores score_keyword = [S1, S2, S3, S4] and score_classification = [S5, S6, S7, S8] according to their contribution, and the final h_keyword and h_classification are obtained through repeated experimental tuning. Each class label score is obtained as S1*h_keyword + S5*h_classification, the maximum label score is obtained with max(S1*h_keyword + S5*h_classification), and the comparison max[max(S1*h_keyword + S5*h_classification), Threshold] is made: if the class label score is greater than the threshold, the data belongs to the corresponding label, otherwise it belongs to the 'other' class. The final score label is:

score_total = [max[max(S1*h_keyword + S5*h_classification), Threshold], ......]
            = [label_1, label_2, ......, label_n]

The mode of [label_1, label_2, ......, label_n] is taken; if the mode is the 'other' label, the second most frequent label is taken as the candidate label, otherwise the mode itself is the candidate label. A mode threshold is also set: the candidate label is output only when it exceeds the mode threshold, otherwise the 'other' label is still output.
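For illustration only, the fusion and decision logic described above can be read as the following sketch; the per-class fusion generalizes the S1/S5 example, and the weights, threshold, and the count-based reading of the mode threshold are assumptions rather than values from the patent.

```python
from collections import Counter

def fuse_scores(score_keyword, score_classification,
                h_keyword=0.4, h_classification=0.6, threshold=0.5):
    """score_keyword = [S1..S4], score_classification = [S5..S8].
    Per-class fusion generalizing the S1/S5 example; weights and threshold
    are illustrative values tuned experimentally in practice."""
    fused = [k * h_keyword + c * h_classification
             for k, c in zip(score_keyword, score_classification)]
    best = max(fused)
    return fused.index(best) if best > threshold else "other"

def decide_batch(labels, mode_threshold=3):
    """Mode-based decision over per-item labels: fall back to the second most
    frequent label when the mode is 'other', and still output 'other' when the
    winning count does not exceed the mode threshold (count-based reading)."""
    counts = Counter(labels).most_common()
    label, count = counts[0]
    if label == "other" and len(counts) > 1:
        label, count = counts[1]
    return label if count > mode_threshold else "other"
```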
The invention is based on short text communication data and, through methods such as deep neural networks and data enhancement, enables a machine to distinguish short text communication data labels more finely. Compared with the prior art, the invention has the following innovation points. Innovation point 1: a complete short text data classification method framework based on multi-strategy fusion is claimed. Innovation point 2: in keyword extraction, to obtain the keywords most relevant to the problem, the TF-IDF calculation method is improved and a part-of-speech tagging strategy is fused, focusing on nouns in the data and combining verb data to form proper compound keywords, so that the obtained keywords better fit the extraction logic; at the same time, useless information such as common non-sensitive words is removed through man-machine cooperation to obtain more accurate single keywords; reverse keywords are obtained through data analysis and combing, improving keyword screening accuracy. Innovation point 3: to address problems such as class imbalance in text classification data, the invention provides a text classification algorithm solution fused with data enhancement: a BERT pre-training language model with domain common sense is adopted, and the domain common sense and knowledge triples are fused to fine-tune the BERT model and complete domain data adaptation; data enhancement is performed in an entity-enhancement manner to increase data diversity, and batch data replacement is completed by means of a graph mapping table (self-built, with several replaceable entity dictionaries in the table, mapped through hard-matching dictionaries) to increase the data volume; finally, the entity-enhanced data and the data to be processed are merged, and text classification is completed based on the pre-training model and the bert+softmax model, further improving text classification accuracy.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (9)
1. A short text classification method based on multi-strategy fusion, which is characterized by comprising the following steps:
step SS1: completing data labeling by combing through the labeled short text data, and generating network short text data;
step SS2: preprocessing the network short text data obtained in the step SS1 to generate standardized data after text preprocessing;
step SS3: performing TF-IDF keyword classification based on part-of-speech tagging on the standardized data after text preprocessing, calculating the correlation between words and all sentences in the library, and obtaining a keyword list by sorting and removing highly sensitive common words through man-machine collaboration; performing data analysis on the standardized data after text preprocessing, extracting and outputting combined keywords and a reverse keyword list, and configuring corresponding weights in preparation for subsequent content intention recognition and data screening;
step SS4: for the standardized data after text preprocessing, adopting a BERT pre-training language model with domain common sense, fusing the domain common sense and knowledge triples to fine-tune the BERT model and complete domain data adaptation; performing data enhancement with an entity-enhancement-based data enhancement algorithm to increase data diversity, and completing data volume enhancement with a fused knowledge-graph data enhancement algorithm by means of entity extraction and a self-built graph mapping table; finally, merging the entity-enhanced data and the data to be processed to complete text classification based on the pre-training model and the bert+softmax model;
step SS5: outputting the data category labels and probabilities according to the classification results of step SS3 and step SS4.
2. The short text classification method based on multi-strategy fusion according to claim 1, wherein step SS2 specifically comprises: matching special characters and sensitive words containing deformed characters and emoticons with regular expressions; for network short text data exhibiting interception-evading behavior, matching keywords with regular expressions and replacing them; for popular internet terms and traditional Chinese characters, establishing a related word library and then matching and replacing with regular expressions; uniformly converting traditional Chinese characters to simplified Chinese and upper-case English letters to lower case, performing stop-word processing, and removing invalid characters and website address information.
3. The short text classification method based on multi-strategy fusion according to claim 1, wherein step SS3 specifically comprises: calculating the TF of each word and the IDF across words from the standardized data after text preprocessing, computing the TF-IDF values, sorting them to obtain keywords, and removing highly sensitive words through man-machine cooperation to obtain a keyword list, wherein TF denotes term frequency, calculated as TF = (number of times a word appears in a document) / (total number of words in the document), and IDF denotes inverse document frequency, calculated as IDF = log(total number of documents in the corpus / (number of documents containing the word + 1)); after the TF term frequency and the IDF inverse document frequency are obtained, the two values are multiplied to obtain the TF-IDF value of a word; historical category label data are then analyzed from the standardized data after text preprocessing, category label key features are extracted, and combined keywords are obtained through repeated experimental iteration; finally, the keyword and combined keyword lists are output.
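For reference, a minimal sketch of the TF and IDF computation described in this claim, with tokenization and corpus handling simplified:

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists.  Returns {doc_index: {word: TF-IDF}}.
    TF  = occurrences of the word in the document / total words in the document
    IDF = log(total documents / (documents containing the word + 1))"""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = {}
    for i, doc in enumerate(documents):
        counts, total = Counter(doc), len(doc)
        scores[i] = {w: (c / total) * math.log(n_docs / (doc_freq[w] + 1))
                     for w, c in counts.items()}
    return scores
```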
4. The short text classification method based on multi-strategy fusion according to claim 3, wherein step SS3 specifically further comprises: performing noise reduction on the standardized data after text preprocessing by removing adverbs, adjectives, pronouns, conjunctions and their sub-category part-of-speech words, and improving the TF calculation formula such that the TF-IDF value of shorter words is increased by decreasing the term frequency weight of longer words.
5. The short text classification method based on multi-strategy fusion according to claim 3, wherein step SS3 specifically further comprises: calculating the corresponding TF-IDF value for every word in the documents, filtering common words out of the TF-IDF-ranked keywords through man-machine cooperation, outputting a single-keyword word list, and configuring a corresponding weight h_single according to keyword importance; extracting, for each category, short words whose TF-IDF value exceeds a set threshold and calculating combined keywords, searching the extracted nouns in the original database to obtain noun and verb combined keywords, reducing the probability value of multi-word combinations by the factor (number of combined words)^(-1), and calculating the TF-IDF value of the combined keywords by combining the following formulas; finally, analyzing the data over multiple rounds in combination with data analysis, extracting category label key features, carrying out repeated experimental iteration, outputting the combined keyword and reverse keyword list, and configuring the corresponding weights h_combined and h_reverse; the data screening strategy is:

Score_keyword = h_single * Score_single + h_combined * Score_combined - h_reverse * Score_reverse

in preparation for subsequent content intention recognition and data screening, where h_combined is the weight configured for combined keywords, h_reverse is the weight configured for the reverse keyword list, h_single is the weight configured for single keywords, Score_single, Score_combined and Score_reverse are the scoring results of the single keywords, the combined keywords and the reverse keyword list respectively, and Score_keyword is the comprehensive scoring result of the TF-IDF keyword classification.
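A minimal illustration of the comprehensive keyword score defined in this claim; the weight values are placeholders, since the claim leaves them to be configured by importance.

```python
def keyword_score(score_single, score_combined, score_reverse,
                  h_single=0.5, h_combined=0.4, h_reverse=0.6):
    """Score_keyword = h_single*Score_single + h_combined*Score_combined
                       - h_reverse*Score_reverse
    The three inputs are the per-category hit scores computed elsewhere;
    the weight values here are illustrative placeholders."""
    return (h_single * score_single
            + h_combined * score_combined
            - h_reverse * score_reverse)
```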
6. The short text classification method based on multi-strategy fusion according to claim 1, wherein step SS4 specifically comprises: the data for the pre-training model comprise domain data, a domain common-sense library and knowledge triples, and adopting the BERT pre-training language model with domain common sense to fuse the domain common sense and the knowledge triples for fine-tuning the BERT model and completing domain data adaptation specifically comprises: fusing the domain common-sense information and the knowledge triples into the pre-training data so that the pre-training model possesses the corresponding expert knowledge; performing MASK at the word level instead of the original character level, integrating more prior knowledge and reducing damage to the word structure specific to Chinese; instead of the original scheme of first training on texts of length 128 and then fine-tuning on length 64 in two stages, training directly on a sequence length of 64 so that the model adapts to short texts; adjusting the gradient-accumulation hyperparameter so that batches are accumulated to 64 before computing the loss once, simulating the subsequent real training scenario; adopting a larger batch size and a longer training step length; removing the original next-sentence-prediction loss function; and inputting the adapted data into the pre-training model and training for 800,000 steps with a learning rate of 2e-6 to obtain the domain BERT pre-training model.
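The adaptation choices in this claim can be summarized as a configuration sketch; the field names below are assumptions for illustration and do not correspond to any particular training library.

```python
from dataclasses import dataclass

@dataclass
class DomainPretrainConfig:
    """Hypothetical configuration mirroring the adaptation steps of this claim."""
    mask_level: str = "word"               # whole-word masking instead of character level
    max_seq_length: int = 64               # train directly on length-64 short texts
    gradient_accumulation_steps: int = 64  # accumulate batches to 64 before one loss step
    use_nsp_loss: bool = False             # next-sentence-prediction loss removed
    learning_rate: float = 2e-6
    train_steps: int = 800_000
    extra_corpora: tuple = ("domain common-sense library", "knowledge triples")
```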
7. The short text classification method based on multi-strategy fusion according to claim 1, wherein performing data enhancement with the entity-enhancement-based data enhancement algorithm in step SS4 specifically comprises:
linearizing the labeled sentences, training a language model on the linearized data and generating new labeled data, wherein during linearization the label is inserted before the corresponding words so that the label is treated as a modifier of the words; for the O labels that appear frequently in sequence labeling tasks, the O labels are deleted when the labeled data are linearized; after the labeled data are linearized, the special marks [BOS] and [EOS] are added at the beginning and the end of each sentence respectively to mark sentence boundaries and help the language model complete training and data generation;
after the linearization of the labeled data is completed, a language model is used to learn the distribution of words and labels; a random mask mechanism is applied to the input of unlabeled data, increasing data diversity without changing the semantics and structure of the original data; token embeddings that better fit the real data logic are generated by means of the fine-tuned pre-training model; and the language model is built in combination with a BiLSTM recurrent neural network to fuse context information and generate data;
the language model is trained by maximizing the probability of next-token prediction: given a sentence, the token sequence (w_1, w_2, ..., w_N) is first randomly masked to produce R_t = Random_mask(w_t); then (R_1, R_2, ..., R_N) is input into the BERT pre-training model to obtain the token embeddings (x_1, x_2, ..., x_N), where N is the sequence length; these are input into the BiLSTM to generate a hidden state at each position t, h_t = LSTM(x_1, x_2, ..., x_{t-1}); dropout is applied to each hidden state h_t to produce d_t = dropout(h_t); and finally a linear layer plus a softmax layer is used to predict the next token in the sequence;
after the language model is trained, it is used to generate synthetic training data for the labeling task: during generation, only the [BOS] token is input into the language model, the probability of every token in the vocabulary is calculated according to the formula, and, given [BOS], sentences are generated autoregressively, the token generated at the previous step serving as input to produce the next token, so that all subsequent tokens are obtained one by one; the generated text sequences are in the linearized format, and data post-processing is completed in a rule-based manner: cleaning sentences without labels; cleaning sentences in which every word is [UNK]; cleaning sentences with incorrect label order; and purging label-conflict data and duplicate data.
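A minimal PyTorch sketch of the language model described in this claim; the hidden size, dropout rate, mask rate and checkpoint name are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class LinearizedLM(nn.Module):
    """BERT token embeddings -> BiLSTM -> dropout -> linear layer over the
    vocabulary, trained to predict the next token of the linearized sequence."""
    def __init__(self, vocab_size, bert_name="bert-base-chinese",
                 hidden=256, dropout=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, input_ids, attention_mask):
        x = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(x)
        return self.out(self.drop(h))  # logits; softmax is applied in the loss

def random_mask(token_ids, mask_id, rate=0.15):
    """Randomly replace tokens with [MASK] to increase data diversity."""
    mask = torch.rand(token_ids.shape) < rate
    return torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)
```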
8. The short text classification method based on multi-strategy fusion according to claim 1, wherein, in step SS4, adopting the fused knowledge-graph data enhancement algorithm to enhance the data volume by means of entity extraction and the self-built graph mapping table specifically comprises:
completing the extraction of entities with specific tags based on BERT+BiLSTM+self-attention+CRF, completing batch replacement of data through hard-matching dictionary mapping in combination with an existing knowledge base, and increasing the data volume;
in the specific-tag entity extraction, text features are extracted with the BERT pre-training model to obtain a word-granularity vector matrix; context information is learned with a BiLSTM model; multiple self-attention heads project the original information into different spaces so that contextual semantic information is captured from multiple dimensions, with several sets of attention parameters shared during training to emphasize key text features and learn the contextual semantics; finally, a CRF layer processes the BiLSTM output sequence and, combining the state transition matrix between adjacent labels, obtains the globally optimal label sequence, completing entity extraction;
the extracted entities are batch-replaced through hard-matching dictionary mapping, completing the knowledge-graph-based data enhancement;
finally, the text classification task is completed with bert+softmax to obtain score_classification = [S5, S6, S7, S8].
9. The short text classification method based on multi-strategy fusion according to claim 1, wherein step SS5 specifically comprises:
the keyword strategy and the data intention recognition and similarity strategy are iteratively updated as independent models in the training stage; in the testing stage, to obtain a more accurate topic label, corresponding weights are set for the strategy scores score_keyword = [S1, S2, S3, S4] and score_classification = [S5, S6, S7, S8] according to their contribution, and the final h_keyword and h_classification are obtained through repeated experimental tuning; each class label score is obtained as S1*h_keyword + S5*h_classification, the maximum label score is obtained with max(S1*h_keyword + S5*h_classification), and the comparison max[max(S1*h_keyword + S5*h_classification), Threshold] is made: if the class label score is greater than the threshold, the data belongs to the corresponding label, otherwise it belongs to the 'other' class; the final score label is:
score_total = [max[max(S1*h_keyword + S5*h_classification), Threshold], ......] = [label_1, label_2, ......, label_n]
the mode of [label_1, label_2, ......, label_n] is taken; if the mode is the 'other' label, the second most frequent label is taken as the candidate label, otherwise the mode itself is the candidate label; a mode threshold is also set, and the candidate label is output only when it exceeds the mode threshold, otherwise the 'other' label is still output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310446513.4A CN117251524A (en) | 2023-04-24 | 2023-04-24 | Short text classification method based on multi-strategy fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117251524A true CN117251524A (en) | 2023-12-19 |
Family
ID=89133838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310446513.4A Pending CN117251524A (en) | 2023-04-24 | 2023-04-24 | Short text classification method based on multi-strategy fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117251524A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117787224A (en) * | 2023-12-27 | 2024-03-29 | 江南大学 | Controllable story generation method based on multi-source heterogeneous feature fusion |
CN118069852A (en) * | 2024-04-22 | 2024-05-24 | 数据空间研究院 | Multi-model fusion data classification prediction method and system |
CN118503419A (en) * | 2024-05-09 | 2024-08-16 | 北京九栖科技有限责任公司 | Junk short message classification method and device based on pseudo tag fusion clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||