CN117556049B - Text classification method of regular expression generated based on large language model - Google Patents
- Publication number
- CN117556049B CN117556049B CN202410034646.5A CN202410034646A CN117556049B CN 117556049 B CN117556049 B CN 117556049B CN 202410034646 A CN202410034646 A CN 202410034646A CN 117556049 B CN117556049 B CN 117556049B
- Authority
- CN
- China
- Prior art keywords
- data
- text
- text classification
- regular expressions
- matched
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of text classification, and in particular to a text classification method based on regular expressions generated by a large language model, comprising the following steps. S1: initialize the text classification method, define text classification labels, and use a large language model to generate a regular expression set containing white regular expressions and black regular expressions for a plurality of classification labels. S2: acquire the text data to be classified. S3: use the large language model to judge the semantic integrity of the text data, and filter out semantically incomplete text data. S4: match the text data against the white regular expressions in the regular expression set, use the black regular expressions in the set to filter out classification labels that do not match the text data, and then attach the matched text classification labels to the text data. The invention classifies text data with a regular expression set and achieves high classification accuracy.
Description
Technical Field
The invention relates to the technical field of text classification, and in particular to a text classification method based on regular expressions generated by a large language model.
Background
With the rapid development of big data and machine learning, text classification has become an important research direction in natural language processing. Traditional text classification methods mainly rely on model training and keyword extraction, which generally require massive training data to ensure classification accuracy, such as the classification method for Chinese ultra-long text based on a large language model disclosed in Chinese patent publication CN116821348A.
Disclosure of Invention
The technical problem the invention aims to solve is that existing text classification methods are strongly data-dependent, and classifying by extracted keywords yields low accuracy.
To solve this problem, the invention adopts the following technical scheme: a text classification method based on regular expressions generated by a large language model, comprising the following steps:
S1: initialize the text classification method, define text classification labels, use a large language model to generate a regular expression set containing regular expressions for each of a plurality of text classification labels, set the audited regular expressions as white regular expressions, and then use the large language model to generate corresponding black regular expressions from the white regular expressions;
S2: acquire the text data to be classified;
S3: use the large language model to judge the semantic integrity of the text data, and filter out semantically incomplete text data;
S4: match the text data against the white regular expressions in the regular expression set, filter out text classification labels that do not match the text data using the black regular expressions in the set, and then attach the matched text classification labels to the text data.
In operation, the regular expression set generated by the large language model completes the classification of text data quickly and accurately; matching the text data against both white and black regular expressions filters out incorrectly matched text classification labels and further improves accuracy.
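The S4 matching logic, in which white regular expressions propose labels and black regular expressions veto false matches, can be sketched as follows. This is a minimal illustration: the label names and patterns are invented stand-ins for the LLM-generated expressions, not taken from the patent.

```python
import re

# Hypothetical label -> pattern sets; labels and patterns are illustrative
# stand-ins for the expressions a large language model would generate.
WHITE_REGEX = {
    "refund": [r"refund|money back", r"return.*(item|order)"],
    "shipping": [r"\b(ship|deliver)", r"track.*package"],
}
BLACK_REGEX = {
    "refund": [r"no refund needed"],
    "shipping": [],
}

def classify(text):
    """White regexes propose labels; black regexes veto false matches (S4)."""
    labels = []
    for label, patterns in WHITE_REGEX.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            if not any(re.search(b, text, re.IGNORECASE)
                       for b in BLACK_REGEX[label]):
                labels.append(label)
    return labels
```

A black regular expression only removes a label that a white regular expression already proposed, which is why the veto check sits inside the white-match branch.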
Preferably, the method further comprises obtaining annotation data carrying text classification labels and question-answer data without labels, where each question-answer record contains a question and an answer; the question-answer data is screened with several preset preprocessing rules to filter out records irrelevant to the classification task.
In operation, filtering out irrelevant question-answer data speeds up training and shortens the training cycle.
Preferably, in step S1, the following steps are used when generating with the large language model a regular expression set containing regular expressions for each of a plurality of text classification labels:
A1: generate semantic vector representations for the annotation data with a preset sentence vector inference model;
A2: for each annotation record, recall matching question-answer data in the semantic space of a preset vector index library;
A3: feed the recalled question-answer data into the large language model for secondary classification judgment, filter out records that do not match the semantics of the corresponding text classification label, and mark the matching records as annotation data;
A4: group annotation data belonging to the same text classification label through a preset keyword lexicon, generate a syntax tree for each annotation record with a parsing tool, capture its syntactic and semantic information through the syntax tree, and store this information in one-to-one correspondence with the annotation data;
A5: input several annotation records sharing the same keyword, their syntactic information, and a preset regular expression paradigm into the large language model, generate several regular expressions, and store them in the regular expression set of the text classification label corresponding to the keyword.
In operation, this fully exploits the few-shot learning ability of the large language model for regular expression generation: guided by the regular expression paradigm, the model quickly and efficiently generates regular expressions for specific texts, converting its semantic understanding into executable patterns with a high degree of freedom and high classification accuracy.
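The A5 prompt assembly can be sketched as a small helper. This is a hypothetical illustration: the function name, arguments, and prompt wording are invented here, not the patent's actual template.

```python
def build_regex_prompt(examples, keyword, paradigm):
    """Assemble a few-shot prompt for step A5: annotation records with their
    syntactic information, a shared keyword, and a regex paradigm.
    All names and the prompt wording are illustrative assumptions."""
    shots = "\n".join(f"annotation: {text}; syntax: {syntax}"
                      for text, syntax in examples)
    return (f"Regular expression paradigm: {paradigm}\n"
            f"Keyword: {keyword}\n"
            "Generate one correct regular expression per annotation below, "
            "returning only the expressions:\n" + shots)
```

The returned string would then be sent to the large language model; the model call itself is omitted since the patent does not fix a particular API.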
Preferably, in step S1, the following steps are used when setting the audited regular expressions as white regular expressions and then generating corresponding black regular expressions with the large language model:
B1: after the generated regular expressions pass the audit, set them as white regular expressions and store them in the regular expression set of the text classification label;
B2: match the white regular expressions of a text classification label against question-answer data that does not carry that label;
B3: feed the matched question-answer data into the large language model for secondary classification judgment, screen out records whose semantics do not match the corresponding text classification label, input their syntactic information together with the preset regular expression paradigm into the large language model, generate several black regular expressions, and store them in the regular expression set of the text classification label.
In operation, auditing the regular expressions improves the robustness of the white regular expressions and the classification accuracy; generating black regular expressions through the model's secondary classification judgment filters out incorrectly matched text classification labels and further improves accuracy.
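Steps B2 and B3 amount to a false-positive harvest: white regular expressions over-match unlabeled question-answer data, and a secondary judgment flags the mismatches that seed the black regular expressions. A minimal sketch, with the LLM judgment stubbed as a callable:

```python
import re

def find_false_positives(white_patterns, questions, judge):
    """Return questions that a white regex matches (B2) but the secondary
    classification judgment rejects (B3); `judge` stands in for the LLM call."""
    hits = [q for q in questions
            if any(re.search(p, q) for p in white_patterns)]
    return [q for q in hits if not judge(q)]
```

Each returned question, together with its syntactic information, would then be fed back to the model to generate a more restrictive black regular expression.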
Preferably, the sentence vector inference model is trained with the following steps:
C1: set questions sharing the same answer, or the same kind of answer, as positive samples, and set questions with different answers or of different kinds as negative samples;
C2: combine questions sharing the same answer pairwise into positive sample pairs, and fine-tune a base model with the positive pairs and negative samples through contrastive learning to obtain the sentence vector inference model.
In operation, fine-tuning the sentence vector inference model on a suitable training set improves its fit to the text data to be classified; the semantic vector representations it generates are more accurate and more discriminative, making positive and negative samples easy to separate.
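The C1-C2 pair construction, before any fine-tuning takes place, can be sketched as follows; grouping by exact answer stands in for the "same kind of answer" criterion, which the patent leaves to the implementer.

```python
from itertools import combinations

def build_pairs(qa_pairs):
    """Group questions by answer: same-answer questions form positive pairs,
    cross-answer questions form negative pairs (C1-C2 sketch)."""
    by_answer = {}
    for question, answer in qa_pairs:
        by_answer.setdefault(answer, []).append(question)
    positives = [pair for qs in by_answer.values()
                 for pair in combinations(qs, 2)]
    negatives = [(q1, q2)
                 for a1, a2 in combinations(list(by_answer), 2)
                 for q1 in by_answer[a1] for q2 in by_answer[a2]]
    return positives, negatives
```

The resulting pairs would feed a contrastive objective (e.g. an InfoNCE-style loss) when fine-tuning the base model.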
Preferably, the vector index library is established with the following steps:
D1: generate semantic vector representations for the questions in the question-answer data with the sentence vector inference model;
D2: generate a corresponding identifier for the question in each question-answer record with a hash algorithm;
D3: store the question vectors in one-to-one correspondence with their identifiers as the vector index library.
In operation, an efficient, quickly searchable vector index library improves recall efficiency and training efficiency, further shortening the training cycle.
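Steps D1-D3 can be sketched as follows; `embed` stands in for the sentence vector inference model, and SHA-256 with a 16-character prefix is an illustrative choice of hash identifier, since the patent does not name a specific algorithm.

```python
import hashlib

def build_vector_index(questions, embed):
    """D1-D3 sketch: map a stable hash identifier to each question's vector.
    `embed` stands in for the sentence vector inference model."""
    index = {}
    for q in questions:
        qid = hashlib.sha256(q.encode("utf-8")).hexdigest()[:16]  # D2
        index[qid] = {"question": q, "vector": embed(q)}          # D1 + D3
    return index
```

A production system would back this dictionary with an approximate-nearest-neighbour index for the semantic recall in step A2.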
Preferably, the keyword lexicon is established with the following steps:
E1: segment the annotation data and question-answer data with a word segmentation tool, and store the resulting words as a data set;
E2: analyze the data set with the TF-IDF algorithm and assign a weight to each word;
E3: set words whose weight exceeds a preset keyword threshold as keywords;
E4: train word embeddings on the data set to obtain a vector representation of each word;
E5: expand the keywords by screening semantically matched words based on vector similarity, and store the result as the keyword lexicon.
In operation, this segments the training data and extracts keywords; the keyword lexicon provides semantic cues to the large language model when generating regular expressions.
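Steps E1-E3 can be sketched with a plain TF-IDF pass. Whitespace tokenization stands in for the word segmentation tool, and the threshold value and smoothing are illustrative choices, not the patent's.

```python
import math
from collections import Counter

def tfidf_keywords(docs, threshold=0.1):
    """E1-E3 sketch: score tokens by TF-IDF over whitespace-tokenized docs
    and keep tokens scoring above the (illustrative) threshold."""
    tokenized = [d.split() for d in docs]                      # E1 stand-in
    df = Counter(w for doc in tokenized for w in set(doc))
    n = len(docs)
    keywords = set()
    for doc in tokenized:
        tf = Counter(doc)
        for w, count in tf.items():
            # Smoothed IDF to avoid division by zero and log(0).
            score = (count / len(doc)) * math.log((n + 1) / (df[w] + 1) + 1)
            if score > threshold:                              # E3
                keywords.add(w)
    return keywords
```

For Chinese text, a segmentation tool (e.g. a jieba-style tokenizer) would replace `split()`, and step E5 would expand the result with embedding neighbours.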
Preferably, in step S2, a data preprocessing step follows the acquisition of the text data to be classified: the text data is screened with several preset preprocessing rules to filter out data irrelevant to the current classification task.
Preferably, in step S3, when judging semantic integrity, the large language model uses a preset few-shot learning corpus and preset chain-of-thought prompts to evaluate the semantic integrity of the text data to be classified.
In operation, feeding the few-shot corpus and chain-of-thought prompts to the large language model fully exercises its chain-of-thought ability, allowing it to comprehensively evaluate the semantic integrity of text data with high accuracy.
Preferably, in step S4, the following steps are used when matching the text data against the white regular expressions, filtering out non-matching text classification labels with the black regular expressions, and attaching the matched labels:
F1: traverse the white regular expressions of all text classification labels and match them against the text data to be classified; if any white regular expression matches, go to step F2; if none matches, end the classification;
F2: traverse the black regular expressions of the text classification labels matched by the text data; if no black regular expression matches and the text data matches white regular expressions of only one label, mark the text data with that label; if no black regular expression matches but the text data matches white regular expressions of several labels, go to step F3; if a black regular expression matches, go to step F4;
F3: generate a semantic vector representation of the text data with the preset sentence vector inference model, recall matching annotation data in the semantic space, select several annotation records carrying text classification labels, then take the mode of those labels and set it as the text classification label of the text data;
F4: after filtering out the labels hit by black regular expressions, if white regular expressions of several labels still match, go to step F3; if only one label's white regular expressions match, mark the text data with that label; if no matched label remains, end the classification.
The beneficial technical effects of the invention include:
1. The regular expression set generated by the large language model completes text classification quickly and accurately; matching text data against both white and black regular expressions filters out incorrectly matched text classification labels and further improves accuracy.
2. The invention fully exploits the few-shot learning ability of the large language model for regular expression generation: guided by the regular expression paradigm, the model quickly and efficiently generates regular expressions for specific texts, converting its semantic understanding into executable patterns with a high degree of freedom and high classification accuracy.
3. Auditing the regular expressions improves the robustness of the white regular expressions and the classification accuracy; generating black regular expressions through the model's secondary classification judgment filters out incorrectly matched text classification labels and further improves accuracy.
4. Fine-tuning the sentence vector inference model on a suitable training set improves its fit to the text data to be classified; the semantic vector representations it generates are more accurate and more discriminative, making positive and negative samples easy to separate.
5. Feeding the few-shot learning corpus and chain-of-thought prompts to the large language model fully exercises its chain-of-thought ability, allowing it to comprehensively evaluate the semantic integrity of text data with high accuracy.
Other features and advantages of the present invention will be disclosed in the following detailed description of the invention and the accompanying drawings.
Drawings
The invention is further described with reference to the accompanying drawings:
FIG. 1 is a flow chart of a text classification method based on regular expressions generated by a large language model;
FIG. 2 is a flow diagram of a large language model generating a regular expression set;
FIG. 3 is a flow chart of a large language model generating a white regular expression and a black regular expression;
FIG. 4 is a text classification flow chart according to the first embodiment.
Detailed Description
The technical solutions of the embodiments of the invention are explained and illustrated below with reference to the drawings; the following embodiments are only preferred embodiments of the invention, not all of them. Based on these embodiments, all other examples obtained by those skilled in the art without inventive effort fall within the scope of the invention.
Embodiment one:
Referring to FIG. 1, this embodiment discloses a text classification method based on regular expressions generated by a large language model, comprising the following steps:
S1: initialize the text classification method, define text classification labels, use a large language model to generate a regular expression set containing regular expressions for each of a plurality of text classification labels, set the audited regular expressions as white regular expressions, and then use the large language model to generate corresponding black regular expressions from the white regular expressions;
S2: acquire the text data to be classified;
S3: use the large language model to judge the semantic integrity of the text data, and filter out semantically incomplete text data;
S4: match the text data against the white regular expressions in the regular expression set, filter out text classification labels that do not match the text data using the black regular expressions in the set, and then attach the matched text classification labels to the text data.
In operation, the regular expression set generated by the large language model completes the classification of text data quickly and accurately; matching the text data against both white and black regular expressions filters out incorrectly matched text classification labels and further improves accuracy.
Preferably, the method further comprises obtaining annotation data carrying text classification labels and question-answer data without labels, where each question-answer record contains a question and an answer; the question-answer data is screened with several preset preprocessing rules to filter out records irrelevant to the classification task.
In operation, filtering out irrelevant question-answer data speeds up training and shortens the training cycle.
Preferably, in step S2, a data preprocessing step follows the acquisition of the text data to be classified: the text data is screened with several preset preprocessing rules to filter out data irrelevant to the current classification task.
In a specific implementation, the preset preprocessing rules may be defined manually. In this embodiment, the following rules filter question-answer data irrelevant to the current classification: filter purely numeric data, links, and non-Chinese data; filter long text longer than 32 characters and short text shorter than 4 characters; filter invalid data such as order numbers, addresses, and system messages; and, using an existing set of common e-commerce questions, filter general e-commerce questions about delivery, ordering, small talk, and the like.
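These example rules can be sketched as a single filter function. This is a minimal sketch: the character-length bounds follow the embodiment, while the link pattern and the Chinese-character check are illustrative implementations.

```python
import re

def keep_for_classification(text):
    """Apply the embodiment's example preprocessing rules: drop pure digits,
    links, texts without Chinese characters, and out-of-range lengths."""
    if text.isdigit():                            # purely numeric data
        return False
    if re.search(r"https?://", text):             # links
        return False
    if not re.search(r"[\u4e00-\u9fff]", text):   # non-Chinese data
        return False
    return 4 <= len(text) <= 32                   # length bounds
```

Order numbers, addresses, system messages, and the general e-commerce question set would be handled by further rules of the same shape.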
Referring to FIG. 2, preferably, in step S1, the following steps are used when generating with the large language model a regular expression set containing regular expressions for each of a plurality of text classification labels:
A1: generate semantic vector representations for the annotation data with a preset sentence vector inference model;
A2: for each annotation record, recall matching question-answer data in the semantic space of a preset vector index library;
A3: feed the recalled question-answer data into the large language model for secondary classification judgment, filter out records that do not match the semantics of the corresponding text classification label, and mark the matching records as annotation data;
A4: group annotation data belonging to the same text classification label through a preset keyword lexicon, generate a syntax tree for each annotation record with a parsing tool, capture its syntactic and semantic information through the syntax tree, and store this information in one-to-one correspondence with the annotation data;
A5: input several annotation records sharing the same keyword, their syntactic information, and a preset regular expression paradigm into the large language model, generate several regular expressions, and store them in the regular expression set of the text classification label corresponding to the keyword.
In operation, this fully exploits the few-shot learning ability of the large language model for regular expression generation: guided by the regular expression paradigm, the model quickly and efficiently generates regular expressions for specific texts, converting its semantic understanding into executable patterns with a high degree of freedom and high classification accuracy.
In a specific implementation, the following prompt may be input into the large language model to generate regular expressions:
"Annotation data 1, regular expression 1;
annotation data 2, regular expression 2;
annotation data 3, regular expression 3.
Referring to the regular expressions of the annotation data above, and combining the keyword (keyword 1) with the syntactic information of each annotation record, generate several correct regular expressions for the following batch of annotation data, and return only the regular expressions:
annotation data to generate 1, syntactic information 1;
annotation data to generate 2, syntactic information 2;
annotation data to generate 3, syntactic information 3."
Referring to FIG. 3, preferably, in step S1, the following steps are used when setting the audited regular expressions as white regular expressions and then generating corresponding black regular expressions with the large language model:
B1: after the generated regular expressions pass the audit, set them as white regular expressions and store them in the regular expression set of the text classification label;
B2: match the white regular expressions of a text classification label against question-answer data that does not carry that label;
B3: feed the matched question-answer data into the large language model for secondary classification judgment, screen out records whose semantics do not match the corresponding text classification label, input their syntactic information together with the preset regular expression paradigm into the large language model, generate several black regular expressions, and store them in the regular expression set of the text classification label.
In operation, auditing the regular expressions improves the robustness of the white regular expressions and the classification accuracy; generating black regular expressions through the model's secondary classification judgment filters out incorrectly matched text classification labels and further improves accuracy.
In a specific implementation, the following prompt may be input into the large language model to generate a black regular expression:
"For the mismatched annotation data 1, on the basis of the matching regular expression, combine the syntactic information to generate a more restrictive regular expression that matches this question, and return only the regular expression."
In this embodiment, the large language model is preferably the ChatGLM-6B dialogue model, but any other existing large language model may be used.
Referring to fig. 4, preferably, in the step S4, matching is performed between the text data and a plurality of white regular expressions in the regular expression set, filtering is performed between text classification labels that are not matched with the text data according to a plurality of black regular expressions in the regular expression set, and then when matched text classification labels are added to the text data, the following steps are adopted:
F1: traversing the white regular expressions of all text classification labels to match with the text data to be classified, turning to step F2 when the matched white regular expressions exist in the text data, and ending the text classification when the matched white regular expressions do not exist in the text data;
f2: traversing all black regular expressions of text classification labels matched with the text data, marking the text data by adopting the text classification labels when no matched black regular expression exists and the text data is matched with only one white regular expression of the text classification labels, switching to a step F3 when no matched black regular expression exists and the text data is matched with a plurality of white regular expressions of a plurality of text classification labels, and switching to a step F4 when the matched black regular expression exists;
F3: generating semantic vector representation for the text data by adopting a preset sentence vector reasoning model, recalling matched annotation data in a semantic space according to the vector representation of the text data, selecting a plurality of annotation data with text classification labels, and then acquiring the mode of the text classification labels and setting the mode as the text classification label of the text data;
F4: after filtering text classification labels with matched black regular expressions, when a plurality of white regular expressions matched with a plurality of text classification labels exist, turning to step F3, when the text classification labels are matched with the white regular expressions of one text classification label, marking the text data by adopting the text classification labels, and when the matched text classification labels do not exist, ending the text classification.
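The matching flow F1–F4 above can be sketched as follows. The label names and regex patterns are illustrative assumptions, not the expressions generated in the embodiment, and step F3's semantic-neighbour vote is reduced to a deterministic stub; steps F2 and F4 are merged into a single black-regex veto pass for brevity.

```python
import re

# Illustrative rule sets; label names and patterns are hypothetical examples.
WHITE_REGEXES = {
    "allergy": [r"allerg(y|ic|en)", r"skin\s+reaction"],
    "dryness": [r"\bdry\b", r"dehydrat"],
}
BLACK_REGEXES = {
    # Vetoes false matches for the "dryness" label, e.g. "dry shampoo".
    "dryness": [r"dry\s+shampoo"],
}

def vote_with_neighbors(text, candidate_labels):
    # Stand-in for step F3: the embodiment recalls labelled neighbours in
    # semantic space and takes the mode of their labels; here we just pick
    # the alphabetically first candidate as a placeholder.
    return sorted(candidate_labels)[0]

def classify(text):
    # F1: collect labels with a matching white regex; none ends classification.
    matched = {lbl for lbl, pats in WHITE_REGEXES.items()
               if any(re.search(p, text, re.IGNORECASE) for p in pats)}
    if not matched:
        return None
    # F2/F4: drop labels vetoed by a matching black regex.
    survivors = {lbl for lbl in matched
                 if not any(re.search(p, text, re.IGNORECASE)
                            for p in BLACK_REGEXES.get(lbl, []))}
    if not survivors:
        return None
    if len(survivors) == 1:
        return survivors.pop()
    # F3: remaining ambiguity is resolved by semantic-neighbour voting.
    return vote_with_neighbors(text, survivors)

print(classify("is this cream allergenic"))   # -> allergy
print(classify("dry shampoo instructions"))   # -> None (black-regex veto)
```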
Embodiment two:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further comprises the step of training a sentence vector reasoning model:
c1: setting questions with the same answer or the same kind of answer in question and answer data as positive samples, and setting questions with different answers or different kinds of questions in question and answer data as negative samples;
C2: and combining a plurality of questions corresponding to the same answer into positive sample pairs in pairs, and training to obtain a sentence vector reasoning model by adopting the positive sample pairs and the negative sample fine tuning base model through a comparison learning method.
In operation, fine-tuning the sentence vector reasoning model on a suitably constructed training set improves its fit to the text data to be classified, so that the generated semantic vector representations are more accurate and more discriminative, making positive and negative samples easy to distinguish.
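The pair construction of steps C1–C2 can be sketched as follows, assuming question-answer data given as (question, answer) tuples and simplifying "same kind of answer" to exact answer equality:

```python
from itertools import combinations

def build_training_pairs(qa_data):
    """Group questions by their answer (step C1), then combine the questions
    sharing an answer pairwise into positive sample pairs (step C2).
    Questions grouped under different answers serve as negatives for one
    another during contrastive fine-tuning."""
    by_answer = {}
    for question, answer in qa_data:
        by_answer.setdefault(answer, []).append(question)
    positive_pairs = []
    for questions in by_answer.values():
        positive_pairs.extend(combinations(questions, 2))
    return by_answer, positive_pairs

qa = [
    ("is this cream allergenic", "A1"),
    ("will the cream cause allergies", "A1"),
    ("can sensitive skin use this cream", "A1"),
    ("how do I store the cream", "A2"),
]
groups, pos = build_training_pairs(qa)
print(len(pos))  # 3 questions share answer A1 -> C(3,2) = 3 positive pairs
```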
In this embodiment, the base model is preferably the ERNIE bidirectional semantic representation model. Before the base model is fine-tuned, the training data are converted into inputs the model can accept by word segmentation, adding special start and end markers, and performing the necessary padding or truncation. Let a selected positive sample pair be $(q_i, q_i^+)$ and the selected negative sample be $q_i^-$; their ERNIE vector encodings are recorded as $u_i$, $u_i^+$ and $u_i^-$. The similarity function is defined as the cosine similarity, namely:

$$\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

The triplet loss function in the training process can be defined as:

$$L_i = \max\bigl(0,\ \operatorname{sim}(u_i, u_i^-) - \operatorname{sim}(u_i, u_i^+) + m\bigr)$$

wherein $m$ is a margin whose role is to pull apart the similarity of the positive pair from the similarity of the negative pair; its value can be decayed dynamically as the number of training iterations increases.

In the fine-tuning optimization process, the model parameters $\theta$ are sought that minimize the sum of the loss functions over all sample pairs, namely:

$$\theta^{*} = \arg\min_{\theta} \sum_i L_i(\theta)$$

wherein $\theta$ denotes the parameters of the ERNIE model.
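A minimal numeric sketch of the cosine similarity and triplet loss used above, with toy two-dimensional vectors standing in for ERNIE sentence encodings and an arbitrary margin value:

```python
import math

def cosine_sim(u, v):
    # sim(u, v) = u.v / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # L = max(0, sim(anchor, negative) - sim(anchor, positive) + margin):
    # the loss is zero once the positive pair is more similar than the
    # negative pair by at least the margin. In the embodiment the margin
    # can be decayed as training iterations increase.
    return max(0.0, cosine_sim(anchor, negative)
                    - cosine_sim(anchor, positive) + margin)

# A well-separated triplet incurs no loss.
anchor, positive, negative = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(triplet_loss(anchor, positive, negative))  # -> 0.0
```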
Embodiment III:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further includes the step of creating a vector index library:
D1: generating semantic vector representation for the questions in the question-answering data by adopting a sentence vector reasoning model;
D2: adopting a hash algorithm to generate a corresponding identifier for the question in each question-answer datum;
D3: vector representations of a number of questions are stored as a vector index library in one-to-one correspondence with corresponding identifiers.
In operation, establishing an efficient, quickly searchable vector index library improves recall efficiency and training efficiency, further shortening the training period.
In a specific implementation, the hash algorithm is preferably the SHA-256 hash algorithm, and the FAISS library can be used as the framework to build the vector index library; of course, any other existing hash algorithm or library may be adopted.
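Steps D1–D3 can be sketched as follows. This is a minimal in-memory stand-in: SHA-256 identifiers as in the embodiment, but a brute-force cosine search in place of FAISS, and toy vectors in place of the sentence vector reasoning model's output.

```python
import hashlib
import math

def question_id(question):
    # Step D2: a SHA-256 digest of the question text serves as its identifier.
    return hashlib.sha256(question.encode("utf-8")).hexdigest()

class VectorIndex:
    """Minimal stand-in for the vector index library; the embodiment prefers
    FAISS, which would replace the brute-force search below at scale."""
    def __init__(self):
        self.entries = {}  # identifier -> (question, vector), step D3

    def add(self, question, vector):
        self.entries[question_id(question)] = (question, vector)

    def search(self, query_vec, top_k=1):
        def cos(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            return dot / (math.sqrt(sum(a * a for a in u))
                          * math.sqrt(sum(b * b for b in v)))
        ranked = sorted(self.entries.values(),
                        key=lambda e: cos(query_vec, e[1]), reverse=True)
        return [q for q, _ in ranked[:top_k]]

index = VectorIndex()
index.add("is this cream allergenic", [1.0, 0.0])
index.add("how do I store the cream", [0.0, 1.0])
print(index.search([0.9, 0.1]))  # -> ['is this cream allergenic']
```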
Embodiment four:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further includes the step of establishing a keyword lexicon:
E1: performing word segmentation on the labeling data and the question-answer data with a word segmentation tool, obtaining a plurality of words and storing them as a data set;
E2: analyzing the data set with the TF-IDF algorithm and assigning a weight to each word;
E3: setting words whose weight exceeds a preset keyword threshold as keywords;
E4: training on the data set with a word embedding technology to obtain a vector representation of each word;
E5: expanding the keywords by screening semantically matched words based on the similarity of their vector representations, and storing the result as the keyword lexicon.
In operation, establishing the keyword lexicon realizes word segmentation of the training data and keyword extraction, and provides semantic clues to the large language model when it generates regular expressions.
In a specific implementation, the word segmentation tool is preferably jieba and the word embedding technology is preferably FastText or GloVe; any existing word segmentation tool or word embedding technology can be selected according to actual requirements.
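Steps E1–E3 can be sketched with a plain TF-IDF computation. The documents below are assumed to be already word-segmented (a tool such as jieba would perform this for Chinese text), and the threshold value is an arbitrary illustration, not the preset keyword threshold of the embodiment.

```python
import math
from collections import Counter

def tfidf_keywords(documents, threshold=0.1):
    """Steps E1-E3 in miniature: documents are lists of segmented words;
    a word becomes a keyword if its TF-IDF weight in any document exceeds
    the threshold. Words appearing in every document get IDF 0 and are
    therefore never keywords."""
    n_docs = len(documents)
    # Document frequency: number of documents each word appears in.
    df = Counter(w for doc in documents for w in set(doc))
    keywords = set()
    for doc in documents:
        tf = Counter(doc)
        for word, count in tf.items():
            weight = (count / len(doc)) * math.log(n_docs / df[word])
            if weight > threshold:
                keywords.add(word)
    return keywords

docs = [
    ["cream", "allergy", "skin"],
    ["cream", "storage", "shelf"],
    ["cream", "dry", "skin"],
]
print(tfidf_keywords(docs))  # "cream" appears everywhere, so it is excluded
```

Steps E4–E5 would then embed each word (e.g. with FastText) and expand the keyword set by vector similarity.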
Fifth embodiment:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
In the step S3, when the large language model is adopted to judge the semantic integrity of the text data, the large language model judges the semantic integrity of the text data to be classified using a preset small-sample learning corpus and chain-of-thought prompt words.
In operation, inputting the small-sample learning corpus and the chain-of-thought prompt words to the large language model fully exerts its chain-of-thought capability, so that the model comprehensively evaluates the semantic integrity of the text data with high accuracy.
In a specific implementation, the chain-of-thought prompt words may describe objects, parts, attributes and the like; for example, the following chain-of-thought prompt words can be adopted to judge semantic integrity:
"Example sentence 1: is the face cream allergenic;
chain-of-thought prompt words: the key point of this sentence is that the customer asks whether the face cream causes allergy; the object is the face cream and the attribute queried is allergenicity, so the customer's query intention is completely expressed;
conclusion: complete.
Example sentence 2: my skin is too dry;
chain-of-thought prompt words: the key point of this sentence is that the customer states his or her own skin condition; the object described is the customer, the part is the skin, and the attribute is dryness, so the customer's query intention is completely expressed;
conclusion: complete.
Example sentence 3: I dislike the feeling of wetness on my face;
chain-of-thought prompt words: the key point of this sentence is that the customer states a personal preference; the object described is the customer, the part is the face, the attribute is wetness, and the emotion is dislike, so the customer's query intention is completely expressed;
conclusion: complete.
Example sentence 4: what is meant by instant? Can it be used;
chain-of-thought prompt words: the key point of this sentence is that the customer asks about the usability of a product, but the description of the product is unclear and the purpose of the query is unclear; the emotion is negative, the expression may be incomplete, or the customer may be asking about the function and efficacy of the product, so the user's query intention at this moment needs to be analyzed in combination with the context;
conclusion: incomplete.
Example sentence 5: none of yours work;
chain-of-thought prompt words: the key point of this sentence is that the customer makes a negative statement about the product, but the description of the product is unclear; the emotion is negative, the expression may be incomplete, or the customer may be asking about the function and efficacy of the product, so the user's query intention at this moment needs to be analyzed in combination with the context;
conclusion: incomplete.
Combining the semantic integrity reasoning process and the final conclusions of the above example sentences, reason in detail about the semantic integrity of the following sentence and give a conclusion, complete or incomplete: text data."
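The assembly of such a chain-of-thought prompt can be sketched as follows; the few-shot examples and their wording here are condensed hypothetical stand-ins for the preset small-sample learning corpus.

```python
# Hypothetical few-shot examples; in the embodiment these would come from the
# preset small-sample learning corpus with chain-of-thought annotations.
EXAMPLES = [
    ("my skin is too dry",
     "the customer states a skin condition: object customer, part skin, "
     "attribute dryness; the query intention is fully expressed",
     "complete"),
    ("can it be used",
     "the product description is unclear and the query purpose is unclear; "
     "context is needed to analyse the intention",
     "incomplete"),
]

def build_integrity_prompt(text_data):
    """Format the few-shot examples and the sentence to be judged into a
    single chain-of-thought prompt for the large language model."""
    parts = []
    for i, (sentence, reasoning, verdict) in enumerate(EXAMPLES, 1):
        parts.append(f"Example sentence {i}: {sentence};\n"
                     f"chain-of-thought prompt words: {reasoning};\n"
                     f"conclusion: {verdict}.")
    parts.append("Combining the reasoning process and conclusions above, "
                 "reason in detail about the semantic integrity of the "
                 "following sentence and give a conclusion, complete or "
                 f"incomplete: {text_data}")
    return "\n".join(parts)

prompt = build_integrity_prompt("is this face cream allergenic")
print(prompt.splitlines()[0])  # -> Example sentence 1: my skin is too dry;
```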
The beneficial technical effects of the invention include: the text classification method can quickly and accurately classify text data using the regular expression set generated by the large language model; meanwhile, incorrectly matched text classification labels can be filtered out according to the white and black regular expressions matched against the text data, further improving classification accuracy.
While the invention has been described in terms of embodiments, it will be appreciated by those skilled in the art that the invention is not limited thereto but rather includes the drawings and the description of the embodiments above. Any modifications which do not depart from the functional and structural principles of the present invention are intended to be included within the scope of the appended claims.
Claims (9)
1. A text classification method of regular expressions generated based on a large language model is characterized by comprising the following steps:
s1: initializing a text classification method, defining text classification labels, generating a regular expression set comprising regular expressions of each of a plurality of text classification labels by adopting a large language model, setting the audited regular expressions as white regular expressions, and then generating corresponding black regular expressions by adopting the large language model according to the white regular expressions;
S2: acquiring text data to be classified;
s3: judging the semantic integrity of the text data by adopting a large language model, and filtering text data with incomplete semantics;
S4: matching the text data according to a plurality of white regular expressions in the regular expression set, filtering text classification labels which are not matched with the text data according to a plurality of black regular expressions in the regular expression set, and adding the matched text classification labels to the text data, wherein the following steps are adopted:
F1: traversing the white regular expressions of all text classification labels to match with the text data to be classified, turning to step F2 when the matched white regular expressions exist in the text data, and ending the text classification when the matched white regular expressions do not exist in the text data;
F2: traversing all black regular expressions of the text classification labels matched with the text data; when no black regular expression matches and the text data matches the white regular expressions of only one text classification label, labeling the text data with that text classification label; when no black regular expression matches and the text data matches white regular expressions of a plurality of text classification labels, turning to step F3; and when a matched black regular expression exists, turning to step F4;
F3: generating semantic vector representation for the text data by adopting a preset sentence vector reasoning model, recalling matched annotation data in a semantic space according to the vector representation of the text data, selecting a plurality of annotation data with text classification labels, and then acquiring the mode of the text classification labels and setting the mode as the text classification label of the text data;
F4: after filtering text classification labels with matched black regular expressions, when a plurality of white regular expressions matched with a plurality of text classification labels exist, turning to step F3, when the text classification labels are matched with the white regular expressions of one text classification label, marking the text data by adopting the text classification labels, and when the matched text classification labels do not exist, ending the text classification.
2. The text classification method of regular expressions generated based on a large language model according to claim 1, further comprising the steps of obtaining labeling data with text classification labels and question-answer data without text classification labels, wherein each question-answer data comprises questions and answers, screening the question-answer data through a plurality of preset preprocessing rules, and filtering the question-answer data irrelevant to the text classification.
3. The method for classifying text according to claim 2, wherein in the step S1, when a large language model is used to generate a regular expression set including regular expressions of each of a plurality of text classification labels, the following steps are adopted:
A1: generating semantic vector representation for the annotation data by adopting a preset sentence vector reasoning model;
A2: recalling matched question-answer data in a semantic space in a preset vector index library according to the vector representation of each annotation data;
A3: inputting the recalled question-answer data into a large language model for secondary classification judgment, filtering the question-answer data which is not matched with the semantics of the corresponding text classification label, marking the question-answer data which is matched with the semantics of the corresponding text classification label, and setting the marked data;
A4: classifying the labeling data belonging to the same text classification tag through a preset keyword word library, generating a syntax tree for the labeling data by adopting a syntax analysis tool, capturing the syntax information and semantic information of the labeling data through the syntax tree of each labeling data, and storing the syntax information and semantic information in one-to-one correspondence with the labeling data;
A5: and inputting a plurality of marking data belonging to the same keyword, respective syntax information of the plurality of marking data and a preset regular expression paradigm into the large language model, generating a plurality of regular expressions, and storing the plurality of regular expressions into a regular expression set of text classification labels corresponding to the keyword.
4. The method for classifying text based on regular expressions generated by a large language model according to claim 3, wherein in the step S1, when the audited regular expressions are set to be white regular expressions, and then corresponding black regular expressions are generated by using the large language model according to the white regular expressions, the following steps are adopted:
B1: the generated regular expressions are set to be white regular expressions after passing the auditing and are stored in a regular expression set of the text classification label;
B2: carrying out regular expression matching by adopting a white regular expression of a text classification label and question-answer data without the text classification label;
B3: inputting the matched question-answer data into the large language model for a secondary classification judgment, screening out the question-answer data whose semantics do not match the corresponding text classification label, acquiring the syntax information of that question-answer data, inputting the syntax information together with a preset regular expression paradigm into the large language model to generate a plurality of black regular expressions, and storing the black regular expressions into the regular expression set of the text classification label.
5. A method for text classification of regular expressions generated based on a large language model as claimed in claim 3, wherein the training of the sentence vector inference model comprises the steps of:
C1: setting questions in the question-answer data that have the same answer or the same kind of answer as positive samples, and setting questions that have different answers or belong to different kinds as negative samples;
C2: combining the plurality of questions corresponding to the same answer pairwise into positive sample pairs, and fine-tuning a base model with the positive sample pairs and the negative samples through contrastive learning to obtain the sentence vector reasoning model.
6. A method for classifying text based on regular expressions generated by a large language model as claimed in claim 3, wherein when the vector index library is built, the following steps are adopted:
D1: generating semantic vector representation for the questions in the question-answering data by adopting a sentence vector reasoning model;
D2: adopting a hash algorithm to generate a corresponding identifier for the question in each question-answer datum;
D3: vector representations of a number of questions are stored as a vector index library in one-to-one correspondence with corresponding identifiers.
7. The text classification method of regular expressions generated based on a large language model as claimed in claim 3, wherein when the keyword lexicon is built, the following steps are adopted:
E1: performing word segmentation on the labeling data and the question-answer data with a word segmentation tool, obtaining a plurality of words and storing them as a data set;
E2: analyzing the data set with the TF-IDF algorithm and assigning a weight to each word;
E3: setting words whose weight exceeds a preset keyword threshold as keywords;
E4: training on the data set with a word embedding technology to obtain a vector representation of each word;
E5: expanding the keywords by screening semantically matched words based on the similarity of their vector representations, and storing the result as the keyword lexicon.
8. The text classification method of regular expressions generated based on large language models according to claim 1, wherein in the step S2, after obtaining text data to be classified, further comprising a step of data preprocessing, filtering the text data to be classified by a plurality of preset preprocessing rules, and filtering text data irrelevant to the current text classification.
9. The method for classifying text based on regular expressions generated by a large language model according to claim 1, wherein in the step S3, when the large language model is used to determine the semantic integrity of the text data, a preset small-sample learning corpus and preset chain-of-thought prompt words are input to the large language model, and the large language model determines the semantic integrity of the text data to be classified using the chain-of-thought prompt words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410034646.5A CN117556049B (en) | 2024-01-10 | 2024-01-10 | Text classification method of regular expression generated based on large language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117556049A CN117556049A (en) | 2024-02-13 |
CN117556049B true CN117556049B (en) | 2024-05-17 |
Family
ID=89820826
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410034646.5A Active CN117556049B (en) | 2024-01-10 | 2024-01-10 | Text classification method of regular expression generated based on large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117556049B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217717A (en) * | 2013-05-29 | 2014-12-17 | 腾讯科技(深圳)有限公司 | Language model constructing method and device |
CN106446230A (en) * | 2016-10-08 | 2017-02-22 | 国云科技股份有限公司 | Method for optimizing word classification in machine learning text |
CN108182234A (en) * | 2017-12-27 | 2018-06-19 | 中科鼎富(北京)科技发展有限公司 | Regular expression screening technique and device |
CN112364660A (en) * | 2020-10-27 | 2021-02-12 | 中国平安人寿保险股份有限公司 | Corpus text processing method and device, computer equipment and storage medium |
CN113111234A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm condition category determination method and device |
CN113761903A (en) * | 2020-06-05 | 2021-12-07 | 国家计算机网络与信息安全管理中心 | Text screening method for high-volume high-noise spoken short text |
CN114595332A (en) * | 2022-03-30 | 2022-06-07 | 阳光保险集团股份有限公司 | Text classification prediction method and device and electronic equipment |
CN114818891A (en) * | 2022-04-14 | 2022-07-29 | 人民网股份有限公司 | Small sample multi-label text classification model training method and text classification method |
CN116561311A (en) * | 2023-04-21 | 2023-08-08 | 武汉大学 | Automatic classification method for quotation text based on large language model |
US11748577B1 (en) * | 2022-08-22 | 2023-09-05 | Rohirrim, Inc. | Computer-generated content based on text classification, semantic relevance, and activation of deep learning large language models |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017039603A1 (en) * | 2015-08-31 | 2017-03-09 | Hewlett Packard Enterprise Development Lp | Domain classification |
CN112487182B (en) * | 2019-09-12 | 2024-04-12 | 华为技术有限公司 | Training method of text processing model, text processing method and device |
US20230419037A1 (en) * | 2022-06-24 | 2023-12-28 | Salesforce, Inc. | Systems and methods for text classification using label modular prompts |
2024-01-10: application CN202410034646.5A granted as patent CN117556049B (status: active)
Non-Patent Citations (4)
Title |
---|
Enabling Digital Transformation through Business Text Classification with Small Datasets; Muhammad Arslan et al.; 2023 15th International Conference on Innovations in Information Technology (IIT); 2023-12-25; pp. 38-42 *
Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs; Pena, A. et al.; arXiv; 2023-09-01; pp. 1-12 *
Research on Pre-trained Language Models Based on Medical Big Data and Their Application to Medical Text Classification; Huang Minting et al.; Chinese Journal of Medical Library and Information Science; 2020-12-31; pp. 39-46 *
Research on Classification and Filtering Methods for Large-Scale Short Texts; Wu Wei; China Master's Theses Full-text Database (Information Science and Technology); 2007-05-15; pp. I138-1565 *
Also Published As
Publication number | Publication date |
---|---|
CN117556049A (en) | 2024-02-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||