CN117556049B - Text classification method of regular expression generated based on large language model - Google Patents
- Publication number
- CN117556049B CN117556049B CN202410034646.5A CN202410034646A CN117556049B CN 117556049 B CN117556049 B CN 117556049B CN 202410034646 A CN202410034646 A CN 202410034646A CN 117556049 B CN117556049 B CN 117556049B
- Authority
- CN
- China
- Prior art keywords
- data
- text
- text classification
- regular expressions
- matched
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of text classification, and in particular to a text classification method based on regular expressions generated by a large language model, comprising the following steps. S1: initialize the text classification method, define text classification labels, and use a large language model to generate a regular expression set containing white regular expressions and black regular expressions for a plurality of classification labels. S2: acquire the text data to be classified. S3: use the large language model to judge the semantic integrity of the text data, and filter out semantically incomplete text data. S4: match the text data against the white regular expressions in the regular expression set, use the black regular expressions in the set to filter out classification labels that do not match the text data, and then attach the matched text classification labels to the text data. The invention classifies text data with a regular expression set and achieves high classification accuracy.
Description
Technical Field
The invention relates to the technical field of text classification, and in particular to a text classification method based on regular expressions generated by a large language model.
Background
With the rapid development of big data and machine learning, text classification has become an important research direction in natural language processing. Traditional text classification methods mainly rely on model training and keyword extraction, which generally require massive training data to ensure classification accuracy, such as the classification method for Chinese ultra-long text based on a large language model disclosed in Chinese patent publication CN116821348A.
Disclosure of Invention
The technical problem the invention aims to solve is that existing text classification methods are strongly data-dependent, and classifying by extracted keywords yields low accuracy.
To solve this problem, the invention adopts the following technical scheme: a text classification method based on regular expressions generated by a large language model, comprising the following steps:
S1: initialize the text classification method, define text classification labels, use a large language model to generate a regular expression set containing regular expressions for each of a plurality of text classification labels, set the audited regular expressions as white regular expressions, and then use the large language model to generate corresponding black regular expressions from the white regular expressions;
S2: acquire the text data to be classified;
S3: use the large language model to judge the semantic integrity of the text data, and filter out semantically incomplete text data;
S4: match the text data against the white regular expressions in the regular expression set, filter out text classification labels that do not match the text data using the black regular expressions in the set, and then attach the matched text classification labels to the text data.
In operation, the regular expression set generated by the large language model completes the classification of text data quickly and accurately; matching the text data against both white and black regular expressions filters out incorrectly matched text classification labels and further improves accuracy.
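The S4 matching logic, in which white regular expressions propose labels and black regular expressions veto false matches, can be sketched as follows. This is a minimal illustration: the label names and patterns are invented stand-ins for the LLM-generated expressions, not taken from the patent.

```python
import re

# Hypothetical label -> pattern sets; labels and patterns are illustrative
# stand-ins for the expressions a large language model would generate.
WHITE_REGEX = {
    "refund": [r"refund|money back", r"return.*(item|order)"],
    "shipping": [r"\b(ship|deliver)", r"track.*package"],
}
BLACK_REGEX = {
    "refund": [r"no refund needed"],
    "shipping": [],
}

def classify(text):
    """White regexes propose labels; black regexes veto false matches (S4)."""
    labels = []
    for label, patterns in WHITE_REGEX.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            if not any(re.search(b, text, re.IGNORECASE)
                       for b in BLACK_REGEX[label]):
                labels.append(label)
    return labels
```

A black regular expression only removes a label that a white regular expression already proposed, which is why the veto check sits inside the white-match branch.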
Preferably, the method further comprises obtaining annotation data carrying text classification labels and question-answer data without labels, where each question-answer record contains a question and an answer; the question-answer data is screened with several preset preprocessing rules to filter out records irrelevant to the classification task.
In operation, filtering out irrelevant question-answer data speeds up training and shortens the training cycle.
Preferably, in step S1, the following steps are used when generating with the large language model a regular expression set containing regular expressions for each of a plurality of text classification labels:
A1: generate semantic vector representations for the annotation data with a preset sentence vector inference model;
A2: for each annotation record, recall matching question-answer data in the semantic space of a preset vector index library;
A3: feed the recalled question-answer data into the large language model for secondary classification judgment, filter out records that do not match the semantics of the corresponding text classification label, and mark the matching records as annotation data;
A4: group annotation data belonging to the same text classification label through a preset keyword lexicon, generate a syntax tree for each annotation record with a parsing tool, capture its syntactic and semantic information through the syntax tree, and store this information in one-to-one correspondence with the annotation data;
A5: input several annotation records sharing the same keyword, their syntactic information, and a preset regular expression paradigm into the large language model, generate several regular expressions, and store them in the regular expression set of the text classification label corresponding to the keyword.
In operation, this fully exploits the few-shot learning ability of the large language model for regular expression generation: guided by the regular expression paradigm, the model quickly and efficiently generates regular expressions for specific texts, converting its semantic understanding into executable patterns with a high degree of freedom and high classification accuracy.
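The A5 prompt assembly can be sketched as a small helper. This is a hypothetical illustration: the function name, arguments, and prompt wording are invented here, not the patent's actual template.

```python
def build_regex_prompt(examples, keyword, paradigm):
    """Assemble a few-shot prompt for step A5: annotation records with their
    syntactic information, a shared keyword, and a regex paradigm.
    All names and the prompt wording are illustrative assumptions."""
    shots = "\n".join(f"annotation: {text}; syntax: {syntax}"
                      for text, syntax in examples)
    return (f"Regular expression paradigm: {paradigm}\n"
            f"Keyword: {keyword}\n"
            "Generate one correct regular expression per annotation below, "
            "returning only the expressions:\n" + shots)
```

The returned string would then be sent to the large language model; the model call itself is omitted since the patent does not fix a particular API.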
Preferably, in step S1, the following steps are used when setting the audited regular expressions as white regular expressions and then generating corresponding black regular expressions with the large language model:
B1: after the generated regular expressions pass the audit, set them as white regular expressions and store them in the regular expression set of the text classification label;
B2: match the white regular expressions of a text classification label against question-answer data that does not carry that label;
B3: feed the matched question-answer data into the large language model for secondary classification judgment, screen out records whose semantics do not match the corresponding text classification label, input their syntactic information together with the preset regular expression paradigm into the large language model, generate several black regular expressions, and store them in the regular expression set of the text classification label.
In operation, auditing the regular expressions improves the robustness of the white regular expressions and the classification accuracy; generating black regular expressions through the model's secondary classification judgment filters out incorrectly matched text classification labels and further improves accuracy.
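Steps B2 and B3 amount to a false-positive harvest: white regular expressions over-match unlabeled question-answer data, and a secondary judgment flags the mismatches that seed the black regular expressions. A minimal sketch, with the LLM judgment stubbed as a callable:

```python
import re

def find_false_positives(white_patterns, questions, judge):
    """Return questions that a white regex matches (B2) but the secondary
    classification judgment rejects (B3); `judge` stands in for the LLM call."""
    hits = [q for q in questions
            if any(re.search(p, q) for p in white_patterns)]
    return [q for q in hits if not judge(q)]
```

Each returned question, together with its syntactic information, would then be fed back to the model to generate a more restrictive black regular expression.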
Preferably, the sentence vector inference model is trained with the following steps:
C1: set questions sharing the same answer, or the same kind of answer, as positive samples, and set questions with different answers or of different kinds as negative samples;
C2: combine questions sharing the same answer pairwise into positive sample pairs, and fine-tune a base model with the positive pairs and negative samples through contrastive learning to obtain the sentence vector inference model.
In operation, fine-tuning the sentence vector inference model on a suitable training set improves its fit to the text data to be classified; the semantic vector representations it generates are more accurate and more discriminative, making positive and negative samples easy to separate.
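The C1-C2 pair construction, before any fine-tuning takes place, can be sketched as follows; grouping by exact answer stands in for the "same kind of answer" criterion, which the patent leaves to the implementer.

```python
from itertools import combinations

def build_pairs(qa_pairs):
    """Group questions by answer: same-answer questions form positive pairs,
    cross-answer questions form negative pairs (C1-C2 sketch)."""
    by_answer = {}
    for question, answer in qa_pairs:
        by_answer.setdefault(answer, []).append(question)
    positives = [pair for qs in by_answer.values()
                 for pair in combinations(qs, 2)]
    negatives = [(q1, q2)
                 for a1, a2 in combinations(list(by_answer), 2)
                 for q1 in by_answer[a1] for q2 in by_answer[a2]]
    return positives, negatives
```

The resulting pairs would feed a contrastive objective (e.g. an InfoNCE-style loss) when fine-tuning the base model.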
Preferably, the vector index library is established with the following steps:
D1: generate semantic vector representations for the questions in the question-answer data with the sentence vector inference model;
D2: generate a corresponding identifier for the question in each question-answer record with a hash algorithm;
D3: store the question vectors in one-to-one correspondence with their identifiers as the vector index library.
In operation, an efficient, quickly searchable vector index library improves recall efficiency and training efficiency, further shortening the training cycle.
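Steps D1-D3 can be sketched as follows; `embed` stands in for the sentence vector inference model, and SHA-256 with a 16-character prefix is an illustrative choice of hash identifier, since the patent does not name a specific algorithm.

```python
import hashlib

def build_vector_index(questions, embed):
    """D1-D3 sketch: map a stable hash identifier to each question's vector.
    `embed` stands in for the sentence vector inference model."""
    index = {}
    for q in questions:
        qid = hashlib.sha256(q.encode("utf-8")).hexdigest()[:16]  # D2
        index[qid] = {"question": q, "vector": embed(q)}          # D1 + D3
    return index
```

A production system would back this dictionary with an approximate-nearest-neighbour index for the semantic recall in step A2.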
Preferably, the keyword lexicon is established with the following steps:
E1: segment the annotation data and question-answer data with a word segmentation tool, and store the resulting words as a data set;
E2: analyze the data set with the TF-IDF algorithm and assign a weight to each word;
E3: set words whose weight exceeds a preset keyword threshold as keywords;
E4: train word embeddings on the data set to obtain a vector representation of each word;
E5: expand the keywords by screening semantically matched words based on vector similarity, and store the result as the keyword lexicon.
In operation, this segments the training data and extracts keywords; the keyword lexicon provides semantic cues to the large language model when generating regular expressions.
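Steps E1-E3 can be sketched with a plain TF-IDF pass. Whitespace tokenization stands in for the word segmentation tool, and the threshold value and smoothing are illustrative choices, not the patent's.

```python
import math
from collections import Counter

def tfidf_keywords(docs, threshold=0.1):
    """E1-E3 sketch: score tokens by TF-IDF over whitespace-tokenized docs
    and keep tokens scoring above the (illustrative) threshold."""
    tokenized = [d.split() for d in docs]                      # E1 stand-in
    df = Counter(w for doc in tokenized for w in set(doc))
    n = len(docs)
    keywords = set()
    for doc in tokenized:
        tf = Counter(doc)
        for w, count in tf.items():
            # Smoothed IDF to avoid division by zero and log(0).
            score = (count / len(doc)) * math.log((n + 1) / (df[w] + 1) + 1)
            if score > threshold:                              # E3
                keywords.add(w)
    return keywords
```

For Chinese text, a segmentation tool (e.g. a jieba-style tokenizer) would replace `split()`, and step E5 would expand the result with embedding neighbours.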
Preferably, in step S2, a data preprocessing step follows the acquisition of the text data to be classified: the text data is screened with several preset preprocessing rules to filter out data irrelevant to the current classification task.
Preferably, in step S3, when judging semantic integrity, the large language model uses a preset few-shot learning corpus and preset chain-of-thought prompts to evaluate the semantic integrity of the text data to be classified.
In operation, feeding the few-shot corpus and chain-of-thought prompts to the large language model fully exercises its chain-of-thought ability, allowing it to comprehensively evaluate the semantic integrity of text data with high accuracy.
Preferably, in step S4, the following steps are used when matching the text data against the white regular expressions, filtering out non-matching text classification labels with the black regular expressions, and attaching the matched labels:
F1: traverse the white regular expressions of all text classification labels and match them against the text data to be classified; if any white regular expression matches, go to step F2; if none matches, end the classification;
F2: traverse the black regular expressions of the text classification labels matched by the text data; if no black regular expression matches and the text data matches white regular expressions of only one label, mark the text data with that label; if no black regular expression matches but the text data matches white regular expressions of several labels, go to step F3; if a black regular expression matches, go to step F4;
F3: generate a semantic vector representation of the text data with the preset sentence vector inference model, recall matching annotation data in the semantic space, select several annotation records carrying text classification labels, then take the mode of those labels and set it as the text classification label of the text data;
F4: after filtering out the labels hit by black regular expressions, if white regular expressions of several labels still match, go to step F3; if only one label's white regular expressions match, mark the text data with that label; if no matched label remains, end the classification.
The beneficial technical effects of the invention include:
1. The regular expression set generated by the large language model completes text classification quickly and accurately; matching text data against both white and black regular expressions filters out incorrectly matched text classification labels and further improves accuracy.
2. The invention fully exploits the few-shot learning ability of the large language model for regular expression generation: guided by the regular expression paradigm, the model quickly and efficiently generates regular expressions for specific texts, converting its semantic understanding into executable patterns with a high degree of freedom and high classification accuracy.
3. Auditing the regular expressions improves the robustness of the white regular expressions and the classification accuracy; generating black regular expressions through the model's secondary classification judgment filters out incorrectly matched text classification labels and further improves accuracy.
4. Fine-tuning the sentence vector inference model on a suitable training set improves its fit to the text data to be classified; the semantic vector representations it generates are more accurate and more discriminative, making positive and negative samples easy to separate.
5. Feeding the few-shot learning corpus and chain-of-thought prompts to the large language model fully exercises its chain-of-thought ability, allowing it to comprehensively evaluate the semantic integrity of text data with high accuracy.
Other features and advantages of the present invention will be disclosed in the following detailed description of the invention and the accompanying drawings.
Drawings
The invention is further described with reference to the accompanying drawings:
FIG. 1 is a flow chart of a text classification method based on regular expressions generated by a large language model;
FIG. 2 is a flow diagram of a large language model generating a regular expression set;
FIG. 3 is a flow chart of a large language model generating a white regular expression and a black regular expression;
FIG. 4 is a text classification flow chart according to the first embodiment.
Detailed Description
The technical solutions of the embodiments of the invention are explained and illustrated below with reference to the drawings; the following embodiments are only preferred embodiments of the invention, not all of them. Based on these embodiments, all other examples obtained by those skilled in the art without inventive effort fall within the scope of the invention.
Embodiment one:
Referring to FIG. 1, this embodiment discloses a text classification method based on regular expressions generated by a large language model, comprising the following steps:
S1: initialize the text classification method, define text classification labels, use a large language model to generate a regular expression set containing regular expressions for each of a plurality of text classification labels, set the audited regular expressions as white regular expressions, and then use the large language model to generate corresponding black regular expressions from the white regular expressions;
S2: acquire the text data to be classified;
S3: use the large language model to judge the semantic integrity of the text data, and filter out semantically incomplete text data;
S4: match the text data against the white regular expressions in the regular expression set, filter out text classification labels that do not match the text data using the black regular expressions in the set, and then attach the matched text classification labels to the text data.
In operation, the regular expression set generated by the large language model completes the classification of text data quickly and accurately; matching the text data against both white and black regular expressions filters out incorrectly matched text classification labels and further improves accuracy.
Preferably, the method further comprises obtaining annotation data carrying text classification labels and question-answer data without labels, where each question-answer record contains a question and an answer; the question-answer data is screened with several preset preprocessing rules to filter out records irrelevant to the classification task.
In operation, filtering out irrelevant question-answer data speeds up training and shortens the training cycle.
Preferably, in step S2, a data preprocessing step follows the acquisition of the text data to be classified: the text data is screened with several preset preprocessing rules to filter out data irrelevant to the current classification task.
In a specific implementation, the preset preprocessing rules may be defined manually. In this embodiment, the following rules filter question-answer data irrelevant to the current classification: filter purely numeric data, links, and non-Chinese data; filter long text longer than 32 characters and short text shorter than 4 characters; filter invalid data such as order numbers, addresses, and system messages; and, using an existing set of common e-commerce questions, filter general e-commerce questions about delivery, ordering, small talk, and the like.
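These example rules can be sketched as a single filter function. This is a minimal sketch: the character-length bounds follow the embodiment, while the link pattern and the Chinese-character check are illustrative implementations.

```python
import re

def keep_for_classification(text):
    """Apply the embodiment's example preprocessing rules: drop pure digits,
    links, texts without Chinese characters, and out-of-range lengths."""
    if text.isdigit():                            # purely numeric data
        return False
    if re.search(r"https?://", text):             # links
        return False
    if not re.search(r"[\u4e00-\u9fff]", text):   # non-Chinese data
        return False
    return 4 <= len(text) <= 32                   # length bounds
```

Order numbers, addresses, system messages, and the general e-commerce question set would be handled by further rules of the same shape.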
Referring to FIG. 2, preferably, in step S1, the following steps are used when generating with the large language model a regular expression set containing regular expressions for each of a plurality of text classification labels:
A1: generate semantic vector representations for the annotation data with a preset sentence vector inference model;
A2: for each annotation record, recall matching question-answer data in the semantic space of a preset vector index library;
A3: feed the recalled question-answer data into the large language model for secondary classification judgment, filter out records that do not match the semantics of the corresponding text classification label, and mark the matching records as annotation data;
A4: group annotation data belonging to the same text classification label through a preset keyword lexicon, generate a syntax tree for each annotation record with a parsing tool, capture its syntactic and semantic information through the syntax tree, and store this information in one-to-one correspondence with the annotation data;
A5: input several annotation records sharing the same keyword, their syntactic information, and a preset regular expression paradigm into the large language model, generate several regular expressions, and store them in the regular expression set of the text classification label corresponding to the keyword.
In operation, this fully exploits the few-shot learning ability of the large language model for regular expression generation: guided by the regular expression paradigm, the model quickly and efficiently generates regular expressions for specific texts, converting its semantic understanding into executable patterns with a high degree of freedom and high classification accuracy.
In a specific implementation, the following prompt may be input into the large language model to generate regular expressions:
"Annotation data 1, regular expression 1;
annotation data 2, regular expression 2;
annotation data 3, regular expression 3.
Referring to the regular expressions of the annotation data above, and combining the keyword (keyword 1) with the syntactic information of each annotation record, generate several correct regular expressions for the following batch of annotation data, and return only the regular expressions:
annotation data to generate 1, syntactic information 1;
annotation data to generate 2, syntactic information 2;
annotation data to generate 3, syntactic information 3."
Referring to FIG. 3, preferably, in step S1, the following steps are used when setting the audited regular expressions as white regular expressions and then generating corresponding black regular expressions with the large language model:
B1: after the generated regular expressions pass the audit, set them as white regular expressions and store them in the regular expression set of the text classification label;
B2: match the white regular expressions of a text classification label against question-answer data that does not carry that label;
B3: feed the matched question-answer data into the large language model for secondary classification judgment, screen out records whose semantics do not match the corresponding text classification label, input their syntactic information together with the preset regular expression paradigm into the large language model, generate several black regular expressions, and store them in the regular expression set of the text classification label.
In operation, auditing the regular expressions improves the robustness of the white regular expressions and the classification accuracy; generating black regular expressions through the model's secondary classification judgment filters out incorrectly matched text classification labels and further improves accuracy.
In a specific implementation, the following prompt may be input into the large language model to generate a black regular expression:
"For the mismatched annotation data 1, on the basis of the matching regular expression, combine the syntactic information to generate a more restrictive regular expression that matches this question, and return only the regular expression."
In this embodiment, the large language model is preferably the ChatGLM-6B dialogue model, but any other existing large language model may be used.
Referring to fig. 4, preferably, in the step S4, matching is performed between the text data and a plurality of white regular expressions in the regular expression set, filtering is performed between text classification labels that are not matched with the text data according to a plurality of black regular expressions in the regular expression set, and then when matched text classification labels are added to the text data, the following steps are adopted:
F1: traversing the white regular expressions of all text classification labels to match with the text data to be classified, turning to step F2 when the matched white regular expressions exist in the text data, and ending the text classification when the matched white regular expressions do not exist in the text data;
f2: traversing all black regular expressions of text classification labels matched with the text data, marking the text data by adopting the text classification labels when no matched black regular expression exists and the text data is matched with only one white regular expression of the text classification labels, switching to a step F3 when no matched black regular expression exists and the text data is matched with a plurality of white regular expressions of a plurality of text classification labels, and switching to a step F4 when the matched black regular expression exists;
F3: generating semantic vector representation for the text data by adopting a preset sentence vector reasoning model, recalling matched annotation data in a semantic space according to the vector representation of the text data, selecting a plurality of annotation data with text classification labels, and then acquiring the mode of the text classification labels and setting the mode as the text classification label of the text data;
F4: after filtering text classification labels with matched black regular expressions, when a plurality of white regular expressions matched with a plurality of text classification labels exist, turning to step F3, when the text classification labels are matched with the white regular expressions of one text classification label, marking the text data by adopting the text classification labels, and when the matched text classification labels do not exist, ending the text classification.
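The matching flow F1–F4 above can be sketched as follows. The label names and regex patterns are illustrative assumptions, not the expressions generated in the embodiment, and step F3's semantic-neighbour vote is reduced to a deterministic stub; steps F2 and F4 are merged into a single black-regex veto pass for brevity.

```python
import re

# Illustrative rule sets; label names and patterns are hypothetical examples.
WHITE_REGEXES = {
    "allergy": [r"allerg(y|ic|en)", r"skin\s+reaction"],
    "dryness": [r"\bdry\b", r"dehydrat"],
}
BLACK_REGEXES = {
    # Vetoes false matches for the "dryness" label, e.g. "dry shampoo".
    "dryness": [r"dry\s+shampoo"],
}

def vote_with_neighbors(text, candidate_labels):
    # Stand-in for step F3: the embodiment recalls labelled neighbours in
    # semantic space and takes the mode of their labels; here we just pick
    # the alphabetically first candidate as a placeholder.
    return sorted(candidate_labels)[0]

def classify(text):
    # F1: collect labels with a matching white regex; none ends classification.
    matched = {lbl for lbl, pats in WHITE_REGEXES.items()
               if any(re.search(p, text, re.IGNORECASE) for p in pats)}
    if not matched:
        return None
    # F2/F4: drop labels vetoed by a matching black regex.
    survivors = {lbl for lbl in matched
                 if not any(re.search(p, text, re.IGNORECASE)
                            for p in BLACK_REGEXES.get(lbl, []))}
    if not survivors:
        return None
    if len(survivors) == 1:
        return survivors.pop()
    # F3: remaining ambiguity is resolved by semantic-neighbour voting.
    return vote_with_neighbors(text, survivors)

print(classify("is this cream allergenic"))   # -> allergy
print(classify("dry shampoo instructions"))   # -> None (black-regex veto)
```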
Embodiment two:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further comprises the step of training a sentence vector reasoning model:
c1: setting questions with the same answer or the same kind of answer in question and answer data as positive samples, and setting questions with different answers or different kinds of questions in question and answer data as negative samples;
C2: and combining a plurality of questions corresponding to the same answer into positive sample pairs in pairs, and training to obtain a sentence vector reasoning model by adopting the positive sample pairs and the negative sample fine tuning base model through a comparison learning method.
In operation, fine-tuning the sentence vector reasoning model on a suitably constructed training set improves its fit to the text data to be classified, so that the generated semantic vector representations are more accurate and more discriminative, making positive and negative samples easy to distinguish.
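The pair construction of steps C1–C2 can be sketched as follows, assuming question-answer data given as (question, answer) tuples and simplifying "same kind of answer" to exact answer equality:

```python
from itertools import combinations

def build_training_pairs(qa_data):
    """Group questions by their answer (step C1), then combine the questions
    sharing an answer pairwise into positive sample pairs (step C2).
    Questions grouped under different answers serve as negatives for one
    another during contrastive fine-tuning."""
    by_answer = {}
    for question, answer in qa_data:
        by_answer.setdefault(answer, []).append(question)
    positive_pairs = []
    for questions in by_answer.values():
        positive_pairs.extend(combinations(questions, 2))
    return by_answer, positive_pairs

qa = [
    ("is this cream allergenic", "A1"),
    ("will the cream cause allergies", "A1"),
    ("can sensitive skin use this cream", "A1"),
    ("how do I store the cream", "A2"),
]
groups, pos = build_training_pairs(qa)
print(len(pos))  # 3 questions share answer A1 -> C(3,2) = 3 positive pairs
```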
In this embodiment, the base model is preferably the ERNIE bidirectional semantic representation model. Before the base model is fine-tuned, the training data are converted into inputs the model can accept by word segmentation, adding special start and end markers, and performing the necessary padding or truncation. Let a selected positive sample pair be $(q_i, q_i^+)$ and the selected negative sample be $q_i^-$; their ERNIE vector encodings are recorded as $u_i$, $u_i^+$ and $u_i^-$. The similarity function is defined as the cosine similarity, namely:

$$\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

The triplet loss function in the training process can be defined as:

$$L_i = \max\bigl(0,\ \operatorname{sim}(u_i, u_i^-) - \operatorname{sim}(u_i, u_i^+) + m\bigr)$$

wherein $m$ is a margin whose role is to pull apart the similarity of the positive pair from the similarity of the negative pair; its value can be decayed dynamically as the number of training iterations increases.

In the fine-tuning optimization process, the model parameters $\theta$ are sought that minimize the sum of the loss functions over all sample pairs, namely:

$$\theta^{*} = \arg\min_{\theta} \sum_i L_i(\theta)$$

wherein $\theta$ denotes the parameters of the ERNIE model.
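A minimal numeric sketch of the cosine similarity and triplet loss used above, with toy two-dimensional vectors standing in for ERNIE sentence encodings and an arbitrary margin value:

```python
import math

def cosine_sim(u, v):
    # sim(u, v) = u.v / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # L = max(0, sim(anchor, negative) - sim(anchor, positive) + margin):
    # the loss is zero once the positive pair is more similar than the
    # negative pair by at least the margin. In the embodiment the margin
    # can be decayed as training iterations increase.
    return max(0.0, cosine_sim(anchor, negative)
                    - cosine_sim(anchor, positive) + margin)

# A well-separated triplet incurs no loss.
anchor, positive, negative = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(triplet_loss(anchor, positive, negative))  # -> 0.0
```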
Embodiment III:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further includes the step of creating a vector index library:
D1: generating semantic vector representation for the questions in the question-answering data by adopting a sentence vector reasoning model;
D2: adopting a hash algorithm to generate a corresponding identifier for the question in each question-answer datum;
D3: vector representations of a number of questions are stored as a vector index library in one-to-one correspondence with corresponding identifiers.
In operation, establishing an efficient, quickly searchable vector index library improves recall efficiency and training efficiency, further shortening the training period.
In a specific implementation, the hash algorithm is preferably the SHA-256 hash algorithm, and the FAISS library can be used as the framework to build the vector index library; of course, any other existing hash algorithm or library may be adopted.
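Steps D1–D3 can be sketched as follows. This is a minimal in-memory stand-in: SHA-256 identifiers as in the embodiment, but a brute-force cosine search in place of FAISS, and toy vectors in place of the sentence vector reasoning model's output.

```python
import hashlib
import math

def question_id(question):
    # Step D2: a SHA-256 digest of the question text serves as its identifier.
    return hashlib.sha256(question.encode("utf-8")).hexdigest()

class VectorIndex:
    """Minimal stand-in for the vector index library; the embodiment prefers
    FAISS, which would replace the brute-force search below at scale."""
    def __init__(self):
        self.entries = {}  # identifier -> (question, vector), step D3

    def add(self, question, vector):
        self.entries[question_id(question)] = (question, vector)

    def search(self, query_vec, top_k=1):
        def cos(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            return dot / (math.sqrt(sum(a * a for a in u))
                          * math.sqrt(sum(b * b for b in v)))
        ranked = sorted(self.entries.values(),
                        key=lambda e: cos(query_vec, e[1]), reverse=True)
        return [q for q, _ in ranked[:top_k]]

index = VectorIndex()
index.add("is this cream allergenic", [1.0, 0.0])
index.add("how do I store the cream", [0.0, 1.0])
print(index.search([0.9, 0.1]))  # -> ['is this cream allergenic']
```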
Embodiment four:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further includes the step of establishing a keyword lexicon:
E1: performing word segmentation on the labeling data and the question-answer data with a word segmentation tool, obtaining a plurality of words and storing them as a data set;
E2: analyzing the data set with the TF-IDF algorithm and assigning a weight to each word;
E3: setting words whose weight exceeds a preset keyword threshold as keywords;
E4: training on the data set with a word embedding technology to obtain a vector representation of each word;
E5: expanding the keywords by screening semantically matched words based on the similarity of their vector representations, and storing the result as the keyword lexicon.
In operation, establishing the keyword lexicon realizes word segmentation of the training data and keyword extraction, and provides semantic clues to the large language model when it generates regular expressions.
In a specific implementation, the word segmentation tool is preferably jieba and the word embedding technology is preferably FastText or GloVe; any existing word segmentation tool or word embedding technology can be selected according to actual requirements.
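Steps E1–E3 can be sketched with a plain TF-IDF computation. The documents below are assumed to be already word-segmented (a tool such as jieba would perform this for Chinese text), and the threshold value is an arbitrary illustration, not the preset keyword threshold of the embodiment.

```python
import math
from collections import Counter

def tfidf_keywords(documents, threshold=0.1):
    """Steps E1-E3 in miniature: documents are lists of segmented words;
    a word becomes a keyword if its TF-IDF weight in any document exceeds
    the threshold. Words appearing in every document get IDF 0 and are
    therefore never keywords."""
    n_docs = len(documents)
    # Document frequency: number of documents each word appears in.
    df = Counter(w for doc in documents for w in set(doc))
    keywords = set()
    for doc in documents:
        tf = Counter(doc)
        for word, count in tf.items():
            weight = (count / len(doc)) * math.log(n_docs / df[word])
            if weight > threshold:
                keywords.add(word)
    return keywords

docs = [
    ["cream", "allergy", "skin"],
    ["cream", "storage", "shelf"],
    ["cream", "dry", "skin"],
]
print(tfidf_keywords(docs))  # "cream" appears everywhere, so it is excluded
```

Steps E4–E5 would then embed each word (e.g. with FastText) and expand the keyword set by vector similarity.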
Fifth embodiment:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
In the step S3, when the large language model is adopted to judge the semantic integrity of the text data, the large language model judges the semantic integrity of the text data to be classified using a preset small-sample learning corpus and chain-of-thought prompt words.
In operation, inputting the small-sample learning corpus and the chain-of-thought prompt words to the large language model fully exerts its chain-of-thought capability, so that the model comprehensively evaluates the semantic integrity of the text data with high accuracy.
In a specific implementation, the chain-of-thought prompt words may describe objects, parts, attributes and the like; for example, the following chain-of-thought prompt words can be adopted to judge semantic integrity:
"Example sentence 1: is the face cream allergenic;
chain-of-thought prompt words: the key point of this sentence is that the customer asks whether the face cream causes allergy; the object is the face cream and the attribute queried is allergenicity, so the customer's query intention is completely expressed;
conclusion: complete.
Example sentence 2: my skin is too dry;
chain-of-thought prompt words: the key point of this sentence is that the customer states his or her own skin condition; the object described is the customer, the part is the skin, and the attribute is dryness, so the customer's query intention is completely expressed;
conclusion: complete.
Example sentence 3: I dislike the feeling of wetness on my face;
chain-of-thought prompt words: the key point of this sentence is that the customer states a personal preference; the object described is the customer, the part is the face, the attribute is wetness, and the emotion is dislike, so the customer's query intention is completely expressed;
conclusion: complete.
Example sentence 4: what is meant by instant? Can it be used;
chain-of-thought prompt words: the key point of this sentence is that the customer asks about the usability of a product, but the description of the product is unclear and the purpose of the query is unclear; the emotion is negative, the expression may be incomplete, or the customer may be asking about the function and efficacy of the product, so the user's query intention at this moment needs to be analyzed in combination with the context;
conclusion: incomplete.
Example sentence 5: none of yours work;
chain-of-thought prompt words: the key point of this sentence is that the customer makes a negative statement about the product, but the description of the product is unclear; the emotion is negative, the expression may be incomplete, or the customer may be asking about the function and efficacy of the product, so the user's query intention at this moment needs to be analyzed in combination with the context;
conclusion: incomplete.
Combining the semantic integrity reasoning process and the final conclusions of the above example sentences, reason in detail about the semantic integrity of the following sentence and give a conclusion, complete or incomplete: text data."
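The assembly of such a chain-of-thought prompt can be sketched as follows; the few-shot examples and their wording here are condensed hypothetical stand-ins for the preset small-sample learning corpus.

```python
# Hypothetical few-shot examples; in the embodiment these would come from the
# preset small-sample learning corpus with chain-of-thought annotations.
EXAMPLES = [
    ("my skin is too dry",
     "the customer states a skin condition: object customer, part skin, "
     "attribute dryness; the query intention is fully expressed",
     "complete"),
    ("can it be used",
     "the product description is unclear and the query purpose is unclear; "
     "context is needed to analyse the intention",
     "incomplete"),
]

def build_integrity_prompt(text_data):
    """Format the few-shot examples and the sentence to be judged into a
    single chain-of-thought prompt for the large language model."""
    parts = []
    for i, (sentence, reasoning, verdict) in enumerate(EXAMPLES, 1):
        parts.append(f"Example sentence {i}: {sentence};\n"
                     f"chain-of-thought prompt words: {reasoning};\n"
                     f"conclusion: {verdict}.")
    parts.append("Combining the reasoning process and conclusions above, "
                 "reason in detail about the semantic integrity of the "
                 "following sentence and give a conclusion, complete or "
                 f"incomplete: {text_data}")
    return "\n".join(parts)

prompt = build_integrity_prompt("is this face cream allergenic")
print(prompt.splitlines()[0])  # -> Example sentence 1: my skin is too dry;
```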
The beneficial technical effects of the invention include: the text classification method can quickly and accurately classify text data using the regular expression set generated by the large language model; meanwhile, incorrectly matched text classification labels can be filtered out according to the white and black regular expressions matched against the text data, further improving classification accuracy.
While the invention has been described in terms of embodiments, it will be appreciated by those skilled in the art that the invention is not limited thereto but rather includes the drawings and the description of the embodiments above. Any modifications which do not depart from the functional and structural principles of the present invention are intended to be included within the scope of the appended claims.
Claims (9)
1. A text classification method of regular expressions generated based on a large language model is characterized by comprising the following steps:
s1: initializing a text classification method, defining text classification labels, generating a regular expression set comprising regular expressions of each of a plurality of text classification labels by adopting a large language model, setting the audited regular expressions as white regular expressions, and then generating corresponding black regular expressions by adopting the large language model according to the white regular expressions;
S2: acquiring text data to be classified;
s3: judging the semantic integrity of the text data by adopting a large language model, and filtering text data with incomplete semantics;
S4: matching the text data according to a plurality of white regular expressions in the regular expression set, filtering text classification labels which are not matched with the text data according to a plurality of black regular expressions in the regular expression set, and adding the matched text classification labels to the text data, wherein the following steps are adopted:
F1: traversing the white regular expressions of all text classification labels to match with the text data to be classified, turning to step F2 when the matched white regular expressions exist in the text data, and ending the text classification when the matched white regular expressions do not exist in the text data;
F2: traversing all black regular expressions of the text classification labels matched with the text data; when no black regular expression matches and the text data matches the white regular expressions of only one text classification label, labeling the text data with that text classification label; when no black regular expression matches and the text data matches white regular expressions of a plurality of text classification labels, turning to step F3; and when a matched black regular expression exists, turning to step F4;
F3: generating semantic vector representation for the text data by adopting a preset sentence vector reasoning model, recalling matched annotation data in a semantic space according to the vector representation of the text data, selecting a plurality of annotation data with text classification labels, and then acquiring the mode of the text classification labels and setting the mode as the text classification label of the text data;
F4: after filtering text classification labels with matched black regular expressions, when a plurality of white regular expressions matched with a plurality of text classification labels exist, turning to step F3, when the text classification labels are matched with the white regular expressions of one text classification label, marking the text data by adopting the text classification labels, and when the matched text classification labels do not exist, ending the text classification.
2. The text classification method of regular expressions generated based on a large language model according to claim 1, further comprising the steps of obtaining labeling data with text classification labels and question-answer data without text classification labels, wherein each question-answer data comprises questions and answers, screening the question-answer data through a plurality of preset preprocessing rules, and filtering the question-answer data irrelevant to the text classification.
3. The method for classifying text according to claim 2, wherein in the step S1, when a large language model is used to generate a regular expression set including regular expressions of each of a plurality of text classification labels, the following steps are adopted:
A1: generating semantic vector representation for the annotation data by adopting a preset sentence vector reasoning model;
A2: recalling matched question-answer data in a semantic space in a preset vector index library according to the vector representation of each annotation data;
A3: inputting the recalled question-answer data into a large language model for secondary classification judgment, filtering the question-answer data which is not matched with the semantics of the corresponding text classification label, marking the question-answer data which is matched with the semantics of the corresponding text classification label, and setting the marked data;
A4: classifying the labeling data belonging to the same text classification tag through a preset keyword word library, generating a syntax tree for the labeling data by adopting a syntax analysis tool, capturing the syntax information and semantic information of the labeling data through the syntax tree of each labeling data, and storing the syntax information and semantic information in one-to-one correspondence with the labeling data;
A5: and inputting a plurality of marking data belonging to the same keyword, respective syntax information of the plurality of marking data and a preset regular expression paradigm into the large language model, generating a plurality of regular expressions, and storing the plurality of regular expressions into a regular expression set of text classification labels corresponding to the keyword.
4. The method for classifying text based on regular expressions generated by a large language model according to claim 3, wherein in the step S1, when the audited regular expressions are set to be white regular expressions, and then corresponding black regular expressions are generated by using the large language model according to the white regular expressions, the following steps are adopted:
B1: the generated regular expressions are set to be white regular expressions after passing the auditing and are stored in a regular expression set of the text classification label;
B2: carrying out regular expression matching by adopting a white regular expression of a text classification label and question-answer data without the text classification label;
B3: inputting the matched question-answer data into the large language model for a secondary classification judgment, screening out the question-answer data whose semantics do not match the corresponding text classification label, acquiring the syntax information of that question-answer data, inputting the syntax information together with a preset regular expression paradigm into the large language model to generate a plurality of black regular expressions, and storing the black regular expressions into the regular expression set of the text classification label.
5. A method for text classification of regular expressions generated based on a large language model as claimed in claim 3, wherein the training of the sentence vector inference model comprises the steps of:
C1: setting questions in the question-answer data that have the same answer or the same kind of answer as positive samples, and setting questions that have different answers or belong to different kinds as negative samples;
C2: combining the plurality of questions corresponding to the same answer pairwise into positive sample pairs, and fine-tuning a base model with the positive sample pairs and the negative samples through contrastive learning to obtain the sentence vector reasoning model.
6. A method for classifying text based on regular expressions generated by a large language model as claimed in claim 3, wherein when the vector index library is built, the following steps are adopted:
D1: generating semantic vector representation for the questions in the question-answering data by adopting a sentence vector reasoning model;
D2: adopting a hash algorithm to generate a corresponding identifier for the question in each question-answer datum;
D3: vector representations of a number of questions are stored as a vector index library in one-to-one correspondence with corresponding identifiers.
7. The text classification method of regular expressions generated based on a large language model as claimed in claim 3, wherein when the keyword lexicon is built, the following steps are adopted:
E1: performing word segmentation on the labeling data and the question-answer data with a word segmentation tool, obtaining a plurality of words and storing them as a data set;
E2: analyzing the data set with the TF-IDF algorithm and assigning a weight to each word;
E3: setting words whose weight exceeds a preset keyword threshold as keywords;
E4: training on the data set with a word embedding technology to obtain a vector representation of each word;
E5: expanding the keywords by screening semantically matched words based on the similarity of their vector representations, and storing the result as the keyword lexicon.
8. The text classification method of regular expressions generated based on large language models according to claim 1, wherein in the step S2, after obtaining text data to be classified, further comprising a step of data preprocessing, filtering the text data to be classified by a plurality of preset preprocessing rules, and filtering text data irrelevant to the current text classification.
9. The method for classifying text based on regular expressions generated by a large language model according to claim 1, wherein in the step S3, when the large language model is used to determine the semantic integrity of the text data, a preset small-sample learning corpus and preset chain-of-thought prompt words are input to the large language model, and the large language model determines the semantic integrity of the text data to be classified using the chain-of-thought prompt words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410034646.5A CN117556049B (en) | 2024-01-10 | 2024-01-10 | Text classification method of regular expression generated based on large language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117556049A CN117556049A (en) | 2024-02-13 |
CN117556049B true CN117556049B (en) | 2024-05-17 |
Family
ID=89820826
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410034646.5A Active CN117556049B (en) | 2024-01-10 | 2024-01-10 | Text classification method of regular expression generated based on large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117556049B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217717A (en) * | 2013-05-29 | 2014-12-17 | 腾讯科技(深圳)有限公司 | Language model constructing method and device |
CN106446230A (en) * | 2016-10-08 | 2017-02-22 | 国云科技股份有限公司 | Method for optimizing word classification in machine learning text |
CN108182234A (en) * | 2017-12-27 | 2018-06-19 | 中科鼎富(北京)科技发展有限公司 | Regular expression screening technique and device |
CN112364660A (en) * | 2020-10-27 | 2021-02-12 | 中国平安人寿保险股份有限公司 | Corpus text processing method and device, computer equipment and storage medium |
CN113111234A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm condition category determination method and device |
CN113761903A (en) * | 2020-06-05 | 2021-12-07 | 国家计算机网络与信息安全管理中心 | Text screening method for high-volume high-noise spoken short text |
CN114595332A (en) * | 2022-03-30 | 2022-06-07 | 阳光保险集团股份有限公司 | Text classification prediction method and device and electronic equipment |
CN114818891A (en) * | 2022-04-14 | 2022-07-29 | 人民网股份有限公司 | Small sample multi-label text classification model training method and text classification method |
CN116561311A (en) * | 2023-04-21 | 2023-08-08 | 武汉大学 | Automatic classification method for quotation text based on large language model |
US11748577B1 (en) * | 2022-08-22 | 2023-09-05 | Rohirrim, Inc. | Computer-generated content based on text classification, semantic relevance, and activation of deep learning large language models |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017039603A1 (en) * | 2015-08-31 | 2017-03-09 | Hewlett Packard Enterprise Development Lp | Domain classification |
CN112487182B (en) * | 2019-09-12 | 2024-04-12 | 华为技术有限公司 | Training method of text processing model, text processing method and device |
US20230419037A1 (en) * | 2022-06-24 | 2023-12-28 | Salesforce, Inc. | Systems and methods for text classification using label modular prompts |
2024-01-10: application CN202410034646.5A granted as patent CN117556049B (status: active)
Non-Patent Citations (4)
Title |
---|
Enabling Digital Transformation through Business Text Classification with Small Datasets; Muhammad Arslan et al.; 2023 15th International Conference on Innovations in Information Technology (IIT); 2023-12-25; pp. 38-42 *
Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs; Pena, A. et al.; arXiv; 2023-09-01; pp. 1-12 *
Research on Pre-trained Language Models Based on Medical Big Data and Their Application to Medical Text Classification; Huang Minting et al.; Chinese Journal of Medical Library and Information Science; 2020-12-31; pp. 39-46 *
Research on Classification and Filtering Methods for Large-Scale Short Texts; Wu Wei; China Master's Theses Full-text Database (Information Science and Technology); 2007-05-15; pp. I138-1565 *
Also Published As
Publication number | Publication date |
---|---|
CN117556049A (en) | 2024-02-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||