CN117556049B - Text classification method of regular expression generated based on large language model - Google Patents


Info

Publication number
CN117556049B
CN117556049B (application CN202410034646.5A)
Authority
CN
China
Prior art keywords
data
text
text classification
regular expressions
matched
Prior art date
Legal status (the status listed is an assumption, not a legal conclusion)
Active
Application number
CN202410034646.5A
Other languages
Chinese (zh)
Other versions
CN117556049A (en)
Inventor
谭光华
陈禹
林庭羽
Current Assignee
Hangzhou Guangyun Technology Co ltd
Original Assignee
Hangzhou Guangyun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Guangyun Technology Co ltd
Priority: CN202410034646.5A
Publication of CN117556049A
Application granted
Publication of CN117556049B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30: Semantic analysis
    • G06N5/041: Inference or reasoning models; abduction
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of text classification, and in particular to a text classification method using regular expressions generated by a large language model, comprising the following steps. S1: initialize the text classification method, define text classification labels, and use a large language model to generate a regular expression set containing white regular expressions and black regular expressions for a plurality of classification labels. S2: acquire the text data to be classified. S3: use a large language model to judge the semantic integrity of the text data, and filter out text data with incomplete semantics. S4: match the text data against the white regular expressions in the regular expression set, use the black regular expressions in the set to filter out incorrectly matched classification labels, and then add the matched text classification labels to the text data. The invention classifies text data with the regular expression set and achieves high classification accuracy.

Description

Text classification method using regular expressions generated by a large language model
Technical Field
The invention relates to the technical field of text classification, and in particular to a text classification method using regular expressions generated by a large language model.
Background
With the rapid development of big data and machine learning technology, text classification has become an important research direction in the field of natural language processing. Traditional text classification methods mainly rely on model training and keyword extraction, which generally require massive training data to ensure classification accuracy; see, for example, the classification method for ultra-long Chinese text based on a large language model disclosed in Chinese patent publication CN116821348A.
Disclosure of Invention
The technical problem the invention aims to solve is that existing text classification methods depend heavily on data, and classification by keyword extraction has low accuracy.
In order to solve the above technical problems, the invention adopts the following technical scheme: a text classification method using regular expressions generated by a large language model, comprising the following steps:
S1: initializing the text classification method, defining text classification labels, generating a regular expression set comprising regular expressions for each of a plurality of text classification labels by means of a large language model, setting the audited regular expressions as white regular expressions, and then generating corresponding black regular expressions from the white regular expressions with the large language model;
S2: acquiring the text data to be classified;
S3: judging the semantic integrity of the text data with a large language model, and filtering out text data with incomplete semantics;
S4: matching the text data against the white regular expressions in the regular expression set, using the black regular expressions in the set to filter out incorrectly matched text classification labels, and then adding the matched text classification labels to the text data.
In operation, the regular expression set generated by the large language model allows text data to be classified quickly and accurately; matching the text data against both the white and the black regular expressions filters out incorrectly matched text classification labels and further improves classification accuracy.
Preferably, the method further comprises obtaining annotation data with text classification labels and question-answer data without text classification labels, each question-answer item comprising a question and an answer, screening the question-answer data with several preset preprocessing rules, and filtering out question-answer data irrelevant to the text classification task.
In operation, filtering out irrelevant question-answer data speeds up training and shortens the training period.
Preferably, in the step S1, when a large language model is used to generate a regular expression set including regular expressions of each of a plurality of text classification labels, the following steps are adopted:
A1: generating semantic vector representation for the annotation data by adopting a preset sentence vector reasoning model;
A2: recalling matched question-answer data in a semantic space in a preset vector index library according to the vector representation of each annotation data;
A3: Inputting the recalled question-answer data into a large language model for a secondary classification judgment, filtering out question-answer data whose semantics do not match the corresponding text classification label, and setting the semantically matched question-answer data as annotation data;
A4: classifying the labeling data belonging to the same text classification tag through a preset keyword word library, generating a syntax tree for the labeling data by adopting a syntax analysis tool, capturing the syntax information and semantic information of the labeling data through the syntax tree of each labeling data, and storing the syntax information and semantic information in one-to-one correspondence with the labeling data;
A5: and inputting a plurality of marking data belonging to the same keyword, respective syntax information of the plurality of marking data and a preset regular expression paradigm into the large language model, generating a plurality of regular expressions, and storing the plurality of regular expressions into a regular expression set of text classification labels corresponding to the keyword.
In operation, the small-sample learning capability of the large language model for generating regular expressions is fully exploited; based on the regular expression paradigm, regular expressions for specific texts can be generated quickly and efficiently, converting the semantic understanding capability of the large language model into classification rules, with a high degree of freedom and high classification accuracy.
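The recall in step A2 can be sketched as a cosine-similarity search over the vector index library. The linear scan and the plain-list vectors below are simplifying assumptions; a production system would use an approximate nearest-neighbour index instead.

```python
# Illustrative sketch of step A2: recall the closest questions in the
# vector index library by cosine similarity (linear scan for clarity).

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def recall(query_vec, index, top_k=3):
    """index maps question identifiers to semantic vectors; returns the
    identifiers of the top_k most similar stored questions."""
    ranked = sorted(index, key=lambda i: cosine(query_vec, index[i]),
                    reverse=True)
    return ranked[:top_k]
```

Here `recall` returns the identifiers whose stored vectors lie closest to the query vector; the matched question-answer data would then be looked up by identifier.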
Preferably, in the step S1, the audited regular expression is set to be a white regular expression, and then when a corresponding black regular expression is generated by using a large language model according to the white regular expression, the following steps are adopted:
B1: the generated regular expressions are set to be white regular expressions after passing the auditing and are stored in a regular expression set of the text classification label;
B2: carrying out regular expression matching by adopting a white regular expression of a text classification label and question-answer data without the text classification label;
B3: and inputting the matched question-answer data into a large language model for secondary classification judgment, screening out question-answer data which is not matched with the semantics of the corresponding text classification label, acquiring the syntax information of the question-answer data and a preset regular expression normal form, inputting the syntax information and the preset regular expression normal form into the large language model, generating a plurality of black regular expressions, and storing the black regular expressions into a regular expression set of the text classification label.
In operation, auditing the regular expressions improves the robustness of the white regular expressions and raises classification accuracy; meanwhile, black regular expressions generated through the large language model's secondary classification judgment filter out incorrectly matched text classification labels, further improving text classification accuracy.
Preferably, when training the sentence vector reasoning model, the following steps are adopted:
C1: Setting questions in the question-answer data that share the same answer, or answers of the same kind, as positive samples, and questions with different answers, or of different kinds, as negative samples;
C2: and combining a plurality of questions corresponding to the same answer into positive sample pairs in pairs, and training to obtain a sentence vector reasoning model by adopting the positive sample pairs and the negative sample fine tuning base model through a comparison learning method.
In operation, fine-tuning the sentence vector reasoning model on a suitably constructed training set improves its fit to the text data to be classified, makes the generated semantic vector representations more accurate and more discriminative, and eases the separation of positive and negative samples.
Preferably, when the vector index library is established, the following steps are adopted:
D1: generating semantic vector representation for the questions in the question-answering data by adopting a sentence vector reasoning model;
D2: Generating a corresponding identifier for the question in each question-answer item with a hash algorithm;
D3: vector representations of a number of questions are stored as a vector index library in one-to-one correspondence with corresponding identifiers.
In operation, establishing an efficient, quickly searchable vector index library improves recall efficiency and training efficiency, and further shortens the training period.
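Steps D1 to D3 can be sketched as follows. The hash choice (SHA-256) and the toy letter-frequency embedding are assumptions standing in for the patent's unspecified hash algorithm and the sentence vector reasoning model.

```python
import hashlib

def question_id(question: str) -> str:
    """D2: derive a deterministic identifier for a question by hashing."""
    return hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]

def embed(question: str) -> list[float]:
    """Stand-in for the sentence vector reasoning model of step D1:
    a normalised letter-frequency vector (illustrative only)."""
    vec = [0.0] * 26
    for ch in question.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def build_index(questions: list[str]) -> dict[str, list[float]]:
    """D3: store each question's vector keyed one-to-one by identifier."""
    return {question_id(q): embed(q) for q in questions}
```

Because the identifier is a pure function of the question text, duplicate questions collapse onto a single entry, which keeps the index consistent across rebuilds.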
Preferably, the keyword lexicon is established with the following steps:
E1: Segmenting the annotation data and the question-answer data with a word segmentation tool, and storing the resulting words as a data set;
E2: Analyzing the data set with the TF-IDF algorithm and assigning a weight to each word;
E3: Setting words whose weight exceeds a preset keyword threshold as keywords;
E4: Training the data set with a word embedding technique to obtain a vector representation of each word;
E5: Expanding the keywords by screening semantically matched words based on the similarity of their vector representations, and storing the result as the keyword lexicon.
In operation, establishing the keyword lexicon enables segmentation of the training data and keyword extraction, and provides semantic cues to the large language model when generating regular expressions.
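Steps E2 and E3 can be sketched with a plain TF-IDF computation. Documents are assumed to arrive pre-segmented (step E1), and the embedding-based expansion of steps E4 and E5 is omitted here; the threshold value is illustrative.

```python
import math
from collections import Counter

def tfidf_keywords(docs: list[list[str]], threshold: float) -> set[str]:
    """E2-E3: score each word with TF-IDF and keep words whose score in
    any document exceeds the keyword threshold."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    keywords = set()
    for doc in docs:
        tf = Counter(doc)
        for word, count in tf.items():
            score = (count / len(doc)) * math.log(n_docs / df[word])
            if score > threshold:
                keywords.add(word)
    return keywords
```

A word that is frequent in one document but rare across the corpus (high TF, high IDF) scores highest, which matches the intent of step E3.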
Preferably, in the step S2, after obtaining the text data to be classified, the method further includes a step of data preprocessing, and the text data to be classified is filtered by a plurality of preset preprocessing rules, so as to filter text data irrelevant to the current text classification.
Preferably, in the step S3, when judging the semantic integrity of the text data, the large language model uses a preset small-sample learning corpus and preset chain-of-thought prompt words to judge the semantic integrity of the text data to be classified.
In operation, feeding the small-sample learning corpus and the chain-of-thought prompt words to the large language model brings its chain-of-thought capability into full play, allowing the model to comprehensively evaluate the semantic integrity of the text data with high accuracy.
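The prompt assembly for step S3 can be sketched as follows. The instruction wording and the few-shot examples are assumptions for illustration, not the patent's actual corpus or prompt.

```python
# Illustrative assembly of the step-S3 prompt: few-shot examples plus a
# chain-of-thought instruction for judging semantic completeness.

FEW_SHOT = [
    ("When will my package arrive?", "complete"),
    ("and also the", "incomplete"),
]

def build_integrity_prompt(text: str) -> str:
    lines = [
        "Decide whether each sentence is semantically complete.",
        "Think step by step, then answer 'complete' or 'incomplete'.",
    ]
    for example, label in FEW_SHOT:
        lines.append(f"Sentence: {example}\nAnswer: {label}")
    lines.append(f"Sentence: {text}\nAnswer:")
    return "\n".join(lines)
```

The resulting string would be sent to the large language model; text judged incomplete is filtered out before regular expression matching.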
Preferably, in the step S4, when matching the text data against the white regular expressions in the regular expression set, filtering out text classification labels whose black regular expressions match the text data, and then adding the matched text classification labels to the text data, the following steps are adopted:
F1: traversing the white regular expressions of all text classification labels to match with the text data to be classified, turning to step F2 when the matched white regular expressions exist in the text data, and ending the text classification when the matched white regular expressions do not exist in the text data;
F2: Traversing all black regular expressions of the text classification labels matched with the text data; when no black regular expression matches and the text data matches the white regular expressions of only one text classification label, marking the text data with that label; when no black regular expression matches but the text data matches white regular expressions of several text classification labels, going to step F3; and when a black regular expression matches, going to step F4;
F3: generating semantic vector representation for the text data by adopting a preset sentence vector reasoning model, recalling matched annotation data in a semantic space according to the vector representation of the text data, selecting a plurality of annotation data with text classification labels, and then acquiring the mode of the text classification labels and setting the mode as the text classification label of the text data;
F4: after filtering text classification labels with matched black regular expressions, when a plurality of white regular expressions matched with a plurality of text classification labels exist, turning to step F3, when the text classification labels are matched with the white regular expressions of one text classification label, marking the text data by adopting the text classification labels, and when the matched text classification labels do not exist, ending the text classification.
The beneficial technical effects of the invention include:
1. The regular expression set generated by the large language model allows text data to be classified quickly and accurately; matching the text data against both white and black regular expressions filters out incorrectly matched text classification labels and further improves classification accuracy.
2. The small-sample learning capability of the large language model for generating regular expressions is fully exploited; based on the regular expression paradigm, regular expressions for specific texts are generated quickly and efficiently, converting the semantic understanding capability of the large language model, with a high degree of freedom and high classification accuracy.
3. Auditing the regular expressions improves the robustness of the white regular expressions and raises classification accuracy; meanwhile, black regular expressions generated through the large language model's secondary classification judgment filter out incorrectly matched text classification labels, further improving text classification accuracy.
4. Fine-tuning the sentence vector reasoning model on a suitably constructed training set improves its fit to the text data to be classified, makes the generated semantic vector representations more accurate and more discriminative, and eases the separation of positive and negative samples.
5. Feeding the small-sample learning corpus and the chain-of-thought prompt words to the large language model brings its chain-of-thought capability into full play, allowing the model to comprehensively evaluate the semantic integrity of the text data with high accuracy.
Other features and advantages of the present invention will be disclosed in the following detailed description of the invention and the accompanying drawings.
Drawings
The invention is further described with reference to the accompanying drawings:
FIG. 1 is a flow chart of a text classification method based on regular expressions generated by a large language model;
FIG. 2 is a flow diagram of a large language model generating a regular expression set;
FIG. 3 is a flow chart of a large language model generating a white regular expression and a black regular expression;
FIG. 4 is a text classification flow chart according to the first embodiment.
Detailed Description
The technical solutions of the embodiments of the present invention will be explained and illustrated below with reference to the drawings of the embodiments of the present invention, but the following embodiments are only preferred embodiments of the present invention, and not all embodiments. Based on the examples in the embodiments, those skilled in the art can obtain other examples without making any inventive effort, which fall within the scope of the invention.
Embodiment one:
Referring to FIG. 1, this embodiment discloses a text classification method using regular expressions generated by a large language model, comprising the following steps:
S1: initializing the text classification method, defining text classification labels, generating a regular expression set comprising regular expressions for each of a plurality of text classification labels by means of a large language model, setting the audited regular expressions as white regular expressions, and then generating corresponding black regular expressions from the white regular expressions with the large language model;
S2: acquiring the text data to be classified;
S3: judging the semantic integrity of the text data with a large language model, and filtering out text data with incomplete semantics;
S4: matching the text data against the white regular expressions in the regular expression set, using the black regular expressions in the set to filter out incorrectly matched text classification labels, and then adding the matched text classification labels to the text data.
In operation of this embodiment, the regular expression set generated by the large language model allows text data to be classified quickly and accurately; matching the text data against both the white and the black regular expressions filters out incorrectly matched text classification labels and further improves classification accuracy.
Preferably, the method further comprises obtaining annotation data with text classification labels and question-answer data without text classification labels, each question-answer item comprising a question and an answer, screening the question-answer data with several preset preprocessing rules, and filtering out question-answer data irrelevant to the text classification task.
In operation of this embodiment, filtering out irrelevant question-answer data speeds up training and shortens the training period.
Preferably, in the step S2, after obtaining the text data to be classified, the method further includes a step of data preprocessing, and the text data to be classified is filtered by a plurality of preset preprocessing rules, so as to filter text data irrelevant to the current text classification.
In a specific implementation, the preset preprocessing rules may be defined manually. In this embodiment, the following rules may be used to filter question-answer data irrelevant to the current text classification: filter purely numeric data, links and non-Chinese data; filter long texts of more than 32 characters and short texts of fewer than 4 characters; filter invalid data such as order numbers, addresses and system messages; and filter general e-commerce questions, including questions about delivery, ordering, small talk and the like, using an existing set of general e-commerce questions.
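The length and character rules above can be sketched as follows. The concrete patterns (the URL check, the CJK range used to spot non-Chinese data) are assumptions, and the order-number, address and general-question filters are omitted.

```python
import re

def passes_filters(text: str) -> bool:
    """Return True if text survives the preprocessing rules sketched above."""
    if not 4 <= len(text) <= 32:          # drop short and long texts
        return False
    if text.isdigit():                    # drop purely numeric data
        return False
    if re.search(r"https?://", text):     # drop links
        return False
    if not re.search(r"[\u4e00-\u9fff]", text):  # drop non-Chinese data
        return False
    return True
```

Texts failing any rule are discarded before semantic-integrity judgment and regular expression matching.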
Referring to FIG. 2, preferably, in the step S1, when a large language model is used to generate a regular expression set including regular expressions of each of a plurality of text classification labels, the following steps are adopted:
A1: generating semantic vector representation for the annotation data by adopting a preset sentence vector reasoning model;
A2: recalling matched question-answer data in a semantic space in a preset vector index library according to the vector representation of each annotation data;
A3: Inputting the recalled question-answer data into a large language model for a secondary classification judgment, filtering out question-answer data whose semantics do not match the corresponding text classification label, and setting the semantically matched question-answer data as annotation data;
A4: classifying the labeling data belonging to the same text classification tag through a preset keyword word library, generating a syntax tree for the labeling data by adopting a syntax analysis tool, capturing the syntax information and semantic information of the labeling data through the syntax tree of each labeling data, and storing the syntax information and semantic information in one-to-one correspondence with the labeling data;
A5: and inputting a plurality of marking data belonging to the same keyword, respective syntax information of the plurality of marking data and a preset regular expression paradigm into the large language model, generating a plurality of regular expressions, and storing the plurality of regular expressions into a regular expression set of text classification labels corresponding to the keyword.
In operation, the small-sample learning capability of the large language model for generating regular expressions is fully exploited; based on the regular expression paradigm, regular expressions for specific texts can be generated quickly and efficiently, converting the semantic understanding capability of the large language model into classification rules, with a high degree of freedom and high classification accuracy.
In a specific implementation, the following prompt may be input into the large language model to generate regular expressions:
"Annotation data 1, regular expression 1;
Annotation data 2, regular expression 2;
Annotation data 3, regular expression 3.
Referring to the regular expressions of the annotation data above, and combining the keyword 'keyword 1' with the syntax information of each annotation data item, generate correct regular expressions for the following batch of annotation data, and return only the regular expressions:
Annotation data to generate 1, syntax information 1;
Annotation data to generate 2, syntax information 2;
Annotation data to generate 3, syntax information 3."
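The assembly of such a prompt can be sketched as below; the field layout and the instruction wording are assumptions based on the quoted template.

```python
def build_regex_prompt(examples, keyword, targets):
    """examples: (annotation data, regular expression) pairs;
    targets: (annotation data, syntax information) pairs."""
    lines = [f"{ann}, {rx};" for ann, rx in examples]
    lines.append(
        f"Referring to the regular expressions above and the keyword "
        f"'{keyword}', generate correct regular expressions for the "
        "following annotation data and return only the regular expressions:")
    lines += [f"{ann}, {syn};" for ann, syn in targets]
    return "\n".join(lines)
```

The returned string is what would be sent to the large language model; the model's reply is then parsed into candidate regular expressions for auditing.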
Referring to FIG. 3, preferably, in the step S1, the audited regular expression is set to be a white regular expression, and then when a large language model is used to generate a corresponding black regular expression according to the white regular expression, the following steps are adopted:
B1: the generated regular expressions are set to be white regular expressions after passing the auditing and are stored in a regular expression set of the text classification label;
B2: carrying out regular expression matching by adopting a white regular expression of a text classification label and question-answer data without the text classification label;
B3: and inputting the matched question-answer data into a large language model for secondary classification judgment, screening out question-answer data which is not matched with the semantics of the corresponding text classification label, acquiring the syntax information of the question-answer data and a preset regular expression normal form, inputting the syntax information and the preset regular expression normal form into the large language model, generating a plurality of black regular expressions, and storing the black regular expressions into a regular expression set of the text classification label.
In operation of this embodiment, auditing the regular expressions improves the robustness of the white regular expressions and raises classification accuracy; meanwhile, black regular expressions generated through the large language model's secondary classification judgment filter out incorrectly matched text classification labels, further improving text classification accuracy.
In a specific implementation, the following prompt may be input into the large language model to generate a black regular expression:
"For the mismatched annotation data 1, on the basis of the matched regular expression, generate a more restrictive regular expression in combination with the syntax information, to be used for matching this question, and return only the regular expression."
In this embodiment, the large language model is preferably the ChatGLM-6B dialogue language model, but any other existing large language model may be used.
Referring to FIG. 4, preferably, in the step S4, when matching the text data against the white regular expressions in the regular expression set, filtering out text classification labels whose black regular expressions match the text data, and then adding the matched text classification labels to the text data, the following steps are adopted:
F1: traversing the white regular expressions of all text classification labels to match with the text data to be classified, turning to step F2 when the matched white regular expressions exist in the text data, and ending the text classification when the matched white regular expressions do not exist in the text data;
F2: Traversing all black regular expressions of the text classification labels matched with the text data; when no black regular expression matches and the text data matches the white regular expressions of only one text classification label, marking the text data with that label; when no black regular expression matches but the text data matches white regular expressions of several text classification labels, going to step F3; and when a black regular expression matches, going to step F4;
F3: generating semantic vector representation for the text data by adopting a preset sentence vector reasoning model, recalling matched annotation data in a semantic space according to the vector representation of the text data, selecting a plurality of annotation data with text classification labels, and then acquiring the mode of the text classification labels and setting the mode as the text classification label of the text data;
F4: after filtering text classification labels with matched black regular expressions, when a plurality of white regular expressions matched with a plurality of text classification labels exist, turning to step F3, when the text classification labels are matched with the white regular expressions of one text classification label, marking the text data by adopting the text classification labels, and when the matched text classification labels do not exist, ending the text classification.
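The decision flow of steps F1 to F4 can be outlined in miniature as follows. This is an illustrative sketch rather than the patented implementation: the names (`classify`, `recall_mode_label`) are assumptions, and the black-regex filtering of steps F2 and F4 is merged into a single pass, which is behaviorally equivalent.

```python
import re

def classify(text, white_regexes, black_regexes, recall_mode_label):
    """Sketch of steps F1-F4: white regexes propose labels, black
    regexes veto false positives, and `recall_mode_label(text)` stands
    in for step F3 (semantic recall plus taking the label mode)."""
    # F1: collect labels whose white regular expressions match the text
    hits = {label for label, pats in white_regexes.items()
            if any(re.search(p, text) for p in pats)}
    if not hits:
        return None  # no white match: classification ends
    # F2/F4: drop labels whose black regular expressions also match
    survivors = {label for label in hits
                 if not any(re.search(p, text)
                            for p in black_regexes.get(label, []))}
    if len(survivors) == 1:
        return survivors.pop()          # exactly one label remains
    if len(survivors) > 1:
        return recall_mode_label(text)  # F3: disambiguate semantically
    return None  # every match was filtered out: classification ends
```

In use, `white_regexes` and `black_regexes` would hold the audited and LLM-generated pattern sets for each label.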
Embodiment two:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further comprises the step of training a sentence vector reasoning model:
c1: setting questions sharing the same answer, or answers of the same kind, in the question-and-answer data as positive samples, and setting questions with different answers, or answers of different kinds, as negative samples;
C2: combining the questions corresponding to the same answer pairwise into positive sample pairs, and fine-tuning the base model with the positive sample pairs and the negative samples through a contrastive learning method to obtain the sentence vector reasoning model.
In operation, fine-tuning the sentence vector reasoning model on a properly constructed training set improves its fit to the text data to be classified: the generated semantic vector representations are more accurate and more discriminative, making positive and negative samples easier to tell apart.
In this embodiment, the base model is preferably the ERNIE bidirectional semantic representation model. Before the base model is fine-tuned, the training data is converted into a form the model can accept by word segmentation, adding special start and end markers, and padding or truncating as necessary. Let a selected positive sample pair be denoted $(q_i, q_i^{+})$ and a selected negative sample be denoted $q_i^{-}$.
Denoting the ERNIE vector encoding of a question $q$ as $f_\theta(q)$, the similarity function is defined as cosine similarity, namely:
$$\mathrm{sim}(u,v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
The triplet loss function in the training process can be defined as:
$$L_i = \max\!\left(0,\; m - \mathrm{sim}\big(f_\theta(q_i), f_\theta(q_i^{+})\big) + \mathrm{sim}\big(f_\theta(q_i), f_\theta(q_i^{-})\big)\right)$$
wherein $m$ is the margin, whose effect is to pull apart the similarity of the positive pair from the similarity of the negative pair; its value can be dynamically decayed as the number of training iterations increases.
In the fine-tuning optimization process, the model parameters $\theta$ minimizing the sum of the loss functions over all sample pairs must be found, namely:
$$\theta^{*} = \arg\min_{\theta} \sum_i L_i$$
wherein $\theta$ denotes the parameters of the ERNIE model.
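As a rough numeric illustration of this cosine-similarity triplet objective (plain vectors stand in for ERNIE encodings, and the margin value is an assumption of this sketch):

```python
import numpy as np

def cosine_sim(u, v):
    # sim(u, v) = u.v / (|u| |v|), the similarity used in the loss
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: the anchor-positive similarity should
    exceed the anchor-negative similarity by at least `margin`.
    A sketch only; real fine-tuning backpropagates through the encoder."""
    return max(0.0,
               margin
               - cosine_sim(anchor, positive)
               + cosine_sim(anchor, negative))
```

In training, `margin` would be decayed across iterations as the text above describes.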
Embodiment III:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further includes the step of creating a vector index library:
D1: generating semantic vector representation for the questions in the question-answering data by adopting a sentence vector reasoning model;
d2: a hash algorithm is adopted to respectively generate a corresponding identifier for the questions in each question-answer data;
D3: vector representations of a number of questions are stored as a vector index library in one-to-one correspondence with corresponding identifiers.
In operation, establishing an efficient, quickly searchable vector index library improves recall efficiency and training efficiency, further shortening the training period.
In a specific implementation, the hash algorithm is preferably the SHA-256 hash algorithm, and the FAISS library can be used as the framework for building the vector index library; of course, any other existing hash algorithm or library can be adopted.
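A minimal sketch of steps D1 to D3, assuming the SHA-256 identifiers stated above; a brute-force cosine search stands in for a FAISS index here, and the class and method names are illustrative:

```python
import hashlib
import numpy as np

def question_id(question: str) -> str:
    # D2: derive a stable SHA-256 identifier for each question
    return hashlib.sha256(question.encode("utf-8")).hexdigest()

class VectorIndex:
    """Stores question vectors keyed one-to-one to their identifiers
    (D3) and recalls nearest neighbours by cosine similarity.  In
    practice a FAISS index (e.g. inner product over normalized vectors)
    would replace the brute-force search below."""

    def __init__(self):
        self.ids, self.vecs = [], []

    def add(self, question: str, vec: np.ndarray):
        self.ids.append(question_id(question))
        self.vecs.append(vec / np.linalg.norm(vec))  # normalize for cosine

    def search(self, vec: np.ndarray, k: int = 1):
        sims = np.array(self.vecs) @ (vec / np.linalg.norm(vec))
        top = np.argsort(-sims)[:k]                  # highest similarity first
        return [self.ids[i] for i in top]
```

The vectors added here would come from the sentence vector reasoning model of step D1.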
Embodiment four:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
The embodiment further includes the step of establishing a keyword lexicon:
e1: word segmentation is carried out on the labeling data and the question-answer data by adopting a word segmentation tool, a plurality of words are obtained and stored as a data set;
E2: analyzing the data set by adopting the TF-IDF algorithm, and assigning a weight to each word;
e3: setting vocabulary with weight value larger than a preset keyword threshold as keywords;
e4: training the data set by adopting a word embedding technology to obtain vector representation of each vocabulary;
E5: expanding the keywords, screening semantically matched words based on similarity of vector representation, and storing the words as a keyword word stock.
In operation, establishing a keyword lexicon segments the training data, extracts its keywords, and provides semantic clues to the large language model when regular expressions are generated.
In a specific implementation, the word segmentation tool is preferably jieba, and the word embedding technology is preferably FastText or GloVe; any existing word segmentation tool or word embedding technique can be selected according to actual requirements.
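Steps E1 to E3 can be outlined as below. This is a sketch under assumptions: the pre-tokenized input stands in for jieba's output, the TF-IDF weighting shown is one common variant (the method does not fix the exact formula), and the embedding-based expansion of steps E4 to E5 is omitted.

```python
import math
from collections import Counter

def tfidf_keywords(docs, threshold):
    """E1-E3 in miniature: `docs` are tokenized word lists; any word
    whose TF-IDF weight in some document exceeds `threshold` is kept
    as a keyword."""
    n = len(docs)
    # document frequency of each word across the data set
    df = Counter(w for doc in docs for w in set(doc))
    keywords = set()
    for doc in docs:
        tf = Counter(doc)
        for w, c in tf.items():
            weight = (c / len(doc)) * math.log(n / df[w])  # TF * IDF
            if weight > threshold:
                keywords.add(w)
    return keywords
```

In steps E4 and E5, FastText or GloVe vectors would then be used to pull in semantically similar words before storing the lexicon.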
Fifth embodiment:
The embodiment provides a text classification method of a regular expression generated based on a large language model, and the same points as the first embodiment are not described in detail.
In the step S3, when the large language model is adopted to judge the semantic integrity of the text data, the large language model uses the preset small sample learning corpus and the thinking chain prompt words to judge the semantic integrity of the text data to be classified.
When the embodiment works, the small sample learning corpus and the thinking chain prompt words are input to the large language model, so that the thinking chain capability of the large language model can be fully exerted, the large language model can comprehensively evaluate the semantic integrity of text data, and the accuracy is high.
In a specific implementation, the thinking chain prompt words can cover the object, part, attribute, and so on; for example, the following thinking chain prompt words can be used to judge semantic integrity:
"Example sentence 1: whether the face cream causes allergy;
Thinking chain prompt words: the key point of this sentence is that the customer asks whether the face cream causes allergy; the object is the face cream, and the attribute is whether it causes allergy, so the customer's query intention is completely expressed;
Conclusion: complete.
Example sentence 2: my skin is too dry;
Thinking chain prompt words: the key point of this sentence is that the customer states his or her own skin condition; the object is the customer, the part is the skin, and the attribute is dry, so the customer's query intention is completely expressed;
Conclusion: complete.
Example sentence 3: I dislike the feeling of wetness on my face;
Thinking chain prompt words: the key point of this sentence is that the customer states a personal preference; the object is the customer, the part is the face, the attribute is wet, and the emotion is dislike, so the customer's query intention is completely expressed;
Conclusion: complete.
Example sentence 4: what do you mean? Can it be used;
Thinking chain prompt words: the key point of this sentence is that the customer asks about the usability of a product, but the description of the product is unclear and the purpose of the query is unclear; the emotion is negative; the expression may be incomplete, or the customer may be asking about the function and efficacy of a product, so the user's query intention at this moment needs to be analyzed in combination with the context;
Conclusion: incomplete.
Example sentence 5: yours are no good;
Thinking chain prompt words: the key point of this sentence is that the customer gives a negative evaluation of the product, but the description of the product is unclear; the emotion is negative; the expression may be incomplete, or the customer may be asking about the function and efficacy of the product, so the user's query intention at this moment needs to be analyzed in combination with the context;
Conclusion: incomplete.
Combining the semantic integrity reasoning process and the final conclusion of each example sentence above, infer the semantic integrity of the following sentence in detail and give a conclusion, complete or incomplete: text data. "
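The assembly of such a few-shot thinking chain prompt can be sketched as follows; the prompt wording and function names here are illustrative, not the exact text used by the method:

```python
def build_cot_prompt(examples, text):
    """Builds a few-shot chain-of-thought prompt for semantic-integrity
    judgment.  `examples` are (sentence, reasoning, verdict) triples,
    with verdict 'complete' or 'incomplete'."""
    parts = []
    for i, (sent, reasoning, verdict) in enumerate(examples, 1):
        parts.append(f"Example sentence {i}: {sent}\n"
                     f"Chain-of-thought: {reasoning}\n"
                     f"Conclusion: {verdict}")
    # final instruction followed by the text data to be judged
    parts.append("Following the reasoning above, judge the semantic "
                 "integrity of this sentence and answer "
                 "'complete' or 'incomplete':\n" + text)
    return "\n\n".join(parts)
```

The resulting string would be sent to the large language model together with the text data to be classified.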
The beneficial technical effects of the invention include: the text classification method can quickly and accurately classify text data through the regular expression set generated by the large language model; meanwhile, incorrectly matched text classification labels can be filtered out according to the white and black regular expressions matched to the text data, further improving the classification accuracy.
While the invention has been described in terms of the embodiments above, it will be appreciated by those skilled in the art that the invention is not limited to the drawings and the description of those embodiments. Any modifications which do not depart from the functional and structural principles of the present invention are intended to be included within the scope of the appended claims.

Claims (9)

1. A text classification method of regular expressions generated based on a large language model is characterized by comprising the following steps:
s1: initializing a text classification method, defining text classification labels, generating a regular expression set comprising regular expressions of each of a plurality of text classification labels by adopting a large language model, setting the audited regular expressions as white regular expressions, and then generating corresponding black regular expressions by adopting the large language model according to the white regular expressions;
S2: acquiring text data to be classified;
s3: judging the semantic integrity of the text data by adopting a large language model, and filtering text data with incomplete semantics;
S4: matching the text data against a plurality of white regular expressions in the regular expression set, filtering out, according to a plurality of black regular expressions in the regular expression set, text classification labels that do not truly match the text data, and adding the matched text classification labels to the text data, using the following steps:
F1: traversing the white regular expressions of all text classification labels and matching them against the text data to be classified; going to step F2 when the text data has a matched white regular expression, and ending the text classification when it has none;
F2: traversing all black regular expressions of the text classification labels matched to the text data; when no black regular expression matches and the text data matches the white regular expressions of only one text classification label, marking the text data with that label; when no black regular expression matches but the text data matches white regular expressions of a plurality of text classification labels, going to step F3; when a matched black regular expression exists, going to step F4;
F3: generating a semantic vector representation for the text data with a preset sentence vector reasoning model, recalling matched annotation data in the semantic space according to that vector representation, selecting a number of annotation data carrying text classification labels, then taking the mode of those labels and setting it as the text classification label of the text data;
F4: after filtering out the text classification labels whose black regular expressions matched, going to step F3 when white regular expressions of a plurality of text classification labels still match; marking the text data with a text classification label when only the white regular expressions of that one label match; and ending the text classification when no matched text classification label remains.
2. The text classification method of regular expressions generated based on a large language model according to claim 1, further comprising the steps of obtaining labeling data with text classification labels and question-answer data without text classification labels, wherein each question-answer data comprises questions and answers, screening the question-answer data through a plurality of preset preprocessing rules, and filtering the question-answer data irrelevant to the text classification.
3. The method for classifying text according to claim 2, wherein in the step S1, when a large language model is used to generate a regular expression set including regular expressions of each of a plurality of text classification labels, the following steps are adopted:
A1: generating semantic vector representation for the annotation data by adopting a preset sentence vector reasoning model;
A2: recalling matched question-answer data in a semantic space in a preset vector index library according to the vector representation of each annotation data;
A3: inputting the recalled question-answer data into the large language model for secondary classification judgment, filtering out the question-answer data that does not match the semantics of the corresponding text classification label, marking the question-answer data that matches the semantics of the corresponding text classification label, and setting it as labeling data;
A4: classifying the labeling data belonging to the same text classification tag through a preset keyword word library, generating a syntax tree for the labeling data by adopting a syntax analysis tool, capturing the syntax information and semantic information of the labeling data through the syntax tree of each labeling data, and storing the syntax information and semantic information in one-to-one correspondence with the labeling data;
A5: and inputting a plurality of marking data belonging to the same keyword, respective syntax information of the plurality of marking data and a preset regular expression paradigm into the large language model, generating a plurality of regular expressions, and storing the plurality of regular expressions into a regular expression set of text classification labels corresponding to the keyword.
4. The method for classifying text based on regular expressions generated by a large language model according to claim 3, wherein in the step S1, when the audited regular expressions are set to be white regular expressions, and then corresponding black regular expressions are generated by using the large language model according to the white regular expressions, the following steps are adopted:
B1: the generated regular expressions are set to be white regular expressions after passing the auditing and are stored in a regular expression set of the text classification label;
B2: carrying out regular expression matching by adopting a white regular expression of a text classification label and question-answer data without the text classification label;
B3: and inputting the matched question-answer data into a large language model for secondary classification judgment, screening out question-answer data which is not matched with the semantics of the corresponding text classification label, acquiring the syntax information of the question-answer data and a preset regular expression normal form, inputting the syntax information and the preset regular expression normal form into the large language model, generating a plurality of black regular expressions, and storing the black regular expressions into a regular expression set of the text classification label.
5. A method for text classification of regular expressions generated based on a large language model as claimed in claim 3, wherein the training of the sentence vector inference model comprises the steps of:
c1: setting questions with the same answer or the same kind of answer in question and answer data as positive samples, and setting questions with different answers or different kinds of questions in question and answer data as negative samples;
C2: and combining a plurality of questions corresponding to the same answer into positive sample pairs in pairs, and training to obtain a sentence vector reasoning model by adopting the positive sample pairs and the negative sample fine tuning base model through a comparison learning method.
6. A method for classifying text based on regular expressions generated by a large language model as claimed in claim 3, wherein when the vector index library is built, the following steps are adopted:
D1: generating semantic vector representation for the questions in the question-answering data by adopting a sentence vector reasoning model;
d2: a hash algorithm is adopted to respectively generate a corresponding identifier for the questions in each question-answer data;
D3: vector representations of a number of questions are stored as a vector index library in one-to-one correspondence with corresponding identifiers.
7. The text classification method of regular expressions generated based on a large language model as claimed in claim 3, wherein when the keyword lexicon is built, the following steps are adopted:
e1: word segmentation is carried out on the labeling data and the question-answer data by adopting a word segmentation tool, a plurality of words are obtained and stored as a data set;
E2: analyzing the data set by adopting a TF-IDF algorithm, and assigning values to a plurality of words;
e3: setting vocabulary with weight value larger than a preset keyword threshold as keywords;
e4: training the data set by adopting a word embedding technology to obtain vector representation of each vocabulary;
E5: expanding the keywords, screening semantically matched words based on similarity of vector representation, and storing the words as a keyword word stock.
8. The text classification method of regular expressions generated based on large language models according to claim 1, wherein in the step S2, after obtaining text data to be classified, further comprising a step of data preprocessing, filtering the text data to be classified by a plurality of preset preprocessing rules, and filtering text data irrelevant to the current text classification.
9. The text classification method of regular expressions generated based on a large language model according to claim 1, wherein in the step S3, when the large language model is used to determine the semantic integrity of text data, a preset small sample learning corpus and preset thinking chain prompt words are input to the large language model, and the large language model uses the thinking chain prompt words to determine the semantic integrity of the text data to be classified.
CN202410034646.5A 2024-01-10 2024-01-10 Text classification method of regular expression generated based on large language model Active CN117556049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410034646.5A CN117556049B (en) 2024-01-10 2024-01-10 Text classification method of regular expression generated based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410034646.5A CN117556049B (en) 2024-01-10 2024-01-10 Text classification method of regular expression generated based on large language model

Publications (2)

Publication Number Publication Date
CN117556049A CN117556049A (en) 2024-02-13
CN117556049B true CN117556049B (en) 2024-05-17

Family

ID=89820826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410034646.5A Active CN117556049B (en) 2024-01-10 2024-01-10 Text classification method of regular expression generated based on large language model

Country Status (1)

Country Link
CN (1) CN117556049B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217717A (en) * 2013-05-29 2014-12-17 腾讯科技(深圳)有限公司 Language model constructing method and device
CN106446230A (en) * 2016-10-08 2017-02-22 国云科技股份有限公司 Method for optimizing word classification in machine learning text
CN108182234A (en) * 2017-12-27 2018-06-19 中科鼎富(北京)科技发展有限公司 Regular expression screening technique and device
CN112364660A (en) * 2020-10-27 2021-02-12 中国平安人寿保险股份有限公司 Corpus text processing method and device, computer equipment and storage medium
CN113111234A (en) * 2020-02-13 2021-07-13 北京明亿科技有限公司 Regular expression-based alarm condition category determination method and device
CN113761903A (en) * 2020-06-05 2021-12-07 国家计算机网络与信息安全管理中心 Text screening method for high-volume high-noise spoken short text
CN114595332A (en) * 2022-03-30 2022-06-07 阳光保险集团股份有限公司 Text classification prediction method and device and electronic equipment
CN114818891A (en) * 2022-04-14 2022-07-29 人民网股份有限公司 Small sample multi-label text classification model training method and text classification method
CN116561311A (en) * 2023-04-21 2023-08-08 武汉大学 Automatic classification method for quotation text based on large language model
US11748577B1 (en) * 2022-08-22 2023-09-05 Rohirrim, Inc. Computer-generated content based on text classification, semantic relevance, and activation of deep learning large language models

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017039603A1 (en) * 2015-08-31 2017-03-09 Hewlett Packard Enterprise Development Lp Domain classification
CN112487182B (en) * 2019-09-12 2024-04-12 华为技术有限公司 Training method of text processing model, text processing method and device
US20230419037A1 (en) * 2022-06-24 2023-12-28 Salesforce, Inc. Systems and methods for text classification using label modular prompts


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Enabling Digital Transformation through Business Text Classification with Small Datasets;Muhammad Arslan etc.;《 2023 15th International Conference on Innovations in Information Technology (IIT)》;20231225;第38-42页 *
Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs;Pena, A etc.;《 arXiv》;20230901;第1-12页 *
基于医学大数据的预训练语言模型及其医学文本分类研究;黄敏婷 等;《中华医学图书情报杂志》;20201231;第39-46页 *
大规模短文本的分类过滤方法研究;吴薇;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20070515;第I138-1565页 *

Also Published As

Publication number Publication date
CN117556049A (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN110895559A (en) Model training method, text processing method, device and equipment
CN110717045A (en) Letter element automatic extraction method based on letter overview
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN111143531A (en) Question-answer pair construction method, system, device and computer readable storage medium
CN113312922B (en) Improved chapter-level triple information extraction method
CN112115252A (en) Intelligent auxiliary writing processing method and device, electronic equipment and storage medium
CN114997181A (en) Intelligent question-answering method and system based on user feedback correction
CN112347339A (en) Search result processing method and device
CN117251524A (en) Short text classification method based on multi-strategy fusion
CN112101014A (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN111460147A (en) Title short text classification method based on semantic enhancement
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN116049376B (en) Method, device and system for retrieving and replying information and creating knowledge
CN117556049B (en) Text classification method of regular expression generated based on large language model
CN116244277A (en) NLP (non-linear point) identification and knowledge base construction method and system
CN114117069A (en) Semantic understanding method and system for intelligent knowledge graph question answering
CN111027308A (en) Text generation method, system, mobile terminal and storage medium
CN110852104B (en) Family tree identification method and device, storage medium and processor
CN116227496B (en) Deep learning-based electric public opinion entity relation extraction method and system
CN117453895B (en) Intelligent customer service response method, device, equipment and readable storage medium
CN117609419A (en) Domain retrieval method based on meta learning and knowledge enhancement
CN118113806A (en) Interpretable event context generation method for large model retrieval enhancement generation
CN118036725A (en) Atlas generation method and system based on text big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant