CN113536768A - Method and equipment for establishing text extraction model based on regular expression - Google Patents
Method and equipment for establishing text extraction model based on regular expression
- Publication number
- CN113536768A (application CN202110797247.0A)
- Authority
- CN
- China
- Prior art keywords
- text extraction
- extraction model
- model
- corpus
- regular expression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a method for establishing a text extraction model based on regular expressions, comprising the following steps: S1, writing a plurality of regular expressions; S2, extracting a corpus set from the corpus according to the regular expressions; S3, dividing the corpus set into a training set and a verification set; S4, constructing a text extraction model; S5, inputting the training set into the text extraction model and training it; and S6, inputting the verification set into the trained text extraction model and verifying it.
Description
Technical Field
The invention relates to a method and equipment for establishing a text extraction model based on a regular expression, belonging to the field of natural language processing.
Background
A regular expression describes a string rule and is commonly used to retrieve and replace text that conforms to that rule. For example, the regular expression for extracting e-mail addresses is: […]w) + (\\ w +). This expression identifies e-mail addresses in the format xxxx@xxxx.xxx. Regular expressions are flexible and can match characters in almost any pattern. However, using them presupposes that the "pattern" or "rule" of the information to be extracted is well defined. They are therefore not suitable for extracting key information from text that has no explicit rules.
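As a minimal sketch of rule-based extraction of this kind, the following Python snippet pulls e-mail addresses of the form xxxx@xxxx.xxx out of a sentence. The pattern `EMAIL_RE`, the helper `extract_emails` and the sample text are illustrative assumptions; they are not the expression given in the original text.

```python
import re

# Illustrative pattern (an assumption, not the expression from the source):
# a local part, "@", one or more dot-terminated domain labels, a final label.
EMAIL_RE = re.compile(r"[\w.+-]+@(?:[\w-]+\.)+[\w-]+")


def extract_emails(text: str) -> list:
    """Return every substring of `text` that matches EMAIL_RE."""
    return EMAIL_RE.findall(text)


sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
emails = extract_emails(sample)
print(emails)
```

The rule works only because the shape of an e-mail address is known in advance, which is exactly the precondition described above.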
When a supervised text extraction model is built, iterative training takes a large amount of time, the training data largely determines the model's performance, and a large amount of training data must be labeled manually.
A conditional random field (CRF) model is a supervised text extraction model commonly used for labeling words in a corpus (for example, labeling named entities, or parts of speech such as verbs and nouns). The CRF model is good at extracting key information that has no obvious pattern (where specific rules are hard to identify manually). However, its accuracy is not determined by the model itself; it depends mainly on whether the labeled corpus used for training is consistent with the target test corpus. A large manually labeled corpus must be prepared in advance, the extraction effect is unstable, and the accuracy is hard to predict, so the CRF model is not suitable for scenarios with strict requirements on extraction accuracy.
The patent with publication No. CN201910455064.3, "Keyword corpus annotation training extraction tool", discloses an annotation training tool that reduces the complexity of manual annotation and improves the efficiency and accuracy of annotating massive keyword corpora. In that tool, a keyword corpus annotation preparation module distinguishes massive corpus data from different sources. A semi-automatic corpus keyword annotation module creates a keyword annotation task, autonomously selects a suitable algorithm, and performs automatic annotation based on an algorithm model: the text corpus to be annotated is pre-annotated by integrating at least one of the keyword extraction algorithms CHI, LDA, TEXTRANK and TFIDF, and the annotation results of the multiple algorithms are fused. After the annotation task is completed, a feedback-type keyword annotation model learning and training module trains the keyword annotation algorithm model, and a keyword annotation model effect evaluation module automatically and quantitatively evaluates the model's annotation effect.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a method for establishing a text extraction model based on regular expressions. By writing a small number of regular expressions in place of manual labeling, it effectively reduces the labor cost and time required to establish a supervised text extraction model.
The technical scheme of the invention is as follows:
The first technical scheme is as follows:
A method for establishing a text extraction model based on regular expressions comprises the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to the regular expressions;
S3, dividing the corpus set into a training set and a verification set;
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model, and training the text extraction model;
and S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.
Further, the text extraction model is a CRF model.
Further, a threshold is set in step S6; if the accuracy of the verified model is lower than the threshold, the process returns to step S1.
The second technical scheme is as follows:
A regular expression based text extraction model building apparatus comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to the regular expressions;
S3, dividing the corpus set into a training set and a verification set;
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model, and training the text extraction model;
and S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.
Further, the text extraction model is a CRF model.
Further, a threshold is set in step S6; if the accuracy of the verified model is lower than the threshold, the process returns to step S1.
The invention has the following beneficial effects:
1. By writing a small number of regular expressions in place of manual labeling, the invention effectively reduces the labor cost and time required to establish the model.
2. The invention combines the advantages of regular expressions and the CRF model, and can extract key information from text efficiently and accurately. Specifically:
Because of the characteristics of regular expressions, the method works well on text with a fixed template, for example in the auditing field and the patent field. Meanwhile, the text extraction model, which performs the final extraction, is not limited by whether the information to be extracted follows a strict template; its extraction range is far wider than that of a method based on regular expressions alone, so the method is applicable to many fields.
3. When a small number of regular expressions are added and steps S1-S6 are repeated to retrain the CRF model, the extraction effect of the CRF model is effectively improved without discarding the rules written earlier.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flowchart of Example four.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Example one
Referring to fig. 1, a method for establishing a text extraction model based on regular expressions includes the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to the regular expressions;
S3, dividing the corpus set into a training set (80%) and a verification set (20%);
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model, and training the text extraction model;
and S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.
The beneficial effect of this embodiment is that writing a small number of regular expressions replaces manual labeling, which effectively reduces the labor cost and time required to establish the model.
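The 80%/20% division in step S3 can be sketched as follows. This is only an illustration: the function name `split_corpus`, the shuffling before the cut and the fixed seed are assumptions, since the text specifies the proportions but not how samples are assigned.

```python
import random


def split_corpus(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into (training, verification)."""
    shuffled = list(samples)
    # Shuffling with a fixed seed is an assumption made for reproducibility.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical corpus set of ten extracted samples.
corpus_set = [f"sample {i}" for i in range(10)]
train_set, verify_set = split_corpus(corpus_set)
print(len(train_set), len(verify_set))
```

Every sample ends up in exactly one of the two sets, so the verification set is never seen during training.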
Example two
Further, the text extraction model is a CRF model.
In this embodiment, the CRF model is constructed using the open-source "python-creating" development kit.
The advantage of this embodiment is that it combines the strengths of regular expressions and the CRF model, so that key information in text can be extracted efficiently and accurately. Specifically:
Because of the characteristics of regular expressions, the method works well on text with a fixed template, for example in the auditing field and the patent field. Meanwhile, the text extraction model, which performs the final extraction, is not limited by whether the information to be extracted follows a strict template; its extraction range is far wider than that of a method based on regular expressions alone, so the method is applicable to many fields.
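CRF toolkits often take, for each token, a dictionary of features describing the token and its neighbours. The sketch below shows one plausible way to prepare such input in plain Python. The feature names and the functions `token_features` and `sentence_features` are assumptions for illustration; the text does not describe the features actually used, and no toolkit API is called here.

```python
def token_features(tokens, i):
    """Build a feature dictionary for tokens[i] from the token and its neighbours."""
    token = tokens[i]
    features = {
        "token": token.lower(),
        "is_digit": token.isdigit(),
        "is_title": token.istitle(),
    }
    if i > 0:
        features["prev_token"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        features["next_token"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sequence
    return features


def sentence_features(tokens):
    """Return one feature dictionary per token."""
    return [token_features(tokens, i) for i in range(len(tokens))]


tokens = "payment address : 12 Main Street".split()
feats = sentence_features(tokens)
print(feats[3])
```

Because each token's features include its neighbours, a sequence model trained on them can use context rather than a fixed template.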
Example three
Further, a threshold is set for the CRF model (90% in this embodiment); if the model accuracy is lower than 90%, the process jumps back to step S1.
The improvement of this embodiment is that a small number of regular expressions are added and steps S1 to S6 are repeated to retrain the CRF model. This effectively improves the extraction effect of the CRF model without discarding the rules written earlier.
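The retraining loop can be sketched as below. The function `build_until_accurate`, the batching of rules and the stand-in `fake_train_and_verify` (which returns made-up accuracies) are hypothetical; a real run would replace the stand-in with steps S2 to S6.

```python
from typing import Callable, List


def build_until_accurate(
    rule_batches: List[List[str]],
    train_and_verify: Callable[[List[str]], float],
    threshold: float = 0.90,
):
    """Add one batch of rules per round until the accuracy reaches `threshold`.

    Returns (rules, accuracy, rounds). Rules from earlier rounds are kept;
    each round only appends new ones.
    """
    rules: List[str] = []
    accuracy = 0.0
    rounds = 0
    for batch in rule_batches:
        rules.extend(batch)                 # S1: write additional rules
        accuracy = train_and_verify(rules)  # S2-S6: extract, split, train, verify
        rounds += 1
        if accuracy >= threshold:
            break
    return rules, accuracy, rounds


# Hypothetical stand-in for S2-S6: made-up accuracy per number of rules.
FAKE_ACCURACY = {1: 0.85, 2: 0.93, 3: 0.97}


def fake_train_and_verify(rules):
    return FAKE_ACCURACY[len(rules)]


rules, accuracy, rounds = build_until_accurate(
    [["rule_a"], ["rule_b"], ["rule_c"]], fake_train_and_verify
)
print(rules, accuracy, rounds)
```

With these made-up numbers the loop stops after the second round, and the first rule is still part of the final rule set.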
Example four
As shown in fig. 2, an enterprise bidding specification is taken as an example.
The extraction target is set as: on-site payment address.
According to the extraction target, a regular expression is written: site pay-cost address (#. This regular expression matches text in the corpus that follows the same "pattern", i.e., text of the form "the on-site payment address is xxxxx".
Regular expression extraction is then executed to extract a corpus set from the corpus. The corpus set comprises the matching texts and the key field information in each matching text. The corpus set is divided into a training set (80%) and a verification set (20%), and a text extraction model is constructed.
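The extraction step can be illustrated with the sketch below, which records each matching text together with its key field. The English pattern in `RULES`, the record layout and the sample sentences are assumptions standing in for the rule of this example, whose exact form is not reproduced here.

```python
import re

# Hypothetical English stand-in for the rule in this example; the named
# group "field" captures the key field (the address).
RULES = [
    re.compile(
        r"on-site payment address\s*(?:is|:)\s*(?P<field>[^.;\n]+)",
        re.IGNORECASE,
    )
]


def extract_corpus_set(corpus, rules):
    """Collect one record per rule match: matching text, key field, field span."""
    records = []
    for doc in corpus:
        for rule in rules:
            for m in rule.finditer(doc):
                records.append({
                    "text": doc,                        # the matching text
                    "field": m.group("field").strip(),  # key field information
                    "span": m.span("field"),            # character offsets in doc
                })
    return records


corpus = [
    "Bidders pay in person. The on-site payment address is Room 301, 8 Harbor Road. Bring a receipt.",
    "The bid address: Room 402, 9 River Street.",
]
records = extract_corpus_set(corpus, RULES)
print(records[0]["field"])
```

Only the first sentence is captured; the second one has a different wording, so the rule alone misses it.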
The key field information in the training set, together with the 30 words before and the 30 words after it (taken from the matching text), is input into the CRF model for training. The trained model is then verified with the verification set.
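One way to build such a training sample is sketched below: keep up to 30 tokens on each side of the key field and label each token. The BIO label names, the token-level window and the function `window_with_labels` are assumptions; the text gives only the window size of 30.

```python
def window_with_labels(tokens, start, end, width=30):
    """Return (window_tokens, labels) for the key field tokens[start:end].

    Keeps up to `width` tokens on each side of the field. Labels follow the
    BIO scheme, which is an assumption; the source names no labeling scheme.
    """
    lo = max(0, start - width)
    hi = min(len(tokens), end + width)
    window = tokens[lo:hi]
    labels = []
    for i in range(lo, hi):
        if i == start:
            labels.append("B-FIELD")   # first token of the key field
        elif start < i < end:
            labels.append("I-FIELD")   # remaining tokens of the key field
        else:
            labels.append("O")         # context token
    return window, labels


# Hypothetical 100-token text whose key field occupies tokens 50..52.
tokens = ["w%d" % i for i in range(100)]
window, labels = window_with_labels(tokens, 50, 53)
print(len(window), labels[30:33])
```

Near the start or end of a text the window is simply shorter, because the slice is clamped to the available tokens.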
The resulting CRF model can extract not only text such as "on-site payment address xxxxx" but also sentences such as "bid address: xxxxxx" that do not conform to the written regular expression. This is because the CRF algorithm makes its judgment from context information (the 30 words before and after the key field in the input serve as context), which makes up for the deficiency of the regular expression.
Example five
A regular expression based text extraction model building apparatus comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps (see fig. 1):
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to the regular expressions;
S3, dividing the corpus set into a training set (80%) and a verification set (20%);
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model, and training the text extraction model;
and S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.
The beneficial effect of this embodiment is that writing a small number of regular expressions replaces manual labeling, which effectively reduces the labor cost and time required to establish the model.
Example six
Further, the text extraction model is a CRF model.
In this embodiment, the CRF model is constructed using the open-source "python-creating" development kit.
The advantage of this embodiment is that it combines the strengths of regular expressions and the CRF model, so that key information in text can be extracted efficiently and accurately. Specifically:
Because of the characteristics of regular expressions, the method works well on text with a fixed template, for example in the auditing field and the patent field. Meanwhile, the text extraction model, which performs the final extraction, is not limited by whether the information to be extracted follows a strict template; its extraction range is far wider than that of a method based on regular expressions alone, so the method is applicable to many fields.
Example seven
Further, a threshold is set for the CRF model (90% in this embodiment); if the model accuracy is lower than 90%, the process jumps back to step S1.
The improvement of this embodiment is that a small number of regular expressions are added and steps S1 to S6 are repeated to retrain the CRF model. This effectively improves the extraction effect of the CRF model without discarding the rules written earlier.
Example eight
As shown in fig. 2, an enterprise bidding specification is taken as an example.
The extraction target is set as: on-site payment address.
According to the extraction target, a regular expression is written: site pay-cost address (#. This regular expression matches text in the corpus that follows the same "pattern", i.e., text of the form "the on-site payment address is xxxxx".
Regular expression extraction is then executed to extract a corpus set from the corpus. The corpus set comprises the matching texts and the key field information in each matching text. The corpus set is divided into a training set (80%) and a verification set (20%), and a text extraction model is constructed.
The key field information in the training set, together with the 30 words before and the 30 words after it (taken from the matching text), is input into the CRF model for training. The trained model is then verified with the verification set.
The resulting CRF model can extract not only text such as "on-site payment address xxxxx" but also sentences such as "bid address: xxxxxx" that do not conform to the written regular expression. This is because the CRF algorithm makes its judgment from context information (the 30 words before and after the key field in the input serve as context), which makes up for the deficiency of the regular expression.
The above is only an embodiment of the invention and does not limit its scope. All equivalent structures or equivalent processes derived from this specification and the drawings, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the invention.
Claims (4)
1. A method for establishing a text extraction model based on regular expressions, characterized by comprising the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to each regular expression;
S3, dividing the corpus set into a training set and a verification set;
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model, and training the text extraction model;
and S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.
2. The method of claim 1, wherein the text extraction model is a CRF model.
3. The method for establishing a text extraction model based on regular expressions according to claim 2, wherein a threshold is further set in step S6; if the accuracy of the verified model is lower than the threshold, the process returns to step S1.
4. An apparatus for building a text extraction model based on regular expressions, comprising a memory and a processor, wherein the memory stores instructions adapted to be loaded by the processor and to perform a method for building a text extraction model based on regular expressions according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110797247.0A CN113536768A (en) | 2021-07-14 | 2021-07-14 | Method and equipment for establishing text extraction model based on regular expression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110797247.0A CN113536768A (en) | 2021-07-14 | 2021-07-14 | Method and equipment for establishing text extraction model based on regular expression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113536768A (en) | 2021-10-22 |
Family
ID=78099155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110797247.0A Pending CN113536768A (en) | 2021-07-14 | 2021-07-14 | Method and equipment for establishing text extraction model based on regular expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113536768A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107977345A (en) * | 2017-11-14 | 2018-05-01 | 福建亿榕信息技术有限公司 | A kind of generic text information abstracting method and system |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN111444320B (en) | Text retrieval method and device, computer equipment and storage medium | |
CN106649783B (en) | Synonym mining method and device | |
EP3508992A1 (en) | Error correction method and device for search term | |
US20210319051A1 (en) | Conversation oriented machine-user interaction | |
CN110795938B (en) | Text sequence word segmentation method, device and storage medium | |
CN111159415B (en) | Sequence labeling method and system, and event element extraction method and system | |
CN110765759B (en) | Intention recognition method and device | |
CN111159414B (en) | Text classification method and system, electronic equipment and computer readable storage medium | |
CN107608951B (en) | Report generation method and system | |
US20190317986A1 (en) | Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method | |
CN111738002A (en) | Ancient text field named entity identification method and system based on Lattice LSTM | |
CN104866472A (en) | Generation method and device of word segmentation training set | |
CN107977345A (en) | A kind of generic text information abstracting method and system | |
CN117149984B (en) | Customization training method and device based on large model thinking chain | |
CN112307048A (en) | Semantic matching model training method, matching device, equipment and storage medium | |
CN115879450B (en) | Gradual text generation method, system, computer equipment and storage medium | |
CN113806489A (en) | Method, electronic device and computer program product for dataset creation | |
CN111178018B (en) | Deep learning-based target soft text generation method and device | |
CN110442858B (en) | Question entity identification method and device, computer equipment and storage medium | |
CN116186223A (en) | Financial text processing method, device, equipment and storage medium | |
CN113536768A (en) | Method and equipment for establishing text extraction model based on regular expression | |
CN113486169B (en) | Synonymous statement generation method, device, equipment and storage medium based on BERT model | |
CN116796796A (en) | GPT architecture-based automatic document generation method and device | |
CN115658885A (en) | Intelligent text labeling method and system, intelligent terminal and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||