CN113536768A - Method and equipment for establishing text extraction model based on regular expression - Google Patents

Info

Publication number
CN113536768A
Authority
CN
China
Prior art keywords
text extraction
extraction model
model
corpus
regular expression
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110797247.0A
Other languages
Chinese (zh)
Inventor
苏江文
王燕蓉
陈江海
张垚
庄莉
梁懿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-07-14
Filing date
2021-07-14
Publication date
2021-10-22
Application filed by State Grid Information and Telecommunication Co Ltd, Fujian Yirong Information Technology Co Ltd, Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for establishing a text extraction model based on regular expressions, which comprises the following steps: S1, writing a plurality of regular expressions; S2, extracting a corpus set from the corpus according to the regular expressions; S3, dividing the corpus set into a training set and a verification set; S4, constructing a text extraction model; S5, inputting the training set into the text extraction model and training the text extraction model; and S6, inputting the verification set into the trained text extraction model and verifying the trained text extraction model.

Description

Method and equipment for establishing text extraction model based on regular expression
Technical Field
The invention relates to a method and equipment for establishing a text extraction model based on a regular expression, belonging to the field of natural language processing.
Background
A regular expression is a description of a string rule and is commonly used to retrieve and replace text that conforms to that rule. For example, the regular expression for extracting e-mail addresses, which identifies addresses in the format xxxx@xxxx.xxx, is: …w) + (\\ w +). Regular expressions are flexible and can match characters in almost any pattern. However, using a regular expression presupposes that the "pattern" or "rule" of the information to be extracted is well defined. Regular expressions are therefore not applicable to extracting key information from text that has no explicit rules.
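As a minimal illustration of rule-based retrieval, the sketch below pulls e-mail addresses out of a string with Python's `re` module. The pattern is a simplified assumption chosen for the example, not the expression given in this text, and the sample sentence is hypothetical.

```python
import re

# Simplified, assumed e-mail pattern: local part, "@", then dot-separated domain labels.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@\w+(?:\.\w+)+")


def extract_emails(text):
    """Return every substring of `text` that matches EMAIL_PATTERN."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical sample text.
sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
emails = extract_emails(sample)
```

The pattern only works because the shape of an e-mail address is known in advance, which is the precondition the paragraph above describes.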
When a supervised text extraction model is established, iterative training takes a large amount of time, the training data used determines the performance of the model to a certain extent, and a large amount of training data must be labeled manually.
A conditional random field (CRF) model is one kind of supervised text extraction model and is commonly used for labeling words in a corpus (for example, labeling named entities, or parts of speech such as verbs and nouns). The CRF model has strong extraction capability for key information that has no obvious pattern (that is, where specific rules are difficult to observe manually). However, the accuracy of the CRF model is not determined by the model itself; it mainly depends on whether the labeled corpus used for training is consistent with the target test corpus. A large manually labeled corpus must be prepared in advance, the extraction effect is unstable, and the accuracy is difficult to predict, so the CRF model is not suitable for scenarios with strict requirements on extraction accuracy.
Patent publication No. CN201910455064.3, "Keyword corpus annotation training extraction tool", discloses an annotation training tool that reduces the complexity of the manual annotation process and improves the efficiency and accuracy of annotating massive keyword corpora. In that tool, a keyword corpus labeling preparation module distinguishes massive corpus data from different sources. A semi-automatic corpus keyword labeling module creates a keyword labeling task, autonomously selects a suitable algorithm, and performs automatic labeling based on an algorithm model: the text corpus data to be labeled is pre-labeled by integrating at least one of the keyword extraction algorithms CHI, LDA, TEXTRANK and TFIDF, and the labeling results of the multiple algorithms are fused. A feedback-type keyword labeling model learning and training module trains the keyword labeling algorithm model after the labeling task is completed, and a keyword labeling model effect evaluation module automatically evaluates the model's labeling effect with quantitative indices.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for establishing a text extraction model based on regular expressions. By writing a small number of regular expressions to replace manual labeling, the method effectively reduces the labor cost and time required to establish a supervised text extraction model.
The technical scheme of the invention is as follows:
The first technical scheme is as follows:
A method for establishing a text extraction model based on regular expressions comprises the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to the regular expressions;
S3, dividing the corpus set into a training set and a verification set;
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model and training the text extraction model;
S6, inputting the verification set into the trained text extraction model and verifying the trained text extraction model.
Further, the text extraction model is a CRF model.
Further, a threshold is set in step S6; if the accuracy of the trained model on the verification set is lower than the threshold, the method returns to step S1.
The second technical scheme is as follows:
A regular expression based text extraction model building apparatus comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to the regular expressions;
S3, dividing the corpus set into a training set and a verification set;
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model and training the text extraction model;
S6, inputting the verification set into the trained text extraction model and verifying the trained text extraction model.
Further, the text extraction model is a CRF model.
Further, a threshold is set in step S6; if the accuracy of the trained model on the verification set is lower than the threshold, the method returns to step S1.
The invention has the following beneficial effects:
1. According to the invention, manual labeling is replaced by writing a small number of regular expressions, so that the labor cost and time required for establishing the model are effectively reduced.
2. The invention combines the advantages of the regular expression and the CRF model and can efficiently and accurately extract the key information in text, specifically:
Owing to the characteristics of regular expressions, the method works particularly well on text with a fixed template, such as text in the auditing field and the patent field. Meanwhile, the text extraction model serves as the executor that extracts the final text information, so it is not limited by whether the information to be extracted follows a strict template; its extraction range is far wider than that of a method based purely on regular expressions, and the method is therefore applicable to various fields.
3. When a small number of regular expressions are added and steps S1 to S6 are repeated to retrain the CRF model, the extraction effect of the CRF model is effectively improved, and the rules written earlier need not be discarded.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flowchart of a fourth embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Example one
Referring to fig. 1, a method for building a text extraction model based on regular expressions includes the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to the regular expressions;
S3, dividing the corpus set into a training set (80%) and a verification set (20%);
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model and training the text extraction model;
S6, inputting the verification set into the trained text extraction model and verifying the trained text extraction model.
The beneficial effect of this embodiment is that manual labeling is replaced by writing a small number of regular expressions, which effectively reduces the labor cost and time required to establish the model.
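The 80%/20% division in step S3 can be sketched as below. The shuffle, the fixed seed, and the names `split_corpus` and `corpus_set` are assumptions made for illustration; the text does not say how the samples are assigned to each set.

```python
import random


def split_corpus(corpus_set, train_fraction=0.8, seed=0):
    """Shuffle a copy of corpus_set and split it into (training, verification)."""
    items = list(corpus_set)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Hypothetical corpus set of ten extracted samples.
corpus_set = [f"sample {i}" for i in range(10)]
training_set, verification_set = split_corpus(corpus_set)
```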
Example two
Further, the text extraction model is a CRF model.
In this example, the CRF model is constructed using the open-source "python-crfsuite" development kit.
The advantage of this embodiment is that it combines the strengths of the regular expression and the CRF model and can efficiently and accurately extract the key information in text, specifically:
Owing to the characteristics of regular expressions, the method works particularly well on text with a fixed template, such as text in the auditing field and the patent field. Meanwhile, the text extraction model serves as the executor that extracts the final text information, so it is not limited by whether the information to be extracted follows a strict template; its extraction range is far wider than that of a method based purely on regular expressions, and the method is therefore applicable to various fields.
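A rough sketch of how a CRF might be trained in Python follows. It assumes the development kit meant in this example is the `python-crfsuite` package (imported as `pycrfsuite`); the feature set, the hyperparameter values, and the model file name are illustrative assumptions and do not come from this text.

```python
def token_features(tokens, i):
    """Features for token i: the token itself plus its immediate neighbours."""
    feats = {"bias": 1.0, "token": tokens[i]}
    feats["prev"] = tokens[i - 1] if i > 0 else "<BOS>"
    feats["next"] = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    return feats


def sentence_features(tokens):
    """One feature dict per token, the item format accepted by pycrfsuite."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def train_crf(sentences, label_sequences, model_path="extractor.crfsuite"):
    """Train a CRF on tokenised sentences and their per-token labels.

    Requires the third-party python-crfsuite package (an assumption here).
    """
    import pycrfsuite  # imported lazily so the feature helpers work without it

    trainer = pycrfsuite.Trainer(verbose=False)
    for tokens, labels in zip(sentences, label_sequences):
        trainer.append(sentence_features(tokens), labels)
    # Assumed regularisation and iteration settings, not values from this text.
    trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 50})
    trainer.train(model_path)
    return model_path
```

A trained model file could then be loaded with `pycrfsuite.Tagger()`, `tagger.open(model_path)` and applied with `tagger.tag(sentence_features(tokens))`.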
Example three
Further, a threshold is set for the CRF model (in this embodiment, 90%); if the model accuracy is lower than 90%, the method returns to step S1.
The improvement of this embodiment is that a small number of regular expressions are added and steps S1 to S6 are repeated to retrain the CRF model. The extraction effect of the CRF model is thereby effectively improved, and the rules written earlier need not be discarded.
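The retraining cycle can be sketched as a loop that keeps every earlier rule, adds a new batch, and stops once the accuracy reaches the threshold. The callbacks `extract`, `train` and `evaluate`, and the demo stand-ins for them, are hypothetical placeholders for steps S2 to S6 rather than anything specified in this text.

```python
def build_until_accurate(rule_batches, extract, train, evaluate, threshold=0.9):
    """Add rule batches one at a time, retraining until accuracy >= threshold."""
    rules = []
    model, accuracy = None, 0.0
    for batch in rule_batches:
        rules.extend(batch)          # S1: earlier rules are kept, new ones appended
        corpus_set = extract(rules)  # S2: rebuild the corpus set from all rules
        model = train(corpus_set)    # S3 to S5: split, construct and train
        accuracy = evaluate(model)   # S6: accuracy on the verification set
        if accuracy >= threshold:
            break
    return model, rules, accuracy


# Hypothetical stand-ins so the loop can be run end to end.
def demo_extract(rules):
    return [f"text matched by {rule}" for rule in rules]


def demo_train(corpus_set):
    return {"n_samples": len(corpus_set)}


def demo_evaluate(model):
    return min(1.0, 0.5 * model["n_samples"])


model, rules, accuracy = build_until_accurate(
    [["rule_a"], ["rule_b"], ["rule_c"]], demo_extract, demo_train, demo_evaluate
)
```

With these stand-ins the first pass falls below the 90% threshold, so a second batch of rules is added before the loop stops; the third batch is never needed.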
Example four
As shown in fig. 2, an enterprise bidding specification is taken as an example.
The extraction target is set to: on-site payment address.
According to the extraction target, a regular expression is written: site pay-cost address (#. This regular expression matches text in the corpus that follows the same "pattern", i.e. text of the form "on-site payment address xxxxx".
The regular expression is executed to extract a corpus set from the corpus. The corpus set comprises the matching texts and the key field information in each matching text. The corpus set is divided into a training set (80%) and a verification set (20%), and a text extraction model is constructed.
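The extraction step can be illustrated with a capture group that isolates the key field inside each matching text. The English pattern `RULE`, the record layout, and the sample documents are assumptions standing in for the rule and corpus of this example.

```python
import re

# Assumed English stand-in for the rule; the capture group is the key field.
RULE = re.compile(r"on-site payment address:\s*([^.;\n]+)")


def extract_corpus_set(documents, rules):
    """Return one {"text", "key_field"} record for every rule match."""
    corpus_set = []
    for doc in documents:
        for rule in rules:
            for match in rule.finditer(doc):
                corpus_set.append({
                    "text": match.group(0),                # the matching text
                    "key_field": match.group(1).strip(),   # the key field in it
                })
    return corpus_set


# Hypothetical documents; only the first follows the pattern of the rule.
docs = [
    "Bidders pay in person. The on-site payment address: 12 Example Road, Fuzhou. Bring ID.",
    "The bid address: 8 Sample Street.",
]
corpus_set = extract_corpus_set(docs, [RULE])
```

The second document is skipped because it does not follow the pattern, which is the gap the trained model is meant to close.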
The key field information in the training set, together with the 30 words before and after it (taken from the matching text), is input into the CRF model for training. The trained model is then verified with the verification set.
The resulting CRF model can extract not only text of the form "on-site payment address xxxxx" but also sentences such as "bid address: xxxxxx" that do not conform to the written regular expression. This is because the CRF algorithm can make judgments from context information (the 30 words before and after the input serve as context), which makes up for the deficiency of the regular expression.
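The training input described in this example, the key field plus up to 30 words on either side, might be assembled as follows. The B/I/O label scheme, the `KEY` tag name, and whitespace tokenisation are assumptions; the text does not state how labels are encoded.

```python
def window_with_labels(tokens, start, end, width=30):
    """Keep `width` tokens on each side of tokens[start:end] and label them.

    Tokens inside the key field get "B-KEY" or "I-KEY"; context tokens get "O".
    """
    lo = max(0, start - width)
    hi = min(len(tokens), end + width)
    window = tokens[lo:hi]
    labels = []
    for i in range(lo, hi):
        if i == start:
            labels.append("B-KEY")
        elif start < i < end:
            labels.append("I-KEY")
        else:
            labels.append("O")
    return window, labels


# Hypothetical sentence; a width of 2 keeps the demonstration short.
tokens = "the on-site payment address : 12 Example Road , Fuzhou".split()
window, labels = window_with_labels(tokens, start=5, end=10, width=2)
```

Because the labels cover the surrounding words as well as the key field, the model can learn cues such as "address :" that also appear in sentences the rule does not match.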
Example five
Referring to fig. 1, a regular expression based text extraction model building apparatus comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to the regular expressions;
S3, dividing the corpus set into a training set (80%) and a verification set (20%);
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model and training the text extraction model;
S6, inputting the verification set into the trained text extraction model and verifying the trained text extraction model.
The beneficial effect of this embodiment is that manual labeling is replaced by writing a small number of regular expressions, which effectively reduces the labor cost and time required to establish the model.
Example six
Further, the text extraction model is a CRF model.
In this example, the CRF model is constructed using the open-source "python-crfsuite" development kit.
The advantage of this embodiment is that it combines the strengths of the regular expression and the CRF model and can efficiently and accurately extract the key information in text, specifically:
Owing to the characteristics of regular expressions, the method works particularly well on text with a fixed template, such as text in the auditing field and the patent field. Meanwhile, the text extraction model serves as the executor that extracts the final text information, so it is not limited by whether the information to be extracted follows a strict template; its extraction range is far wider than that of a method based purely on regular expressions, and the method is therefore applicable to various fields.
Example seven
Further, a threshold is set for the CRF model (in this embodiment, 90%); if the model accuracy is lower than 90%, the method returns to step S1.
The improvement of this embodiment is that a small number of regular expressions are added and steps S1 to S6 are repeated to retrain the CRF model. The extraction effect of the CRF model is thereby effectively improved, and the rules written earlier need not be discarded.
Example eight
As shown in fig. 2, an enterprise bidding specification is taken as an example.
The extraction target is set to: on-site payment address.
According to the extraction target, a regular expression is written: site pay-cost address (#. This regular expression matches text in the corpus that follows the same "pattern", i.e. text of the form "on-site payment address xxxxx".
The regular expression is executed to extract a corpus set from the corpus. The corpus set comprises the matching texts and the key field information in each matching text. The corpus set is divided into a training set (80%) and a verification set (20%), and a text extraction model is constructed.
The key field information in the training set, together with the 30 words before and after it (taken from the matching text), is input into the CRF model for training. The trained model is then verified with the verification set.
The resulting CRF model can extract not only text of the form "on-site payment address xxxxx" but also sentences such as "bid address: xxxxxx" that do not conform to the written regular expression. This is because the CRF algorithm can make judgments from context information (the 30 words before and after the input serve as context), which makes up for the deficiency of the regular expression.
The above description is only an embodiment of the present invention and is not intended to limit its scope. All equivalent structures or equivalent process modifications made using the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims (4)

1. A method for establishing a text extraction model based on regular expressions, characterized by comprising the following steps:
S1, writing a plurality of regular expressions;
S2, extracting a corpus set from the corpus according to each regular expression;
S3, dividing the corpus set into a training set and a verification set;
S4, constructing a text extraction model;
S5, inputting the training set into the text extraction model and training the text extraction model;
S6, inputting the verification set into the trained text extraction model and verifying the trained text extraction model.
2. The method for establishing a text extraction model based on regular expressions according to claim 1, wherein the text extraction model is a CRF model.
3. The method for establishing a text extraction model based on regular expressions according to claim 2, wherein a threshold is further set in step S6; if the accuracy of the trained model on the verification set is lower than the threshold, the method returns to step S1.
4. An apparatus for establishing a text extraction model based on regular expressions, comprising a memory and a processor, wherein the memory stores instructions adapted to be loaded by the processor to perform the method for establishing a text extraction model based on regular expressions according to any one of claims 1 to 3.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110797247.0A CN113536768A (en) 2021-07-14 2021-07-14 Method and equipment for establishing text extraction model based on regular expression

Publications (1)

Publication Number Publication Date
CN113536768A true CN113536768A (en) 2021-10-22

Family

ID=78099155

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977345A (en) * 2017-11-14 2018-05-01 福建亿榕信息技术有限公司 A kind of generic text information abstracting method and system
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination