CN113536768A

CN113536768A - Method and equipment for establishing text extraction model based on regular expression

Info

Publication number: CN113536768A
Application number: CN202110797247.0A
Authority: CN
Inventors: 苏江文; 王燕蓉; 陈江海; 张垚; 庄莉; 梁懿
Original assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date: 2021-07-14
Filing date: 2021-07-14
Publication date: 2021-10-22

Abstract

The invention relates to a method for establishing a text extraction model based on regular expressions, comprising the following steps: S1, writing a plurality of regular expressions; S2, extracting a corpus set from the corpus according to the regular expressions; S3, dividing the corpus set into a training set and a verification set; S4, constructing a text extraction model; S5, inputting the training set into the text extraction model and training the text extraction model; and S6, inputting the verification set into the trained text extraction model and verifying the trained text extraction model.

Description

Method and equipment for establishing text extraction model based on regular expression

Technical Field

The invention relates to a method and equipment for establishing a text extraction model based on a regular expression, belonging to the field of natural language processing.

Background

A regular expression is a description of a string rule and is commonly used to retrieve and replace text that conforms to that rule. For example, the regular expression for extracting e-mail addresses, which identifies addresses in the format xxxx@xxxx.xxx, is: w) + (\\ w +). Regular expressions are flexible and can match characters in almost any pattern. However, their use presupposes that the "pattern" or "rule" of the information to be extracted is well defined. They are therefore not applicable to extracting key information from text that has no explicit rules.
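As a minimal illustration of rule-based extraction of this kind, the following Python sketch pulls e-mail addresses out of a sentence. The pattern `EMAIL_RE`, the helper `extract_emails` and the sample sentence are assumptions made for illustration only; the pattern is a simplified one and is not the expression given in this specification.

```python
import re

# Illustrative pattern only (an assumption, not the expression from the source):
# a local part, an "@", then a domain made of at least two dot-separated labels.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text):
    """Return every substring of `text` that matches EMAIL_RE."""
    return EMAIL_RE.findall(text)


sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
print(extract_emails(sample))
```

The sketch works only because the "pattern" of an e-mail address is well defined; text without such a pattern gives the expression nothing to match.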

In the process of establishing a supervised text extraction model, iterative training takes a large amount of time, the training data used determines the performance of the model to a considerable extent, and a large amount of training data must be labeled manually.

A conditional random field (CRF) model is one kind of supervised text extraction model and is commonly used for labeling the words in a corpus (for example, labeling named entities, or parts of speech such as verbs and nouns). The CRF model has strong extraction capability for key information that has no obvious pattern (where specific rules are difficult to observe manually). However, the accuracy of a CRF model is not determined by the model itself; it mainly depends on whether the labeled corpus used for training is consistent with the target test corpus. A large amount of manually labeled corpora must therefore be prepared in advance, the extraction effect is unstable, and the accuracy is difficult to predict, so the CRF model is not suitable for scenarios with strict requirements on extraction accuracy.

Patent publication No. CN201910455064.3, "Keyword corpus annotation training extraction tool", discloses an annotation training tool that reduces the complexity of the manual annotation process and improves the efficiency and accuracy of annotating massive keyword corpora. In that tool, a keyword corpus labeling preparation module distinguishes massive corpus data from different sources. A semi-automatic corpus keyword labeling module creates a keyword labeling task, autonomously selects a suitable algorithm, and performs automatic labeling based on an algorithm model: the text corpus data to be labeled is pre-labeled by integrating at least one of the keyword extraction algorithms CHI, LDA, TEXTRANK and TFIDF, and the labeling results of the multiple algorithms are fused. A feedback-type keyword labeling model learning and training module trains the keyword labeling algorithm model after the labeling task is completed, and a keyword labeling model effect evaluation module automatically evaluates the labeling effect of the model against quantitative indexes.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a method for establishing a text extraction model based on regular expressions. By writing a small number of regular expressions in place of manual labeling, the labor cost and time required for establishing a supervised text extraction model are effectively reduced.

The technical scheme of the invention is as follows:

The first technical scheme is as follows:

A method for establishing a text extraction model based on regular expressions comprises the following steps:

S1, writing a plurality of regular expressions;

S2, extracting a corpus set from the corpus according to the regular expressions;

S3, dividing the corpus set into a training set and a verification set;

S4, constructing a text extraction model;

S5, inputting the training set into the text extraction model, and training the text extraction model;

S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.

Further, the text extraction model is a CRF model.

Further, a threshold is set in step S6; if the accuracy of the trained model on the verification set is lower than the threshold, the method returns to step S1.
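The control flow of steps S1 to S6 with the threshold check can be sketched as follows. This is only an outline under stated assumptions: the function name `build_until_accurate` and the four callables `extract`, `split`, `train` and `evaluate` are hypothetical placeholders for S2, S3, S4/S5 and S6, and each round of S1 is modelled as one batch of newly written expressions added to the ones already in use.

```python
def build_until_accurate(rule_batches, extract, split, train, evaluate, threshold):
    """Repeat S1-S6 until the verification accuracy reaches `threshold`.

    rule_batches: iterable of lists of regular expressions, one list per round of S1.
    extract, split, train, evaluate: hypothetical callables standing in for
    S2, S3, S4 + S5 and S6 respectively.
    Returns (model, rules, accuracy), or (None, rules, None) if no round succeeds.
    """
    rules = []
    for batch in rule_batches:
        rules.extend(batch)                                  # S1: add new expressions
        corpus_set = extract(rules)                          # S2: build the corpus set
        training_set, verification_set = split(corpus_set)   # S3: split it
        model = train(training_set)                          # S4 + S5: construct and train
        accuracy = evaluate(model, verification_set)         # S6: verify
        if accuracy >= threshold:
            return model, rules, accuracy
        # Accuracy below the threshold: return to S1 with the next batch.
    return None, rules, None
```

Because `rules` only grows, expressions written in earlier rounds stay in use when the model is retrained.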

The second technical scheme is as follows:

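An apparatus for establishing a text extraction model based on regular expressions comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:

S1, writing a plurality of regular expressions;

S2, extracting a corpus set from the corpus according to the regular expressions;

S3, dividing the corpus set into a training set and a verification set;

S4, constructing a text extraction model;

S5, inputting the training set into the text extraction model, and training the text extraction model;

S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.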

Further, the text extraction model is a CRF model.

The invention has the following beneficial effects:

1. According to the invention, manual labeling is replaced by writing a small number of regular expressions, so that the labor cost and time required for establishing the model are effectively reduced.

2. The invention combines the advantages of the regular expression and the CRF model, can efficiently and accurately extract the key information in the text, and is specifically embodied in that:

Owing to the characteristics of regular expressions, the method works well in fields where texts follow fixed templates, such as auditing and patents. Meanwhile, the text extraction model serves as the executor of the final text information extraction and is not limited by whether the information to be extracted follows a strict template; its extraction range is far wider than that of a method based on regular expressions alone, so the method is applicable to a variety of fields.

3. By adding a small number of regular expressions and then repeating steps S1-S6 to retrain the CRF model, the extraction effect of the CRF model can be effectively improved, and the rules written earlier need not be discarded.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flow chart of Example four.

Detailed Description

The invention is described in detail below with reference to the figures and the specific embodiments.

Example one

Referring to FIG. 1, a method for establishing a text extraction model based on regular expressions includes the following steps:

S1, writing a plurality of regular expressions;

S2, extracting a corpus set from the corpus according to the regular expressions;

S3, dividing the corpus set into a training set (80%) and a verification set (20%);

S4, constructing a text extraction model;

S5, inputting the training set into the text extraction model, and training the text extraction model;

S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.

The beneficial effect of this embodiment is that manual labeling is replaced by writing a small number of regular expressions, which effectively reduces the labor cost and time required for establishing the model.
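The 80%/20% division of step S3 can be sketched in a few lines of Python. The function name `split_corpus_set`, the shuffling before the cut and the fixed random seed are assumptions for illustration; the embodiment states only the proportions.

```python
import random


def split_corpus_set(samples, train_fraction=0.8, seed=0):
    """Shuffle the corpus set and cut it into a training set and a verification set."""
    shuffled = list(samples)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


training_set, verification_set = split_corpus_set(range(10))
print(len(training_set), len(verification_set))
```

Shuffling first avoids a split that merely follows the order in which the regular expressions happened to match.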

Example two

Further, the text extraction model is a CRF model.

In this embodiment, the CRF model is constructed using the open-source "python-creating" development kit.

The advantage of this embodiment is that it combines the strengths of regular expressions and the CRF model, so that key information in text can be extracted efficiently and accurately.

Example three

Further, a threshold is set for verifying the CRF model (in this embodiment, the threshold is set to 90%); if the model accuracy is lower than 90%, the method returns to step S1.

The improvement of this embodiment is that a small number of regular expressions are added and steps S1 to S6 are repeated to retrain the CRF model. The extraction effect of the CRF model can thus be effectively improved, and the rules written earlier need not be discarded.
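One way to make the 90% check concrete is sketched below. The specification does not define how accuracy is computed, so exact-match accuracy over the verification set is an assumption, and the names `verification_accuracy` and `needs_more_rules` are hypothetical.

```python
THRESHOLD = 0.90  # value used in this embodiment


def verification_accuracy(predicted, expected):
    """Fraction of verification samples whose extracted field equals the labeled field.

    Exact string match is an assumed metric; the source only speaks of "accuracy".
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def needs_more_rules(predicted, expected, threshold=THRESHOLD):
    """True when accuracy is below the threshold, i.e. the method returns to S1."""
    return verification_accuracy(predicted, expected) < threshold
```

With ten verification samples, nine correct extractions reach the threshold exactly, while eight fall short and send the method back to step S1.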

Example four

As shown in FIG. 2, an enterprise bidding specification is taken as an example.

The extraction target is set as: on-site payment address.

According to the extraction target, a regular expression is written: site pay-cost address (#. This regular expression matches text in the corpus that has the same "pattern", i.e. text stating that "xxxxx" is the on-site payment address.

Regular expression extraction is then executed to extract a corpus set from the corpus. The corpus set comprises the matching texts and the key field information within the matching texts. The corpus set is divided into a training set (80%) and a verification set (20%), and a text extraction model is constructed.
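The extraction step can be sketched as follows, producing for each match the matching text and the key field inside it. The English pattern in `RULES`, the group name `field`, the function `extract_corpus_set` and the two sample sentences are all assumptions for illustration; they stand in for the expression written in this embodiment, which is not reproduced here.

```python
import re

# Stand-in rule (an assumption): the label, a colon, then the address up to the
# next full stop, semicolon or line break.
RULES = [re.compile(r"on-site payment address:\s*(?P<field>[^.;\n]+)")]


def extract_corpus_set(documents, rules=RULES):
    """Apply every rule to every document; keep the matching text and its key field."""
    samples = []
    for doc in documents:
        for rule in rules:
            for match in rule.finditer(doc):
                samples.append({
                    "text": match.group(0),                 # matching text
                    "field": match.group("field").strip(),  # key field information
                })
    return samples


documents = [
    "Bidders pay in person. The on-site payment address: 12 Harbor Road, Room 3. Bids close Friday.",
    "Bid address: 8 Canal Street.",
]
corpus_set = extract_corpus_set(documents)
print(corpus_set)
```

The second sentence is not matched by the rule, so it contributes nothing to the corpus set; sentences of that kind are what the trained model is expected to cover.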

The key field information in the training set, together with the 30 words before and the 30 words after it (taken from the matching text), is input into the CRF model for training. The trained model is then verified with the verification set.
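Preparing one training sequence from a key field and its surrounding 30 words might look like the sketch below. The BIO labels (`B-FIELD`, `I-FIELD`, `O`), the function name `context_window` and the representation of the key field as a token index range are assumptions; the specification does not state the labeling scheme.

```python
def context_window(tokens, start, end, width=30):
    """Cut out the key field tokens[start:end] plus up to `width` tokens on each side.

    Returns the window and one label per token: B-FIELD for the first key-field
    token, I-FIELD for the rest of the key field, O for context (assumed scheme).
    """
    lo = max(0, start - width)
    hi = min(len(tokens), end + width)
    labels = []
    for i in range(lo, hi):
        if i == start:
            labels.append("B-FIELD")
        elif start < i < end:
            labels.append("I-FIELD")
        else:
            labels.append("O")
    return tokens[lo:hi], labels


tokens = [f"w{i}" for i in range(100)]
window, labels = context_window(tokens, 50, 53)
print(len(window), labels[30:33])
```

Near the start or end of a text the window is simply shorter, since fewer than 30 words are available on that side.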

The finally obtained CRF model can extract not only text containing an on-site payment address "xxxxx", but also sentences such as "bid address: xxxxxx" that do not conform to the written regular expression. This is because the CRF algorithm makes its judgment based on context information (the 30 words before and after the key field in the input serve as context), which makes up for the deficiency of the regular expression.
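How context can be exposed to a sequence model is sketched below as a per-token feature function. The feature names, the lower-casing and the function `token_features` are assumptions, and a window of 2 is used only to keep the example small (the embodiment uses 30 words on each side); the specification does not describe the features its CRF model uses.

```python
def token_features(tokens, i, window=2):
    """Describe token i by itself and by its neighbours within `window` positions."""
    features = {"bias": 1.0, "token": tokens[i].lower()}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            # Neighbouring word, keyed by its signed offset, e.g. "-1:token".
            features[f"{offset:+d}:token"] = tokens[j].lower()
        else:
            # Position falls outside the sentence.
            features[f"{offset:+d}:pad"] = True
    return features


sentence = ["Bid", "address", ":", "8", "Canal", "Street"]
print(token_features(sentence, 3))
```

For the token "8", the features record that "address" and ":" precede it, which is the kind of contextual cue that can carry over from "on-site payment address: ..." to "bid address: ..." even though only the former matches the written expression.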

Example five

S1, writing a plurality of regular expressions;

S2, extracting a corpus set from the corpus according to the regular expressions;

S4, constructing a text extraction model;

Example six

Further, the text extraction model is a CRF model.

Example seven

Example eight

As shown in FIG. 2, an enterprise bidding specification is taken as an example.

The extraction target is set as: on-site payment address.

The above description is only an embodiment of the present invention and is not intended to limit its scope. All equivalent structures or equivalent process modifications made using the contents of this specification and drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims

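1. A method for establishing a text extraction model based on regular expressions, characterized by comprising the following steps:

S1, writing a plurality of regular expressions;

S2, extracting a corpus set from the corpus according to each regular expression;

S3, dividing the corpus set into a training set and a verification set;

S4, constructing a text extraction model;

S5, inputting the training set into the text extraction model, and training the text extraction model;

S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.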

2. The method for establishing a text extraction model based on regular expressions according to claim 1, wherein the text extraction model is a CRF model.

3. The method for establishing a text extraction model based on regular expressions according to claim 2, wherein a threshold is further set in step S6; if the accuracy of the trained model on the verification set is lower than the threshold, the method returns to step S1.

4. An apparatus for establishing a text extraction model based on regular expressions, comprising a memory and a processor, wherein the memory stores instructions adapted to be loaded by the processor to perform the method for establishing a text extraction model based on regular expressions according to any one of claims 1 to 3.