CN109992761A

CN109992761A - Rule-based adaptive text information extraction method and software memory

Info

Publication number: CN109992761A
Application number: CN201910223558.9A
Authority: CN
Inventors: 李晓林; 李道庆; 张彦铎; 田英明; 刘玮; 姚峰; 范佳莹
Original assignee: SHANGHAI HUACHUAN ENVIRONMENTAL PROTECTION TECHNOLOGY Co Ltd; Wuhan Institute of Technology
Current assignee: SHANGHAI HUACHUAN ENVIRONMENTAL PROTECTION TECHNOLOGY Co Ltd; Wuhan Institute of Technology
Priority date: 2019-03-22
Filing date: 2019-03-22
Publication date: 2019-07-09

Abstract

The invention discloses a rule-based adaptive text information extraction method and a software memory. The method comprises the following steps: constructing text information extraction rules for text objects of a professional domain and collecting the rules in a template; arranging the template rules in tree order to form a text template with a four-layer structure of section, row, sentence, and word; statistically analyzing the text object to be extracted and presetting representative keywords, which consist of related words and unrelated words; extracting information from the text to be extracted using the constructed template, matching text by keyword in the order of the template's four layers; when a level of the template yields multiple matching results, filtering them with keywords to locate the target information precisely; and outputting the text extraction result containing the keywords. The invention adapts automatically to changes in text content and structure and extracts target text information efficiently and accurately.

Description

Rule-based adaptive text information extraction method and software memory

Technical field

The present invention relates to the technical field of information processing, and in particular to a rule-based adaptive text information extraction method and a software memory.

Background art

At present, the texts of each professional domain contain a large amount of valuable information, for example court trial records, ruling records, and mediation records that document court proceedings in detail. However, manually combing through legal documents and extracting the content of interest consumes substantial manpower and material resources and is inefficient, especially when processing massive numbers of documents.

Current text extraction techniques mainly target fixed-structure text, keyword extraction, topic discovery, or adaptive information extraction from short text. These methods are not suited to text objects such as court trial records, whose form is not fixed, from which many items such as sections and sentences must be extracted, and which are comparatively long.

Summary of the invention

The technical problem to be solved by the present invention is to provide, in view of the defects in the prior art, a rule-based adaptive text information extraction method and a software memory.

The technical solution adopted by the present invention to solve this technical problem is as follows:

The present invention provides a rule-based adaptive text information extraction method comprising the following steps:

Statistically compare, analyze, and summarize the text objects of a professional domain, and construct rules for text information extraction;

Arrange the rules in tree order to form an adaptive text template. Templates fall into multiple categories according to the professional domain, and templates of different categories correspond to text objects of different categories. Each template has a four-layer structure: section, row, sentence, and word;

Statistically analyze the text object to be extracted and preset representative keywords; the keywords consist of related words and unrelated words;

Extract information from the text to be extracted using the constructed template, matching text by keyword in the order of the template's four-layer structure;

When a level of the template yields multiple matching results, filter them with keywords to locate the target information precisely;

Output the text extraction result containing the keywords.

Further, in the method of the present invention, a selected keyword is matched against the text object at the paragraph level, and the information corresponding to the text paragraph is then extracted, according to the following rules:

Match the selected keyword against the text at the paragraph level to obtain the content of the text paragraph corresponding to the keyword;

When two or more text paragraphs are obtained, filter their content by keyword to obtain accurate text paragraph location information.

Further, in the method of the present invention, when text matching is performed, the matching text information at the section, row, sentence, and word levels corresponding to the text paragraph is selected using the preset related words and unrelated words, according to the following rules:

Obtain the related words and unrelated words for the section, row, sentence, and word levels corresponding to the text paragraph;

Construct a multi-level rule template from the obtained section, row, sentence, and word levels;

Match the text object against the template, and extract the parts of the text object that contain related words but do not contain unrelated words;

Take the selected text content, together with its location information, as the point of interest of the corresponding text object in the rule template.
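The inclusion and exclusion rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `filter_by_keywords` and the sample sentences are hypothetical, plain substring tests stand in for whatever matching the method actually uses, and "contains related words" is read here as "contains at least one related word".

```python
def filter_by_keywords(candidates, related, unrelated):
    """Keep candidates with at least one related word and no unrelated word.

    Returns (position, text) pairs so the location of each hit is preserved.
    """
    kept = []
    for pos, text in enumerate(candidates):
        has_related = any(word in text for word in related)
        has_unrelated = any(word in text for word in unrelated)
        if has_related and not has_unrelated:
            kept.append((pos, text))
    return kept


# Hypothetical sample data: the second line mentions the plaintiff
# but is about the agent, so the unrelated word removes it.
sentences = [
    "Plaintiff: Zhang San",
    "Plaintiff's agent: Wang Wu",
    "Defendant: Li Si",
]
hits = filter_by_keywords(sentences, related=["Plaintiff"], unrelated=["agent"])
```

Returning the position alongside the text mirrors the requirement that the selected content be kept together with its location information.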

Further, in the method of the present invention, the text is divided according to the text object into four levels (section, row, sentence, and word), according to the following rules:

Determine from the text objects the keywords corresponding to the different text types, including related words and unrelated words;

Construct the template from the collected keywords according to the four levels of section, row, sentence, and word.
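To make the four levels concrete, the sketch below splits a text into nested sections, rows, sentences, and words. The splitting criteria are assumptions chosen for an English sample (blank lines separate sections, line breaks separate rows, sentence-final punctuation separates sentences, whitespace separates words); the patent does not specify them, and Chinese text would need Chinese punctuation and a word segmenter instead. The name `split_levels` is hypothetical.

```python
import re


def split_levels(text):
    """Split text into nested lists: sections -> rows -> sentences -> words."""
    sections = []
    # Assumption: sections are separated by one or more blank lines.
    for section in re.split(r"\n\s*\n", text.strip()):
        rows = []
        for row in section.splitlines():
            sentences = []
            # Assumption: a sentence ends at . ! ? or ; followed by whitespace.
            for sentence in re.split(r"(?<=[.!?;])\s+", row.strip()):
                if sentence:
                    sentences.append(sentence.split())
            rows.append(sentences)
        sections.append(rows)
    return sections


sample = (
    "Court: District Court. Date: 2019-03-22.\n"
    "Plaintiff: Zhang San.\n"
    "\n"
    "Defendant: Li Si."
)
levels = split_levels(sample)
```

Here `levels[0][0][1]` addresses the second sentence of the first row of the first section, which is the kind of positional address a four-layer template can attach rules to.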

Further, in the method of the present invention, text information matching also includes a template selection process, specifically:

According to the type of the text object, filter out from a preset template library two or more templates that match that type;

According to the paragraph title corresponding to each text paragraph, select from those templates the one with the highest degree of match to the paragraph title, and use it as the rule template for summarizing the text information.
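The two-stage selection above can be sketched as a type filter followed by a best-title pick. The patent does not say how the degree of match is computed; this sketch assumes the similarity ratio from Python's standard `difflib.SequenceMatcher`. The template fields `type` and `title`, the library contents, and the function name `select_template` are all hypothetical.

```python
from difflib import SequenceMatcher


def select_template(templates, text_type, paragraph_title):
    """Filter templates by text type, then pick the best title match."""
    # Stage 1: keep only templates whose type matches the text object type.
    candidates = [t for t in templates if t["type"] == text_type]
    if not candidates:
        return None

    # Stage 2: score each candidate title against the paragraph title.
    # SequenceMatcher.ratio() is an assumed stand-in for "degree of match".
    def score(template):
        return SequenceMatcher(None, template["title"], paragraph_title).ratio()

    return max(candidates, key=score)


# Hypothetical template library; type 3 stands for mediation records.
template_library = [
    {"type": 3, "title": "civil mediation record"},
    {"type": 3, "title": "labor dispute mediation record"},
    {"type": 1, "title": "court trial record"},
]
chosen = select_template(template_library, 3, "labor dispute mediation")
```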

The present invention provides a software memory storing the rule-based adaptive text information extraction method, the software in the software memory executing the following procedure:

Output the text extraction result containing the keywords.

The beneficial effects of the present invention are as follows. The rule-based adaptive text information extraction method and software memory construct a rule-based computation model that is divided, according to the extraction requirements, into four levels: section, row, sentence, and word. Matching rules defined on these four levels let the model adapt automatically to changes in text content and structure and extract target text information efficiently and accurately. The method has the following advantages: 1. its coverage is comprehensive and accurate; 2. once a complete template has been constructed, a change of domain requires changing only template content such as feature words, not the template structure, which greatly eases the work of ordinary technicians; 3. for big data processing, it can meet users' needs for information extraction from text in a large number of domains.

Brief description of the drawings

The present invention is further described below with reference to the accompanying drawings and embodiments. In the drawings:

Fig. 1 is a schematic diagram of the overall flow of an embodiment of the present invention;

Fig. 2 is a schematic diagram of the template structure of an embodiment of the present invention;

Fig. 3 shows the adaptive text information model of an embodiment of the present invention;

Fig. 4 is a screenshot of partial information from an embodiment of the present invention;

Fig. 5 is a screenshot of part of the template in an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

The rule-based adaptive text information extraction method of the embodiment of the present invention comprises the following steps:

Output the text extraction result containing the keywords.

In one specific embodiment of the present invention, the key point of the rule-based adaptive text information extraction method is to insert the corresponding rules into each of the four constructed levels.

Rule one: set an attribute for the category to which the whole article belongs, for example 1, 2, and 3 for court trial records, ruling records, and mediation records respectively. A different template is selected according to the category;

Rule two: the structure is progressive within a level and parallel between levels, and the template as a whole is a tree structure;

Rule three: the section, row, and sentence levels each contain seven attributes: "id", "Pos", "name", "title", "mode", "matches", and "words". "Pos" is the position number of the text, "name" is the function name, "title" is the short name of the extracted content, "mode" is the extraction mode, "matches" is the regular expression used for extraction, and "words" holds the feature words and anti-feature words;

Rule four: the three attributes "Pos", "name", and "title" together identify the function of each level and record the position of the extracted content;

Rule five: "mode" determines the matching mode of "matches" (global match or first match); "matches" applies a regular expression to the text content to extract the main information of interest;

Rule six: "words" records a number of feature words and keywords, which are used to screen and filter the results of "matches" so that more accurate text information is extracted;

Rule seven: arrange and output the extraction results in the order the user requires.

The above rules can be divided into three groups by function: rules one and two are base rules, rules three to six are extraction rules, and rule seven is the strategy rule.
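The interplay of the extraction attributes in rules five and six can be sketched with Python's standard `re` module. The seven attribute names come from rule three; everything else is an assumption made for illustration: the mode values "global" and "first", the layout of "words" as two lists, the sample pattern, and the function name `apply_rule`.

```python
import re


def apply_rule(node, text):
    """Apply one template node: regex match per "mode", then filter by "words"."""
    pattern = re.compile(node["matches"])
    if node["mode"] == "global":
        # Global match: collect every occurrence.
        found = [m.group(0) for m in pattern.finditer(text)]
    else:
        # First match: keep only the first occurrence, if any.
        first = pattern.search(text)
        found = [first.group(0)] if first else []

    related = node["words"]["related"]      # feature words (assumed layout)
    unrelated = node["words"]["unrelated"]  # anti-feature words (assumed layout)
    return [
        s for s in found
        if (not related or any(w in s for w in related))
        and not any(w in s for w in unrelated)
    ]


# Hypothetical node using the seven attribute names from rule three.
node = {
    "id": 1, "Pos": 2, "name": "get_defendant", "title": "defendant",
    "mode": "global",
    "matches": r"Defendant[^\n]*",
    "words": {"related": ["Defendant"], "unrelated": ["agent"]},
}
text = "Plaintiff: Zhang San\nDefendant: Li Si\nDefendant's agent: Wang Wu"
result = apply_rule(node, text)
```

In this sample the pattern matches two lines, and the anti-feature word "agent" removes the second, which is the screening role rule six gives to "words".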

In another specific embodiment of the present invention, information is extracted mainly from text objects that are court trial records. As shown in Fig. 1, the rule-based adaptive text information extraction method comprises the following steps:

Step1: obtain a number of representative document objects and determine the structural model features of the document objects, to improve adaptability to documents;

Step2: formulate text information extraction rules one to six; the rule content is set flexibly according to the features of the text object and the requirements. Convert the rules to .json format and organize them in the order of the text to build the matching template;

Step3: run the program, adjust rules three to six in the template as appropriate, and match and output the corresponding information. Taking a court mediation record as an example, the extracted information includes the court, trial time, plaintiff, defendant, points of dispute, judgment result, and trial personnel;

Step4: analyze content such as the time and trial type in the extraction results, convert it programmatically, and integrate it into the output interface; then, following rule seven and using the "Pos" attribute of the extracted text information, output it as the user requires.
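The ordering in Step4 can be sketched as a sort on the "Pos" attribute. The record layout (a dict with "Pos", "title", and a hypothetical "value" field), the output format, and the function name `format_output` are assumptions; the patent only states that results are output according to rule seven in combination with "Pos".

```python
def format_output(results):
    """Order extracted items by their "Pos" attribute and render one line each."""
    ordered = sorted(results, key=lambda r: r["Pos"])
    return [f'{r["title"]}: {r["value"]}' for r in ordered]


# Hypothetical extraction results, deliberately out of order.
results = [
    {"Pos": 3, "title": "defendant", "value": "Li Si"},
    {"Pos": 1, "title": "court", "value": "District Court"},
    {"Pos": 2, "title": "plaintiff", "value": "Zhang San"},
]
lines = format_output(results)
```

A user-specific order could be supported by swapping the sort key, for example for a lookup into a user-supplied list of titles.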

As shown in Fig. 2, the first row represents the article-level attributes, the second row the attributes at paragraph level, the third row the attributes at row level, and the fourth row the attribute structure for keyword extraction. The rule template is constructed as a tree structure.
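A tree-structured template of this kind might look like the JSON below, loaded and walked with Python's standard `json` module. The nesting keys ("sections", "rows", "sentences"), the article-level "type" attribute, and all values are hypothetical; only attribute names such as "id", "Pos", "name", "title", and "words" come from the rules above, and "mode" and "matches" are omitted to keep the sketch short.

```python
import json

# Hypothetical template: article level -> section -> row -> sentence.
TEMPLATE_JSON = """
{
  "type": 3,
  "sections": [
    {"id": 1, "Pos": 1, "name": "header", "title": "case header",
     "rows": [
       {"id": 2, "Pos": 1, "name": "court", "title": "court",
        "sentences": [
          {"id": 3, "Pos": 1, "name": "court_name", "title": "court name",
           "words": ["court"]}
        ]}
     ]}
  ]
}
"""


def walk(node, depth=0, child_keys=("sections", "rows", "sentences")):
    """Yield (depth, name) for every named node, depth-first."""
    if "name" in node:
        yield depth, node["name"]
    for key in child_keys:
        for child in node.get(key, []):
            yield from walk(child, depth + 1, child_keys)


template = json.loads(TEMPLATE_JSON)
outline = list(walk(template))
```

The depth value corresponds to the row of Fig. 2 a node would sit in, with the article-level root at depth 0.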

As shown in Fig. 3, the left half is the template built from the rules, and the right half is the target text divided by structure. The double-headed arrow in the middle means that rules are formulated from the text and the template is built from them, and that text information is then extracted adaptively according to the template.

Specific steps:

Step 1: determine the target information according to the base rules. Based on the legal provisions governing how mediation records are written, determine the written form of court mediation records and, with reference to example mediation records, construct the body of the rule template (as shown in Fig. 2);

Step 2: obtain matching results according to the extraction rules. Formulate rules by combining regular expressions, related words, and unrelated words, for example "defendant^(:)?|defendant\s+(:)?|two case defendants", and construct the rule template document, stored in .json format (as shown in Fig. 5).

Step 3: write the reading and matching programs, and extract text information using the rule template;

Step 4: output the target information according to the strategy rule. Arrange the extraction results in text order and output them (as shown in Fig. 4).

It should be understood that those of ordinary skill in the art can make modifications or variations in light of the above description, and all such modifications and variations shall fall within the scope of protection of the appended claims of the present invention.

Claims

1. A rule-based adaptive text information extraction method, characterized in that the method comprises the following steps:

Arrange the rules in tree order to form an adaptive text template. Templates fall into multiple categories according to the professional domain, and templates of different categories correspond to text objects of different categories. Each template has a four-layer structure: section, row, sentence, and word;

Statistically analyze the text object to be extracted and preset representative keywords; the keywords consist of related words and unrelated words;

Extract information from the text to be extracted using the constructed template, matching text by keyword in the order of the template's four-layer structure;

When a level of the template yields multiple matching results, filter them with keywords to locate the target information precisely;

Output the text extraction result containing the keywords.

2. The rule-based adaptive text information extraction method according to claim 1, characterized in that, in the method, a selected keyword is matched against the text object at the paragraph level, and the information corresponding to the text paragraph is then extracted, according to the following rules:

3. The rule-based adaptive text information extraction method according to claim 1, characterized in that, in the method, when text matching is performed, the matching text information at the section, row, sentence, and word levels corresponding to the text paragraph is selected using the preset related words and unrelated words, according to the following rules:

Match the text object against the template, and extract the parts of the text object that contain related words but do not contain unrelated words;

Take the selected text content, together with its location information, as the point of interest of the corresponding text object in the rule template.

4. The rule-based adaptive text information extraction method according to claim 1, characterized in that, in the method, the text is divided according to the text object into four levels (section, row, sentence, and word), according to the following rules:

5. The rule-based adaptive text information extraction method according to claim 1, characterized in that, in the method, text information matching also includes a template selection process, specifically:

According to the type of the text object, filter out from a preset template library two or more templates that match that type;

According to the paragraph title corresponding to each text paragraph, select from those templates the one with the highest degree of match to the paragraph title, and use it as the rule template for summarizing the text information.

6. A software memory storing the rule-based adaptive text information extraction method, characterized in that the software in the software memory executes the following procedure:

Output the text extraction result containing the keywords.