CN110059176A

CN110059176A - Rule-based general text information extraction and information generation method

Info

Publication number: CN110059176A
Application number: CN201910153119.5A
Authority: CN
Inventors: 骆斌; 卢坚; 伏晓
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-02-28
Filing date: 2019-02-28
Publication date: 2019-07-26
Anticipated expiration: 2039-02-28
Also published as: CN110059176B

Abstract

The present invention provides a rule-based general text information extraction and information generation method, comprising: initializing an information dictionary context, rule word packages, a rule engine and a template engine; annotating the text; defining information extraction algorithms and writing rule script code; generating a rule dependency directed graph; executing the text extraction rules and fine-tuning them according to extraction accuracy; defining information generation meta-templates; and selecting custom template rules and generating text. The invention modularizes extraction rules and makes them easier to share, supports analysis and mining of the structure of complex text information, and greatly improves the efficiency of generating text from extracted information and external information. It is especially suitable for fields, such as legal documents, in which information must be extracted from and generated for large volumes of information text. The method can significantly improve text extraction efficiency and accuracy, optimize the complexity of text extraction and improve the efficiency of information text generation.

Description

A rule-based general text information extraction and information generation method

Technical field

The present invention relates to the fields of software engineering and rule engines, concerns rule-based general text information extraction and text generation, and more specifically relates to a rule-based general text information extraction and information generation method.

Background art

As the level of informatization in enterprises rises, traditional data entry already has good solutions. The problem that more and more enterprises now face is that much information originates in semi-structured text, for which good information extraction tools are still lacking. At present, much of this information is entered either manually or through complex extraction logic. Manual entry consumes a great deal of manpower and material resources and is inefficient. Complex extraction logic achieves higher extraction accuracy but incurs high maintenance costs, its extraction rules are difficult to reuse, and the whole process takes a long time, which hinders rapid software delivery.

Summary of the invention

To solve the above problems, the present invention provides a rule-based general text information extraction and text generation method. The method supports information annotation of text, lets staff write extraction rules efficiently, and maximizes the reuse and sharing of extraction rules. It can also integrate with third-party data sources to generate well-formatted text, which in turn facilitates the extraction process and forms a virtuous cycle.

In order to achieve the above object, the invention provides the following technical scheme:

A rule-based general text information extraction and text generation method, comprising the following steps:

Step 1: initialize the information dictionary context, rule word packages, rule engine and template engine

The information dictionary is initialized as the context for information extraction, so that dynamic, extensible information extraction can be performed on the information text; the engine type defined in the configuration information is loaded, and the rule syntax parser, the syntax dependency analyzer and the rule executor are loaded; the data access engine that depends on the rule engine and supports third-party data sources is initialized; the template engine configuration information is loaded, and the defined precompiled template instructions and the written information generation templates are loaded, completing the loading of the entire template engine;

Step 2: annotate the text information

Text information extraction is modeled and analyzed, and the text information extraction model is divided into single-value information extraction and multi-value information extraction; single-value information extraction extracts the text of a single region from a passage of text, while multi-value information extraction extracts the information of several specified regions from a passage of text; the text information annotation model comprises the range of the text annotation, the annotated information features and the annotation identifier, and for each annotation the desired extraction text can be located in a passage of information text;

Step 3: define information extraction algorithms and write rule script code

Extraction rules are analyzed and modeled. The extraction rule model comprises scalar rules, shared rules, dependency-free computation rules, dependent computation rules and variable-context rules. When the user performs information extraction: if the current extraction item depends neither on other rules nor on any obvious text context, a scalar rule can be used; if the current extraction item is extracted in a way similar to other text of similar structure, the extraction rule can be shared by direct reference or by copying; if the current extraction item does not depend on other rules in the current rule context, the information can be extracted with a dependency-free computation rule; if the current extraction item depends on other rules in the current rule context, it can be computed with a dependent computation rule; if the current extraction item has a deeply nested structural dependency and its intermediate state information does not need to be extracted explicitly, a variable-context rule can be used to extract the information without affecting the current rule context;

Step 4: generate the rule dependency directed graph

The extraction rules written by the user are parsed syntactically, the dependency items and export items of each rule are derived, and the rule dependency directed graph is generated;

Step 5: execute the text extraction rules and fine-tune them according to extraction accuracy

The text extraction rules are executed in the rule engine to generate well-structured extracted text; the extracted information is compared with the original text annotations, and the extraction accuracy is computed.

Step 6: define information generation meta-templates

The user can define information generation meta-templates according to display requirements; an information generation meta-template comprises a basic information text format and several rule filling regions; to provide a general way of generating information, custom information rule extension is supported, so that the user can import information from third-party data sources in a form that conforms to the rule format;

Step 7: select custom template rules and generate text

For the same information generation meta-template, the user can select different information rules for the rule filling regions to generate text adapted to different sub-scenarios; the user can select the format in which the information text is generated.

Further, step 1 comprises the following steps:

Step 1-1: initial state;

Step 1-2: define the data structure that stores the information dictionary; the information dictionary is a hierarchical hash table structure that can support multi-level information structures;

Step 1-3: load information into the information dictionary according to the hierarchy, loading the root node first and then proceeding level by level until the leaf nodes are loaded;

Step 1-4: to obtain the result for an information item of the information dictionary, first access the leaf node and check whether the item exists there; if it does, return the result directly; otherwise search upward through the hierarchy until some level contains the item, then return the result;

Step 1-5: load the word packages from the database into the system; a word package contains one or more phrases and several optional conditional expressions; an information extraction rule can contain one or more word packages, and word packages can be obtained by writing rule scripts;

Step 1-6: load the condition functions associated with the word packages for use at runtime; some condition functions are predefined for word packages and can determine whether one or more words exist in a word package and whether a sentence contains vocabulary from the word package; users can also extend the condition functions of particular word packages through custom extensions;

Step 1-7: initialize the rule engine by loading the rule engine configuration information: select the rule engine grammar set and load the syntax parser, then load the optional syntax context dependency analyzer for the syntax parser, and finally load the rule executor, completing the loading of the entire rule engine;

Step 1-8: initialize the template engine by loading the template engine configuration information: select the template engine type, load the template engine instruction set, and load the information generation templates already defined in the system, completing the loading of the entire template engine;

Step 1-9: import the information dictionary context, rule word packages and rule engine into the algorithm, and finally integrate the template engine with the algorithm; the initialization is then complete.
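The hierarchical lookup of steps 1-2 to 1-4 can be illustrated with a minimal Python sketch. The class name `InfoDictNode`, its methods and the sample entries are hypothetical and chosen only for illustration; the patent does not prescribe a concrete data structure beyond a hierarchical hash table.

```python
from typing import Any, Dict, Optional


class InfoDictNode:
    """One level of a hierarchical information dictionary: a hash table plus a parent link."""

    def __init__(self, name: str, parent: Optional["InfoDictNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.items: Dict[str, Any] = {}

    def child(self, name: str) -> "InfoDictNode":
        """Create a lower level whose lookups fall back to this level."""
        return InfoDictNode(name, parent=self)

    def lookup(self, key: str) -> Any:
        """Check the current (leaf) level first, then search upward towards the root."""
        node: Optional["InfoDictNode"] = self
        while node is not None:
            if key in node.items:
                return node.items[key]
            node = node.parent
        raise KeyError(key)


# Load the root first, then descend level by level to the leaf (step 1-3).
root = InfoDictNode("root")
root.items["court"] = "People's Court"
civil = root.child("civil")
civil.items["party_roles"] = ["plaintiff", "defendant"]
leaf = civil.child("contract_dispute")
leaf.items["court"] = "Intermediate People's Court"

print(leaf.lookup("court"))        # resolved at the leaf level
print(leaf.lookup("party_roles"))  # resolved one level above the leaf
```

Keeping a parent link on each level lets a more specific level override a general entry while the upward search of step 1-4 still finds inherited entries.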

Further, step 2 comprises the following steps:

Step 2-1: initial state;

Step 2-2: the user selects the text from which information is to be extracted, or imports the text to be extracted into the system;

Step 2-3: the user determines the extraction regions by dividing the text in a user-defined way;

Step 2-4: the user adds the type of information to extract, single-value or multi-value;

Step 2-5: if the user selects the single-value type, the user annotates the text within the specified text;

Step 2-6: if the user selects the multi-value type, the user needs to determine the number of text extraction regions and then selects and annotates the specified text;

Step 2-7: the user names the annotation, and the system then assigns it a unique annotation identifier;

Step 2-8: text information annotation is complete.
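As an illustration of the annotation model in step 2, the following Python sketch represents one annotation as a name, a type (single-value or multi-value), a list of character ranges and a system-assigned identifier. The `Annotation` class, the `span_of` helper and the use of UUIDs as identifiers are assumptions made for this sketch and are not taken from the patent.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Annotation:
    """A user-made mark: a name, a type, and one or more character ranges."""
    name: str
    kind: str                       # "single" or "multi"
    spans: List[Tuple[int, int]]    # (start, end) offsets into the text
    identifier: str = field(default_factory=lambda: uuid.uuid4().hex)

    def __post_init__(self) -> None:
        if self.kind not in ("single", "multi"):
            raise ValueError("kind must be 'single' or 'multi'")
        if self.kind == "single" and len(self.spans) != 1:
            raise ValueError("a single-value annotation marks exactly one region")

    def extract(self, text: str) -> List[str]:
        """Return the marked text of every region."""
        return [text[start:end] for start, end in self.spans]


def span_of(text: str, phrase: str) -> Tuple[int, int]:
    """Stand-in for a user selecting `phrase` in the text with the mouse."""
    start = text.index(phrase)
    return (start, start + len(phrase))


text = "Plaintiff: Zhang San. Defendants: Li Si, Wang Wu."
plaintiff = Annotation("plaintiff", "single", [span_of(text, "Zhang San")])
defendants = Annotation("defendants", "multi",
                        [span_of(text, "Li Si"), span_of(text, "Wang Wu")])

print(plaintiff.extract(text))
print(defendants.extract(text))
```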

Further, step 3 comprises the following steps:

Step 3-1: initial state;

Step 3-2: the user selects an annotation made in step 2 and then writes the specific information extraction rule for it;

Step 3-3: during actual extraction, the user uses the several classes of extraction algorithms predefined in the rule engine; if the result of an algorithm is satisfactory, no specific rule needs to be written;

Step 3-4: otherwise, the user needs to write a custom rule: the user needs to induce features from the text to be extracted;

Step 3-5: the rule syntax parser performs lexical analysis on the extraction rules written by the user, checking whether the variables defined by the user when writing the rules conform to the specification and whether the rules referenced in the dependent rule context exist;

Step 3-6: based on the lexical analysis sequence produced in step 3-5, the rule syntax analyzer performs syntax analysis, analyzing the functions and program structures defined in the rules written by the user and reporting errors for redefined functions and incorrect program structures;

Step 3-7: in the rule script written by the user, the extracted text information is exported through predefined functions; export items can be used by other rules in the current rule context, which facilitates the extraction of structured text information;

Step 3-8: for a rule written in the rule context list, the user can perform the following operations: view the extracted content, apply the extraction rule, and analyze the dependencies of the extraction rule;

Step 3-9: defining information extraction algorithms and writing rule script code is complete.

Further, in step 3-4, inducing features from the text to be extracted specifically comprises: the user can induce features in a context-free manner, using specified keywords, phrases or regular expressions, and can also induce features in a context-sensitive manner that carries specific semantics.
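The two kinds of feature induction named for step 3-4 can be sketched in Python as follows. The function names, the sentence-splitting heuristic and the sample text are assumptions for illustration only; the patent's own rule script language is not specified here.

```python
import re
from typing import List, Optional


def extract_by_pattern(text: str, pattern: str) -> List[str]:
    """Context-free feature: every match of a regular expression, wherever it occurs."""
    return re.findall(pattern, text)


def extract_after_keyword(text: str, keyword: str, pattern: str) -> Optional[str]:
    """Context-sensitive feature: first match of `pattern` in the sentence containing `keyword`."""
    # Naive sentence split on whitespace that follows '.' or ';' (an assumption of this sketch).
    for sentence in re.split(r"(?<=[.;])\s+", text):
        if keyword in sentence:
            match = re.search(pattern, sentence)
            if match:
                return match.group(0)
    return None


text = ("The contract was signed on 2018-03-01. "
        "The defendant stopped payment on 2018-09-15.")
date = r"\d{4}-\d{2}-\d{2}"

all_dates = extract_by_pattern(text, date)
default_date = extract_after_keyword(text, "stopped payment", date)
print(all_dates, default_date)
```

The context-free variant finds both dates, whereas the context-sensitive variant uses the surrounding sentence to pick out the one date that carries the desired meaning.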

Further, step 4 comprises the following steps:

Step 4-1: initial state;

Step 4-2: the user can optionally perform rule dependency analysis; if the user chooses to perform dependency analysis, the flow proceeds to step 7, otherwise it proceeds to step 3;

Step 4-3: after the user selects rule dependency analysis, the system builds an abstract syntax tree for the rule through the rule engine;

Step 4-4: the dependency analyzer analyzes the dependency variables, local variables and rule export items in the abstract syntax tree, completing the rule dependency analysis by performing a depth-first search of the abstract syntax tree;

Step 4-5: the dependency analyzer displays the generated dependency analysis directed graph to the user; by selecting a rule dependency item or rule export item of interest, the user can view the in-degree and out-degree relationships of the current item and understand the dependency context of the current rule;

Step 4-6: the user can enter the rule adjustment stage directly by selecting a rule item in the directed graph;

Step 4-7: generating the rule dependency directed graph is complete.
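A minimal Python sketch of the directed graph of steps 4-4 and 4-5 follows. It assumes that the dependency items and export items of each rule have already been derived and are given as plain lists; the rule names and the dictionary layout are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def build_dependency_graph(rules: Dict[str, Dict[str, Iterable[str]]]) -> Dict[str, Set[str]]:
    """Return edges producer -> consumer: a rule that exports an item points to each rule depending on it."""
    producer_of = {item: name for name, rule in rules.items() for item in rule["exports"]}
    edges: Dict[str, Set[str]] = {name: set() for name in rules}
    for name, rule in rules.items():
        for item in rule["depends"]:
            if item in producer_of:
                edges[producer_of[item]].add(name)
    return edges


def degrees(edges: Dict[str, Set[str]]) -> Dict[str, Tuple[int, int]]:
    """Return (in-degree, out-degree) for every rule in the graph."""
    in_degree = {name: 0 for name in edges}
    for targets in edges.values():
        for target in targets:
            in_degree[target] += 1
    return {name: (in_degree[name], len(edges[name])) for name in edges}


rules = {
    "case_number": {"depends": [], "exports": ["case_no"]},
    "parties": {"depends": [], "exports": ["plaintiff", "defendant"]},
    "summary": {"depends": ["case_no", "plaintiff"], "exports": ["summary"]},
}
graph = build_dependency_graph(rules)
print(graph)
print(degrees(graph))
```

A rule with in-degree 0 depends on no other rule, and a rule with out-degree 0 is consumed by none, which is the kind of information a user would read off the displayed graph.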

Further, in step 5, if the text extraction accuracy does not reach the target, the extraction rules with insufficient accuracy continue to be adjusted until the extraction accuracy reaches the specified threshold.

Further, step 5 comprises the following steps:

Step 5-1: initial state;

Step 5-2: the user executes a single rule or, after all extraction rules have been written, executes all the rules together;

Step 5-3: when the user executes rules, the system takes the rule content that is free of syntax errors, as written in step 3 (defining information extraction algorithms and writing rule script code), and puts it into the rule executor for execution, using the execution mode that corresponds to the configured execution engine;

Step 5-4: first, the rule executor puts the information dictionary and word packages that the rules depend on into the rule execution context, and then executes the rules to be run in order; for a given rule, if the set of rules it depends on contains rules that have not yet been executed, those rules are executed first, until all dependencies of the current rule have been executed; the executor then backtracks to and executes the rule that was deferred;

Step 5-5: after a rule has been executed, the system compares the text annotation corresponding to the rule with the rule's export items, computes the extraction accuracy, and prompts the user about annotations that were not hit;

Step 5-6: if the extraction accuracy meets the requirement, the flow continues to step 5-7; otherwise the rule content is adjusted and execution continues from step 5-2;

Step 5-7: executing the text extraction rules and fine-tuning them according to extraction accuracy is complete.
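The dependency-first execution order described in step 5-4 can be sketched in Python as a recursive executor. The `Rule` class, the `execute_all` function and the sample rules are hypothetical; the cycle check is an addition of this sketch and is not mentioned in the patent.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Rule:
    name: str
    depends: List[str]
    run: Callable[[dict, dict], str]   # (execution context, results so far) -> exported value


def execute_all(rules: Dict[str, Rule], context: dict) -> Tuple[Dict[str, str], List[str]]:
    """Execute every rule, running unexecuted dependencies before the rule that needs them."""
    results: Dict[str, str] = {}
    order: List[str] = []
    in_progress = set()

    def execute(name: str) -> None:
        if name in results:
            return
        if name in in_progress:        # safeguard added in this sketch
            raise ValueError(f"cyclic dependency at rule {name!r}")
        in_progress.add(name)
        for dep in rules[name].depends:
            execute(dep)               # dependencies first
        in_progress.discard(name)
        results[name] = rules[name].run(context, results)   # then the deferred rule
        order.append(name)

    for name in rules:
        execute(name)
    return results, order


rules = {
    "summary": Rule("summary", ["plaintiff", "defendant"],
                    lambda ctx, res: f"{res['plaintiff']} v. {res['defendant']}"),
    "plaintiff": Rule("plaintiff", [],
                      lambda ctx, res: re.search(r"Plaintiff: (\w+)", ctx["text"]).group(1)),
    "defendant": Rule("defendant", [],
                      lambda ctx, res: re.search(r"Defendant: (\w+)", ctx["text"]).group(1)),
}
results, order = execute_all(rules, {"text": "Plaintiff: Zhang. Defendant: Li."})
print(order, results["summary"])
```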

Further, step 6 comprises the following steps:

Step 6-1: initial state;

Step 6-2: the user creates a named information generation meta-template;

Step 6-3: the user adds basic text information blocks, fixed dependent rule items and placeholders to the meta-template; a basic text information block is any text; a fixed dependent rule item is a rule item of a certain type of text extraction; a placeholder can be replaced, when the template is later generated, with text information or with a rule item from the rule-writing context;

Step 6-4: the user saves the information generation meta-template to the database;

Step 6-5: defining the information generation meta-template is complete.
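One possible in-memory shape for the meta-template of steps 6-2 to 6-4 is sketched below in Python. The `Part` and `MetaTemplate` classes and the JSON serialization used as a stand-in for database storage are assumptions of this sketch, not structures defined by the patent.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Part:
    kind: str    # "text", "rule" (fixed dependent rule item) or "placeholder"
    value: str


@dataclass
class MetaTemplate:
    name: str
    parts: List[Part] = field(default_factory=list)

    def add(self, kind: str, value: str) -> "MetaTemplate":
        """Append a basic text block, a fixed rule item or a placeholder."""
        if kind not in ("text", "rule", "placeholder"):
            raise ValueError(f"unknown part kind: {kind}")
        self.parts.append(Part(kind, value))
        return self

    def placeholders(self) -> List[str]:
        """Names of the regions that remain to be filled at generation time."""
        return [p.value for p in self.parts if p.kind == "placeholder"]

    def to_json(self) -> str:
        """Serialized form, e.g. for saving as a database column (assumed storage format)."""
        return json.dumps(asdict(self), ensure_ascii=False)


template = (MetaTemplate("case_notice")
            .add("text", "Plaintiff: ")
            .add("rule", "plaintiff")
            .add("text", "; presiding judge: ")
            .add("placeholder", "judge"))
print(template.placeholders())
print(template.to_json())
```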

Further, step 7 comprises the following steps:

Step 7-1: initial state;

Step 7-2: the user selects an existing information generation meta-template for text generation;

Step 7-3: the user chooses either to generate a temporary version or to generate a new custom template;

Step 7-4: the user replaces the placeholders in the information generation meta-template; a placeholder is replaced with ordinary text information, an information item in the rule-writing context, or a rule item in the rule context;

Step 7-5: the user selects the format for text generation and, after the placeholders in the template have been filled, downloads the generated text;

Step 7-6: selecting custom template rules and generating text is complete.
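Placeholder filling and format selection (steps 7-4 and 7-5) can be illustrated with Python's standard `string.Template`. The `${name}` placeholder syntax, the `generate_text` function, the two output formats and the sample values are assumptions of this sketch; the patent does not fix a placeholder syntax or a set of formats.

```python
import html
from string import Template
from typing import Dict


def generate_text(meta_template: str, values: Dict[str, str], fmt: str = "txt") -> str:
    """Fill every ${placeholder} and render the result in the requested format."""
    # substitute() raises KeyError if a placeholder has no value.
    filled = Template(meta_template).substitute(values)
    if fmt == "txt":
        return filled
    if fmt == "html":
        return "<p>" + html.escape(filled) + "</p>"
    raise ValueError(f"unsupported format: {fmt}")


meta = "Case ${case_no}: ${plaintiff} v. ${defendant}. Presiding judge: ${judge}."
extracted = {"case_no": "2018-12", "plaintiff": "Zhang", "defendant": "Li"}  # from extraction rules
third_party = {"judge": "Chen"}                                               # from an external source

notice = generate_text(meta, {**extracted, **third_party})
notice_html = generate_text(meta, {**extracted, **third_party}, fmt="html")
print(notice)
print(notice_html)
```

Merging the two dictionaries mirrors the idea that a placeholder may be filled either from extraction results or from third-party data.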

Compared with the prior art, the present invention has the following advantages and beneficial effects:

(1) The invention effectively extends traditional text information extraction methods, making the whole extraction process more effective and helping users mine more information from text. Information is extracted on the basis of rules: expressing the extraction content as rules makes it easier to reuse and share, and with the aid of the information dictionary and word packages the extraction process can be extended dynamically without repeated changes to the logic. Once the extraction logic has been refined into rules, syntactic analysis of the rules allows the dependencies between them to be visualized, so that users can clearly see the dependency flow of the extracted information in the current text, which provides a basis for optimizing the extraction logic and making the extraction process efficient. By introducing information generation meta-templates, extraction rules are combined with enterprise third-party data sources, making text information generation simple and efficient.

(2) Compared with traditional information extraction, the invention modularizes extraction rules and makes them easier to share; through rule dependency analysis it supports analysis and mining of the structure of complex text information; and through custom information generation templates it greatly improves the efficiency of generating text from extracted information and external information. The invention is especially suitable for fields, such as legal documents, in which information must be extracted from and generated for large volumes of information text. Practice has shown that the method can significantly improve text extraction efficiency and accuracy, optimize the complexity of text extraction and improve the efficiency of information text generation.

Brief description of the drawings

Fig. 1 is a flowchart of the rule-based general text information extraction and information generation method provided by the invention.

Fig. 2 is a structural schematic diagram of the invention.

Fig. 3 is a runtime schematic diagram of the invention.

Fig. 4 is a flowchart of the rule dependency analysis algorithm of the invention.

Fig. 5 is a schematic diagram of document annotation in the invention.

Fig. 6 is a schematic diagram of writing an actual extraction rule in the invention.

Fig. 7 is a schematic diagram of a dependency analysis result of the invention.

Fig. 8 is a schematic diagram of meta-template editing and text generation in the invention.

Fig. 9 shows the result of information text generation in the invention.

Specific embodiments

The technical solution provided by the invention is described in detail below with reference to specific embodiments. It should be understood that the following embodiments only illustrate the invention and do not limit its scope.

Fig. 1 is a flowchart of the invention, which broadly comprises an initialization phase, an information annotation phase, an extraction rule writing phase, and a template editing and text generation phase.

In the initialization phase, the preconfigured information dictionary and word packages are loaded into the system from storage such as a database; the rule parser, dependency analyzer and rule executor of the rule engine are connected to complete the initialization of the rule engine; and the meta-templates, predefined template instructions and template engine are connected to complete the initialization of the template engine.

In the information annotation phase, after initialization is complete, the user can select the text from which information is to be extracted and then annotate it with the information to extract. This phase annotates the text into a structure that corresponds to the rules written later, which makes it easier to improve extraction accuracy subsequently.

In the extraction rule writing phase, actual text information extraction requires the extraction rules to be analyzed and modeled. The user writes extraction rule scripts according to the characteristics of the information to extract. According to their characteristics, extraction rules can be divided into the following categories:

A. Scalar rule: if the current extraction item depends neither on other rules nor on text context, it is a scalar rule.

B. Shared rule: if the current extraction item is similar in structure to other text, the extraction rule can be shared by direct reference or by copying; this is called a shared rule.

C. Dependency-free computation rule: if the current extraction item does not depend on other rules in the current rule context, it is a dependency-free computation rule.

D. Dependent computation rule: if the current extraction item depends on other rules in the current rule context, it is a dependent computation rule.

E. Variable-context rule: if the current extraction item has a deeply nested structural dependency and its intermediate state information does not need to be extracted explicitly, the information can be extracted through a variable-context rule without affecting the current rule context; this is called a variable-context rule.

In the template editing and text generation phase, the user can regenerate a meta-template by adding information to the template or by selecting extraction rules, and third-party data can also be introduced into text generation through a third-party data source adapter.

More specifically, as shown in Fig. 1, the rule-based general text information extraction and information generation method provided by the invention comprises the following steps:

Information texts in different fields have information dictionaries that occur frequently in those fields. On the one hand, this step initializes the information dictionary as the context for information extraction, so that dynamic, extensible information extraction can be performed on the information text. On the other hand, because of the inherent characteristics of information text, different authors may use words of similar meaning for the same concept, so initializing the rule word packages can further improve the accuracy of information extraction. For the rule engine, the rule syntax parser, syntax dependency analyzer and rule executor are loaded according to the engine type defined in the configuration information; in addition, the data access engine that depends on the rule engine and supports third-party data sources must be initialized. For information generation, the template engine configuration information is loaded, and the defined precompiled template instructions and the written information generation templates are loaded, completing the loading of the entire template engine.

This step includes the following sub-steps:

Step 1-1: initial state;

Step 1-3: load information into the information dictionary according to the hierarchy, loading the root node first and then proceeding level by level until the leaf nodes are loaded.

Step 1-5: load the word packages from the database into the system; a word package contains one or more phrases and several optional conditional expressions; an information extraction rule may contain one or more word packages, and word packages can be obtained by writing rule scripts;

Step 1-6: load the condition functions associated with the word packages for use at runtime. Some condition functions are predefined for word packages and can determine, for example, whether one or more words exist in a word package and whether a sentence contains vocabulary from the word package; users can also extend the condition functions of particular word packages through custom extensions;

Step 1-7: initialize the rule engine by loading the rule engine configuration information. Select the rule engine grammar set and load the syntax parser, then load the optional syntax context dependency analyzer for the syntax parser, and finally load the rule executor, completing the loading of the entire rule engine;

Step 1-8: initialize the template engine by loading the template engine configuration information. Select the template engine type, load the template engine instruction set, and load the information generation templates already defined in the system, completing the loading of the entire template engine;
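The word packages and condition functions of steps 1-5 and 1-6 can be sketched in Python as follows. The `WordPackage` class, the condition labels `has_word` and `sentence_contains`, and the user-registered `starts_with` condition are hypothetical names chosen for this illustration.

```python
from typing import Callable, Dict, List


class WordPackage:
    """A named group of near-synonymous phrases with associated condition functions."""

    def __init__(self, name: str, phrases: List[str]) -> None:
        self.name = name
        self.phrases = list(phrases)
        # Predefined condition functions (step 1-6).
        self.conditions: Dict[str, Callable[["WordPackage", str], bool]] = {
            "has_word": lambda pkg, word: word in pkg.phrases,
            "sentence_contains": lambda pkg, sentence: any(p in sentence for p in pkg.phrases),
        }

    def register(self, label: str, func: Callable[["WordPackage", str], bool]) -> None:
        """Let a user extend the package with a custom condition function."""
        self.conditions[label] = func

    def check(self, label: str, argument: str) -> bool:
        return self.conditions[label](self, argument)


plaintiff_terms = WordPackage("plaintiff_terms", ["plaintiff", "claimant", "complainant"])
plaintiff_terms.register(
    "starts_with",
    lambda pkg, sentence: sentence.lower().startswith(tuple(pkg.phrases)),
)

print(plaintiff_terms.check("has_word", "claimant"))
print(plaintiff_terms.check("sentence_contains", "The claimant filed suit."))
print(plaintiff_terms.check("starts_with", "Claimant Zhang filed suit."))
```

Grouping synonyms in one package lets a single rule match texts by different authors who use different words for the same concept.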

Step 2: annotate the text information

This step models and analyzes text information extraction. Guided by the purpose of the extraction, the text information extraction model is divided into single-value information extraction and multi-value information extraction. Single-value information extraction extracts the text of a single region from a passage of text, while multi-value information extraction extracts the information of several specified regions from a passage of text. The information annotation model comprises the range of the text annotation, the annotated information features and the annotation identifier; for each annotation, the desired extraction text can be located in a passage of information text.

This step includes the following sub-steps:

Step 2-1: initial state;

Step 2-5: if the user selects the single-value type, the user can annotate the text within the specified text;

Step 2-8: text information annotation is complete.

When text information extraction is actually carried out, the extraction rules need to be analyzed and modeled. To support general text information extraction, the extraction rule model comprises scalar rules, shared rules, dependency-free computation rules, dependent computation rules and variable-context rules. These rules are explained in detail in the description of the extraction rule writing phase above.

This step includes the following sub-steps:

Step 3-1: initial state;

Step 3-3: during actual extraction, the user can use the several classes of extraction algorithms predefined in the rule engine; if the result of an algorithm is satisfactory, no specific rule needs to be written;

Step 3-4: otherwise, the user needs to write a custom rule. The user needs to induce features from the text to be extracted, either in a context-free manner, using specified keywords, phrases or regular expressions, or in a context-sensitive manner that carries specific semantics;

Step 3-5: for a general rule script, the extraction rule written by the user first undergoes lexical analysis by the rule syntax parser; this stage mainly checks whether the variables defined by the user when writing the rule conform to the specification and whether the rules referenced in the dependent rule context exist;

Step 3-6: based on the lexical analysis sequence produced in step 3-5, the rule syntax analyzer performs syntax analysis; this stage mainly analyzes the functions and program structures defined in the rules written by the user and reports errors for redefined functions and incorrect program structures;

Step 3-7: the rule script written by the user can export the extracted text information through predefined functions; export items can be used by other rules in the current rule context, which facilitates the extraction of structured text information;

Step 4: generate the rule dependency directed graph

By syntactically parsing the extraction rules written by the user and deriving each rule's dependency items and export items, a rule dependency directed graph can be generated, which helps the user optimize the current extraction rules and also helps the user understand the structure of the current text.

This step includes the following sub-steps:

Step 4-1: initial state;

Step 4-2: the user can optionally perform rule dependency analysis; if the user chooses to perform dependency analysis, the flow proceeds to step 7, otherwise it proceeds to step 3;

Step 4-3: after the user selects rule dependency analysis, the system builds an abstract syntax tree for the rule through the rule engine;

Step 4-4: the dependency analyzer analyzes the dependency variables, local variables and rule export items in the abstract syntax tree, completing the rule dependency analysis by performing a depth-first search of the abstract syntax tree;

Step 4-5: the dependency analyzer displays the generated dependency analysis directed graph to the user; by selecting a rule dependency item or rule export item of interest, the user can view the in-degree and out-degree relationships of the current item and understand the dependency context of the current rule;

Step 4-6: the user can also enter the rule adjustment stage directly by selecting a rule item in the directed graph;

Step 4-7: generating the rule dependency directed graph is complete.
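The abstract syntax tree analysis of steps 4-3 and 4-4 can be illustrated with Python's standard `ast` module. This sketch assumes, purely for illustration, that a rule script is written in Python syntax, reads dependencies through a hypothetical `depend("name")` call and exports results through a hypothetical `export("name", value)` call; the patent's actual rule grammar is not specified.

```python
import ast
from typing import Set


class DependencyVisitor(ast.NodeVisitor):
    """Depth-first traversal that collects dependency items, local variables and export items."""

    def __init__(self) -> None:
        self.depends: Set[str] = set()
        self.local_vars: Set[str] = set()
        self.exports: Set[str] = set()

    def visit_Call(self, node: ast.Call) -> None:
        if isinstance(node.func, ast.Name) and node.args:
            first = node.args[0]
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                if node.func.id == "depend":
                    self.depends.add(first.value)
                elif node.func.id == "export":
                    self.exports.add(first.value)
        self.generic_visit(node)   # keep descending into nested calls

    def visit_Name(self, node: ast.Name) -> None:
        if isinstance(node.ctx, ast.Store):   # a name being assigned is a local variable
            self.local_vars.add(node.id)


rule_source = '''
plaintiff = depend("plaintiff")
defendant = depend("defendant")
title = plaintiff + " v. " + defendant
export("case_title", title)
'''

visitor = DependencyVisitor()
visitor.visit(ast.parse(rule_source))
print(visitor.depends, visitor.local_vars, visitor.exports)
```

The rule script is only parsed, never executed, so `depend` and `export` do not need to exist for the analysis to run.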

Step 5: execute the text extraction rules and fine-tune them according to extraction accuracy

By executing the text extraction rules in the rule engine, well-structured extracted text is generated. The extracted information can be compared with the initial text annotation information to compute the extraction accuracy. If the extraction accuracy does not reach the target, the user can continue to adjust the extraction rules with insufficient accuracy until the extraction accuracy reaches the specified threshold.

This step includes the following sub-steps:

Step 5-1: initial state;

Step 5-2: the user can execute a single rule, or can execute all rules together after all extraction rules have been written;

Step 5-3: when the user executes rules, the system takes the rule content produced in step 3 (defining the information extraction algorithm and writing the rule script code) that is free of syntax errors and places it in the rule executor for execution; the specific execution mode depends on the configured execution engine;

Step 5-4: first, the rule executor places the information dictionary and word packets that the rules depend on into the rule execution context, and then executes the required rules in sequence; for a given rule, if the set of rules it depends on contains rules that have not yet been executed, those rules are executed first, until all dependencies of the current rule have been executed, and the executor then backtracks to execute the rule that was deferred;

Step 5-6: if the extraction accuracy meets the requirement, proceed to step 6; otherwise adjust the rule content and continue from step 1;

Step 5-7: execution of the text extraction rules and fine-tuning according to extraction accuracy is complete;
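The dependency-first execution of step 5-4 and the accuracy comparison against annotations can be sketched as below. This is an illustration under stated assumptions: rules are modeled as plain Python functions, accuracy is taken to be the fraction of annotated items that the extraction reproduces exactly, and all names (`execute_rules`, `accuracy`, the sample rules) are hypothetical rather than taken from the patent.

```python
import re


def execute_rules(rules, context):
    """Execute every rule, running unexecuted dependencies first.

    `rules` maps a rule name to (names of rules it depends on, function).
    Each function receives the shared context and returns the extracted value.
    Returns the order in which rules were actually executed.
    """
    done, in_progress, order = set(), set(), []

    def run(name):
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"cyclic dependency at rule {name!r}")
        in_progress.add(name)
        depends_on, func = rules[name]
        for dep in depends_on:      # execute missing dependencies first
            run(dep)
        context[name] = func(context)  # then return to the deferred rule
        in_progress.discard(name)
        done.add(name)
        order.append(name)

    for name in rules:
        run(name)
    return order


def accuracy(extracted, annotated):
    """Fraction of annotated items matched exactly (assumed metric)."""
    if not annotated:
        return 0.0
    hits = sum(1 for key, value in annotated.items() if extracted.get(key) == value)
    return hits / len(annotated)


context = {"text": "Plaintiff: Zhang San. Court: Nanjing Intermediate Court."}
rules = {
    "summary": (["plaintiff", "court"],
                lambda c: c["plaintiff"] + " @ " + c["court"]),
    "plaintiff": ([], lambda c: re.search(r"Plaintiff: ([^.]+)\.", c["text"]).group(1)),
    "court": ([], lambda c: re.search(r"Court: ([^.]+)\.", c["text"]).group(1)),
}
order = execute_rules(rules, context)
annotated = {"plaintiff": "Zhang San", "court": "Nanjing Court"}
score = accuracy(context, annotated)
```

Here `summary` is listed first but runs last because both of its dependencies are executed before it. The annotation for `court` deliberately differs from the extracted value, so the score falls below 1 and the user would adjust that rule and execute again.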

Step 6: define the information generation meta-template

The user can define an information generation meta-template according to the presentation requirements. An information generation meta-template mainly consists of a basic information text format and several rule filling regions. To provide a general way of generating information, the method allows custom rule information to be extended, so that the user can import information from third-party data sources in a form that conforms to the rule format.

This step includes the following sub-steps:

Step 6-1: initial state;

Step 6-2: the user can create a named information generation meta-template;

Step 6-3: the user can add basic text information blocks, fixed dependent rule items and placeholders to the meta-template;

A basic text information block can contain arbitrary text;

A fixed dependent rule item can be a rule item for a certain type of text extraction;

A placeholder can be replaced later, at template generation time, by text information or by a rule item in the rule-writing context;

Step 6-5: definition of the information generation meta-template is complete.
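One possible in-memory shape for such a meta-template is sketched below. The patent does not specify a data structure, so the class names (`TextBlock`, `RuleItem`, `Placeholder`, `MetaTemplate`) and fields are assumptions chosen only to mirror the three kinds of content named in step 6-3.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextBlock:
    text: str        # arbitrary fixed text


@dataclass
class RuleItem:
    rule_name: str   # fixed dependency on an extraction rule item


@dataclass
class Placeholder:
    key: str         # filled in later, at text generation time


Part = Union[TextBlock, RuleItem, Placeholder]


@dataclass
class MetaTemplate:
    name: str
    parts: List[Part] = field(default_factory=list)

    def placeholders(self):
        """Keys the user still has to fill before generating text."""
        return [p.key for p in self.parts if isinstance(p, Placeholder)]

    def required_rules(self):
        """Extraction rule items the template is bound to."""
        return [p.rule_name for p in self.parts if isinstance(p, RuleItem)]


template = MetaTemplate(
    name="case_summary",
    parts=[
        TextBlock("Case: "),
        RuleItem("case_title"),
        TextBlock("\nJudge: "),
        Placeholder("judge"),
    ],
)
```

Keeping fixed rule items and placeholders as separate types lets the same template be reused across sub-scenarios: the rule items stay bound, and only the placeholders are chosen at generation time.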

Step 7: custom template rule selection and text generation

For the same information generation meta-template, the user can select different information rules for the rule filling regions to generate text suited to different sub-scenarios. The user can then choose the final output format and generate the information text.

This step includes the following sub-steps:

Step 7-1: initial state;

Step 7-2: the user can choose an existing information generation meta-template for text generation;

Step 7-3: the user can choose to generate a temporary version, or can generate a new custom template;

Step 7-4: first, the user replaces the placeholders in the information generation meta-template; a placeholder can be replaced by ordinary text information, an information item in the rule-writing context, or a rule item in the rule context;

Step 7-5: after filling the placeholders in the template, the user can choose the output format of the generated text, including TXT, DOC, DOCX, PDF, etc., and can then download the generated text;
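Placeholder replacement in step 7-4 can be illustrated with Python's standard `string.Template`. This is only a sketch: the `${...}` placeholder syntax, the function name `generate_text`, and the choice that user-supplied values override rule-context items are assumptions, and only plain-text output is shown.

```python
from string import Template


def generate_text(template_text, rule_context, user_values):
    """Fill ${...} placeholders in a template.

    Values come from the rule context (extracted items) and from text the
    user supplies directly; user-supplied values take priority (assumption).
    Raises KeyError if a placeholder has no value.
    """
    values = {**rule_context, **user_values}
    return Template(template_text).substitute(values)


template_text = "Case: ${case_title}\nJudge: ${judge}"
rule_context = {"case_title": "Zhang San v. Li Si"}
result = generate_text(template_text, rule_context, {"judge": "Wang Wu"})
```

Using `substitute` rather than `safe_substitute` makes an unfilled placeholder fail loudly instead of leaking `${...}` into the generated document.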

Fig. 2 is a structural schematic diagram of the present invention, and Fig. 3 is a schematic diagram of the invention at run time. The core structure of the invention consists of dynamically configured information, an extensible rule engine and an efficient template engine. If the information to be extracted from some region of the text changes, the user can handle the change by adding to the information dictionary context and extending the word packets. If both the extraction rule and the extracted content for some region of the text change, the user can first clarify the dependencies of the current extraction through the dependency directed graph, and then select, adjust and re-apply the rules that need modification. In enterprises, the requirements for text generation often change frequently. With the meta-template design of the invention, certain information items and text regions can be conveniently replaced and redesigned without writing any code; the template generation task can be completed through online configuration of information items and rules.

Fig. 4 is the algorithm flow chart of the rule dependency parser of the present invention. After the user has written a rule, the rule passes through the rule syntax analyzer, which generates an abstract syntax tree containing expression information. Traversing the abstract syntax tree yields the dependency items and export items of the rule. The algorithm needs to identify specific expressions in the syntax tree, which are classified as follows:

A. Dependency-item-related expressions: property access expressions, variable expressions and array expressions.

B. Export-item-related expressions: method call expressions.

C. Local variable expressions: declaration expressions and assignment expressions.

The algorithm steps are as follows.

Step 1: the rule is converted into an abstract syntax tree by the syntax parser;

Step 2: traverse the abstract syntax tree;

Step 3: has the expression traversal finished? If so, the traversal ends; otherwise go to step 4;

Step 4: is the current expression a local variable definition expression? If so, add it to the local variable set and go to step 3; otherwise go to step 5;

Step 5: is the current expression an attribute expression whose attribute is not in the local variable set? If so, add it to the dependency set; otherwise go to step 6;

Step 6: is the current expression a method call expression whose method is an export function? If so, add it to the definition set; otherwise go to step 7;

Step 7: other expressions are traversed recursively; go to step 3;

Step 8: output the dependency items and definition items;

Step 9: end.
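The traversal above can be sketched with Python's standard `ast` module standing in for the patent's rule syntax parser. The sketch rests on several assumptions that are not from the source: rules are written as Python snippets, dependency items appear as free variable names, and results are published through a hypothetical `export(name, value)` function. It also collects local variables in a separate first pass instead of the single pass shown in the flow.

```python
import ast
import builtins


def analyze_rule(source, export_name="export"):
    """Return (dependency items, export items) for one rule script."""
    tree = ast.parse(source)
    local_vars, depends, exports = set(), set(), set()

    # Pass 1: local variable expressions (names that are assigned to).
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            local_vars.add(node.id)

    # Pass 2: export calls and reads of non-local names.
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id == export_name):
            # Export item: first argument is assumed to be the item name.
            if node.args and isinstance(node.args[0], ast.Constant):
                exports.add(node.args[0].value)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            # Dependency item: a name that is read but neither local,
            # nor the export function, nor a Python builtin.
            if (node.id not in local_vars and node.id != export_name
                    and not hasattr(builtins, node.id)):
                depends.add(node.id)
    return depends, exports


rule = '''
name = plaintiff.strip()
title = name + " v. " + defendant
export("case_title", title)
'''
deps, exps = analyze_rule(rule)
```

For this hypothetical rule, `name` and `title` are local variables, so only `plaintiff` and `defendant` are reported as dependency items, and `case_title` is reported as the export item.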

Fig. 5 is a schematic diagram of text annotation in the present invention. After importing text information, the user can annotate it. The user selects a text region and, after selection, can assign a name to the region for use when writing subsequent extraction rules. Once the text has been annotated, the annotated segments can be used to locate the corresponding text information.

Fig. 6 is a schematic diagram of writing an actual extraction rule in the present invention, and Fig. 7 is a schematic diagram of a dependency analysis result. After annotating the text, the user can write specific extraction rules to extract information from the text. The user can extract text information in detail by configuring algorithm rules, by using keywords and regular expressions, or by writing scripts. At the same time, the user can view the dependency directed graph for the text, analyze the dependency items and export items of the current rule, and optimize and adjust the rule being written.

Fig. 8 is a schematic diagram of meta-template editing and text generation in the present invention, and Fig. 9 shows an information text generation result. After the extraction rules for the text information have been written, the user can enter the text generation template editing module and, by defining the template content for text generation and the extraction rules it depends on, finally generate the required information text.

In conclusion traditional Text Information Extraction method has been carried out effective expansion by the present invention, so that entire extract Process is more efficient, excavates more information contents from text information convenient for user.And rule-based extraction script can be with Efficient multiplexing is carried out, and brings the promotion of maintenance aspect.By carrying out syntactic analysis to rule script, the present invention can be helped It helps user to understand the dependence of current various texts, makees good place mat for subsequent information extraction and extraction optimization.Except this In addition, in order to preferably be utilized Extracting Information, the invention proposes the concept of meta template, user can be by online may be used Mode depending on changing carries out the generation of text, greatly reduces the complexity of text generation, improves the efficiency of text generation.

The technical means disclosed in the embodiments of the present invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications also fall within the protection scope of the invention.

Claims

1. A rule-based general text information extraction and text generation method, characterized by comprising the following steps:

Step 1: initialize the information dictionary context, rule word packets, rule engine and template engine

Initialize the information dictionary as the context for information extraction, used for dynamic, extensible extraction of information from the information text; load the engine categories defined in the configuration information, and load the rule syntax parser, the syntax dependency parser and the rule executor; initialize the data access engine on which the rule engine depends and which supports third-party data sources; by loading the template engine configuration information, load the defined precompiled template instructions and the written information generation templates, thereby completing the loading of the entire template engine;

Step 2: annotate the text information

Model the text information extraction: the text information extraction model is divided into single-value information extraction and multi-value information extraction; single-value information extraction means extracting the text of a single region from a passage of text, and multi-value information extraction means extracting the information of multiple specified regions from a passage of text; the text information annotation model comprises the range of the text annotation, the annotation information features and the annotation identifier; for each annotation, the desired extraction text can be found in a passage of information text;

Step 3: define the information extraction algorithm and write the rule script code

Model the extraction rules: the extraction rule model comprises scalar rules, shared rules, dependency-free computation rules, dependent computation rules and variable context rules; when the user performs information extraction, if the item being extracted depends on no other item and on no obvious textual context rule, a scalar rule can be used; if the item is extracted in a way similar to other texts of similar structure, the extraction rule can be shared by direct reference or copying; if the item does not depend on other rules in the current rule context, it can be extracted by a dependency-free computation rule; if the item depends on other rules in the current rule context, it can be computed by a dependent computation rule; if the item has deep structural dependencies whose intermediate state does not need to be explicitly extracted, it can be extracted by a variable context rule without affecting the current rule context;

Step 4: generate the rule dependency directed graph

Parse the syntax of the extraction rules written by the user, derive each rule's dependency items and export items, and generate the rule dependency directed graph;

Step 5: execute the text extraction rules and fine-tune them according to extraction accuracy

Execute the text extraction rules in the rule engine to generate well-structured extracted text, compare the extracted information with the initial text annotation information, and compute the extraction accuracy;

Step 6: define the information generation meta-template

The user can define an information generation meta-template according to the presentation requirements; the information generation meta-template comprises a basic information text format and several rule filling regions; to provide a general way of generating information, custom information rules can be extended, so that the user can import information from third-party data sources in a form that conforms to the rule format;

Step 7: custom template rule selection and text generation

For the same information generation meta-template, the user can select different rule information for the rule filling regions to generate text suited to different sub-scenarios; the user can select a format for information text generation.

2. The rule-based general text information extraction and text generation method according to claim 1, characterized in that step 1 comprises the following steps:

Step 1-1: initial state;

Step 1-2: define the data structure for storing the information dictionary; the data structure of the information dictionary is a hierarchical hash table, which can support multi-level information structures;

Step 1-3: load information into the information dictionary according to the hierarchy, loading the root node first and then proceeding level by level until the leaf nodes have been loaded;

Step 1-4: to obtain the information result for an information sub-item of the information dictionary, first access the leaf node and check whether the sub-item exists there; if it exists at the current leaf node, return the information item result directly; otherwise search upward along the hierarchy until some level contains the information sub-item, then return the information item result;

Step 1-5: load the word packets from the database into the system; a word packet contains one or more phrases and several optional conditional expression statements; an information extraction rule can include one or more word packets, which can be obtained by writing rule scripts;

Step 1-6: load the condition discriminant functions associated with the word packets for use at run time; for word packets, some condition discriminant functions are predefined, which can determine whether one or more words exist in a word packet and whether a sentence contains vocabulary in the word packet; the user can also extend the conditional functions for certain word packets through custom extension;

Step 1-7: initialize the rule engine by loading the rule engine configuration information: select the rule engine grammar set, load the syntax parser, then load the optional syntax context dependency parser for the syntax parser, and finally load the rule executor, completing the loading of the entire rule engine;

Step 1-8: initialize the template engine by loading the template engine configuration information: select the template engine type, load the template engine instruction set, and load the information generation templates already defined in the system, completing the loading of the entire template engine;

Step 1-9: import the information dictionary context, the rule word packets and the rule engine into the algorithm, and finally integrate the template engine with the algorithm; the entire initialization is complete.

3. The rule-based general text information extraction and text generation method according to claim 1, characterized in that step 2 comprises the following steps:

Step 2-1: initial state;

Step 2-6: if the user selects the multi-value type, the user needs to determine the number of text extraction regions and then select and annotate the specified text;

Step 2-8: text information annotation is complete.

4. The rule-based general text information extraction and text generation method according to claim 1, characterized in that step 3 comprises the following steps:

Step 3-1: initial state;

Step 3-2: the user selects an annotation made in step 2 and then writes the specific information extraction rule;

Step 3-3: during actual extraction, the user uses the several classes of extraction algorithms predefined in the rule engine; if the extraction result of an algorithm is satisfactory, no specific rule needs to be written;

Step 3-4: otherwise, the user needs to write a custom rule: the user needs to summarize features from the text to be extracted;

Step 3-5: the rule syntax parser performs lexical analysis on the extraction rule written by the user, identifying whether the variables defined in the rule conform to the specification and whether the rule context items it depends on exist;

Step 3-6: based on the token sequence generated in step 3-5, the rule syntax analyzer further performs syntax analysis, analyzing the functions and program structures defined in the rule written by the user, and reporting errors for redefined functions and incorrect program structures;

Step 3-7: the rule script written by the user exports the extracted text information through predefined functions; the export items are made available to other rules in the current rule context to facilitate the extraction of structured text information;

Step 3-8: the user can perform the following operations on the rules written in the rule context list: view the extracted content, apply the extraction rule, and analyze the dependencies of the extraction rule;

5. The rule-based general text information extraction and text generation method according to claim 4, characterized in that, in step 3-4, the user summarizing features from the text to be extracted specifically comprises: the user can summarize features in a context-independent manner, such as keywords, specified phrases or regular expressions, or in a context-dependent manner involving specific semantics.

6. The rule-based general text information extraction and text generation method according to claim 1, characterized in that step 4 comprises the following steps:

Step 4-1: initial state;

Step 4-2: the user can optionally perform rule dependency analysis; if the user chooses to perform dependency analysis, proceed to step 7, otherwise proceed to step 3;

Step 4-3: after the user selects rule dependency analysis, the system builds an abstract syntax tree for the rule through the rule engine;

Step 4-4: the dependency parser analyzes the dependency variables, local variables and rule export items in the abstract syntax tree, and completes the rule dependency analysis by performing a depth-first search of the abstract syntax tree;

Step 4-5: the dependency parser displays the generated dependency analysis directed graph to the user; the user selects a rule dependency item or rule export item of interest to view the in-degree and out-degree relationships of the current item and understand the dependency context of the current rule;

Step 4-7: generation of the rule dependency directed graph is complete.

7. The rule-based general text information extraction and text generation method according to claim 1, characterized in that, in step 5, if the text extraction accuracy does not reach the target, the extraction rules with insufficient accuracy continue to be adjusted until the extraction accuracy reaches the specified threshold.

8. The rule-based general text information extraction and text generation method according to claim 7, characterized in that step 5 comprises the following steps:

Step 5-1: initial state;

Step 5-2: the user executes a single rule, or, after all extraction rules have been written, executes all the rules together;

Step 5-3: when the user executes rules, the system takes the rule content produced in step 3 (defining the information extraction algorithm and writing the rule script code) that is free of syntax errors and places it in the rule executor for execution, using the execution mode that corresponds to the configured execution engine;

Step 5-4: first, the rule executor places the information dictionary and word packets that the rules depend on into the rule execution context, and then executes the required rules in sequence; for a given rule, if the set of rules it depends on contains rules that have not yet been executed, those rules are executed first, until all dependencies of the current rule have been executed, and the executor then backtracks to execute the rule that was deferred;

Step 5-5: after a rule has been executed, the system compares the text annotation corresponding to the rule with the rule's export item, computes the extraction accuracy, and prompts for annotated text information that was not hit;

Step 5-6: if the extraction accuracy meets the requirement, continue to step 5-7; otherwise adjust the rule content and continue from step 5-2;

9. The rule-based general text information extraction and text generation method according to claim 1, characterized in that step 6 comprises the following steps:

Step 6-1: initial state;

Step 6-3: the user adds basic text information blocks, fixed dependent rule items and placeholders to the meta-template; a basic text information block is arbitrary text information; a fixed dependent rule item is a rule item for a certain type of text extraction; a placeholder can be replaced later, at template generation time, by text information or by a rule item in the rule-writing context;

Step 6-5: definition of the information generation meta-template is complete.

10. The rule-based general text information extraction and text generation method according to claim 1, characterized in that step 7 comprises the following steps:

Step 7-1: initial state;

Step 7-4: the user replaces the placeholders in the information generation meta-template; a placeholder is replaced by ordinary text information, an information item in the rule-writing context, or a rule item in the rule context;

Step 7-5: after filling the placeholders in the template, the user selects the format of the generated text and then downloads the generated text;