Summary of the Invention
To solve the above technical problem, the present application provides an information extraction method that reduces
the large amount of labor and time spent on constructing rules and extracts information from text more
comprehensively and accurately.
In a first aspect, an information extraction method is provided, including:
obtaining a text from which information is to be extracted and an extraction expression, where the extraction
expression includes a region determination rule and an information extraction rule, the region determination rule
includes a statistical operator, and the statistical operator represents a statistical model for identifying named
entities and/or dependency components in text;
identifying named entities and/or dependency components in the text using the statistical model, and marking each
identified named entity and/or dependency component with a corresponding identification label;
comparing the region determination rule with the text using the identification labels, to determine an effective
extraction region in the text; and
extracting, from the effective extraction region, a character string that matches the information extraction rule.
With reference to the first aspect, in a first possible implementation of the first aspect, the information
extraction rule includes a statistical operator;
the step of extracting, from the effective extraction region, a character string that matches the information
extraction rule specifically includes:
extracting, using the identification labels, a character string that matches the information extraction rule from
the effective extraction region.
With reference to the first implementation of the first aspect, in a second possible implementation of the first
aspect, the statistical model includes a first model for identifying named entities and a second model for
identifying dependency components, and the identification label includes a first label and a second label;
if the region determination rule includes only one of the statistical operator representing the first model and the
statistical operator representing the second model, and the information extraction rule includes the other of the
two, then the step of identifying named entities and/or dependency components in the text using the statistical
model and marking each identified named entity and/or dependency component with a corresponding identification
label specifically includes:
identifying named entities/dependency components in the text using the first model/second model, and marking each
identified named entity/dependency component with the first label/second label; and
identifying dependency components/named entities in the effective extraction region using the second model/first
model, and marking each identified dependency component/named entity with the second label/first label.
With reference to the first aspect and the foregoing possible implementations, in a third possible implementation of
the first aspect, the types of the first label include a person-name label, a place-name label and an organization
label, and the types of the second label include a core-component label, a dependent-word label, an agent-component
label and a patient-component label;
the step of marking each identified named entity with the first label includes:
if the named entity identified using the first model is a person name, a place name or an organization, marking the
identified named entity with the corresponding person-name label, place-name label or organization label;
the step of marking each identified dependency component with the second label includes:
if the dependency component identified using the second model is a core component, a dependent word, an agent
component or a patient component, marking the identified dependency component with the corresponding core-component
label, dependent-word label, agent-component label or patient-component label;
the step of comparing the region determination rule with the text using the identification labels, to determine the
effective extraction region in the text, includes:
comparing the region determination rule with the text, where, if a specified label carried by the statistical
operator in the region determination rule matches the type of a first label/second label, the statistical operator
matches the character string marked with that first label/second label, the specified label being used to represent
the type of named entity or the type of dependency component that the user expects to identify from the text; and
determining the effective extraction region according to the position at which the region determination rule
matches the text.
With reference to the first aspect and the foregoing possible implementations, in a fourth possible implementation of
the first aspect, the step of extracting, using the identification labels, a character string that matches the
information extraction rule from the effective extraction region includes:
comparing the information extraction rule with the effective extraction region, where, if a specified label carried
by the statistical operator in the information extraction rule matches the type of a first label/second label, the
statistical operator matches the character string marked with that first label/second label; and
extracting the character string that matches the information extraction rule.
With reference to the first aspect and the foregoing possible implementations, in a fifth possible implementation of
the first aspect, the region determination rule further includes a regular expression, where the statistical
operator and the regular expression have a sequential relationship and/or a logical operation relationship.
With reference to the first aspect and the foregoing possible implementations, in a sixth possible implementation of
the first aspect, the region determination rule or the information extraction rule further includes a business
element concept/general concept, where the business element concept/general concept has a sequential relationship
and/or a logical operation relationship with the statistical operator or with the regular expression.
In a second aspect, an information extraction method is provided, including:
obtaining a text from which information is to be extracted and an extraction expression, where the extraction
expression includes a region determination rule and an information extraction rule, the information extraction rule
includes a statistical operator, and the statistical operator represents a statistical model for identifying named
entities and/or dependency components in text;
identifying named entities and/or dependency components in the text using the statistical model, and marking each
identified named entity and/or dependency component with a corresponding identification label;
determining an effective extraction region in the text using the region determination rule; and
extracting, using the identification labels, a character string that matches the information extraction rule from
the effective extraction region.
In a third aspect, an information extraction device is provided, including:
a first acquisition unit, configured to obtain a text from which information is to be extracted and an extraction
expression, where the extraction expression includes a region determination rule and an information extraction
rule, the region determination rule includes a statistical operator, and the statistical operator represents a
statistical model for identifying named entities and/or dependency components in text; and
a first processing unit, configured to: identify named entities and/or dependency components in the text using the
statistical model, and mark each identified named entity and/or dependency component with a corresponding
identification label; compare the region determination rule with the text using the identification labels, to
determine an effective extraction region in the text; and extract, from the effective extraction region, a
character string that matches the information extraction rule.
In a fourth aspect, an information extraction device is provided, including:
a second acquisition unit, configured to obtain a text from which information is to be extracted and an extraction
expression, where the extraction expression includes a region determination rule and an information extraction
rule, the information extraction rule includes a statistical operator, and the statistical operator represents a
statistical model for identifying named entities and/or dependency components in text; and
a second processing unit, configured to: identify named entities and/or dependency components in the text using the
statistical model, and mark each identified named entity and/or dependency component with a corresponding
identification label; determine an effective extraction region in the text using the region determination rule; and
extract, using the identification labels, a character string that matches the information extraction rule from the
effective extraction region.
In the information extraction method of the present application, a text from which information is to be extracted
and an extraction expression are first obtained. The extraction expression includes a region determination rule and
an information extraction rule, and the region determination rule and/or the information extraction rule includes a
statistical operator; that is, the statistical model for identifying named entities and/or dependency components is
defined as a statistical operator and introduced into the expression, which yields the extraction expression. The
statistical model is then used to identify named entities and/or dependency components in the text, and each
identified named entity and/or dependency component is marked with a corresponding identification label. Next,
either the identification labels are used to compare the region determination rule with the text, the effective
extraction region in the text is determined, and a character string that matches the information extraction rule is
extracted from the effective extraction region; or the region determination rule is used to determine the effective
extraction region in the text, and the identification labels are used to extract, from the effective extraction
region, a character string that matches the information extraction rule. In this way, the statistical model for
identifying named entities and/or dependency components is invoked in the manner of a rule, so that it takes part
in matching the extraction expression against the text, which is convenient and flexible to use. Compared with a
pure regular expression, the range of recognizable vocabulary is expanded and the information the user needs can be
extracted more comprehensively, while the large amount of labor and time spent on building regular expressions is
avoided; compared with a method based purely on a statistical model, the information the user needs can be
extracted more accurately.
Detailed Description
Embodiments of the present application are described in detail below.
In a rule-based extraction method, a regular expression contains an information extraction rule, which is used to
extract from the text the information the user expects. For example, the information extraction rule
"of medium height|average build" is matched against a text; when "of medium height" or "average build" appears in
the text, that description of a person's build is extracted. To extract information more comprehensively, modelers
have to enumerate every possible form of expression one by one in order to build the regular expression, which
costs a large amount of labor and time.
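As a minimal illustration of the rule-based approach just described, the following Python sketch applies the
example rule with the standard `re` module. The sample sentence and the function name `extract` are assumptions
made for illustration only; they do not come from the application.

```python
import re
from typing import List

# The example information extraction rule, written as a regular expression.
RULE = re.compile(r"of medium height|average build")


def extract(text: str) -> List[str]:
    """Return every substring of `text` that matches the rule."""
    return RULE.findall(text)


# Hypothetical sample sentence.
sample = "The suspect is of medium height and wears a grey coat."
matches = extract(sample)
```

Every additional way of phrasing the same fact needs another alternative in the pattern, which is the enumeration
cost the paragraph above refers to.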
Besides rule-based extraction, statistics-based extraction can also be used to extract information. A statistical
model, such as a hidden Markov model (HMM), a maximum entropy Markov model (MEMM), a conditional random field (CRF)
model or a support vector machine (SVM) model, is first trained on a corpus in which the information the user
wishes to extract has been annotated, and the trained statistical model is then used to extract information.
Statistics-based extraction does not require professional modelers to build regular expressions, which saves labor
and time. Compared with rule-based extraction, however, statistics-based extraction is on the whole less precise
and less accurate. The main reasons are as follows. On the one hand, an insufficiently comprehensive training
corpus affects the accuracy of the statistical model. On the other hand, when the user's extraction requirement is
more complex and goes beyond simply extracting the named entities and the like that statistical models are good at,
the accuracy of statistics-based extraction in application is also affected.
For this reason, the present application proposes a new information extraction method in which the statistical
model for identifying named entities and/or dependency components is defined as a statistical operator and
introduced into a regular expression, which yields an extraction expression. The statistical model is invoked in
the manner of a rule, so that it takes part in matching against the text, which is convenient and flexible to use.
When a text is processed with the extraction expression, the range of recognizable vocabulary is wider than with a
pure regular expression and the information the user needs can be extracted more comprehensively, while the large
amount of labor and time spent on building rules is avoided; compared with a method based purely on a statistical
model, the information the user needs can be extracted more accurately.
Extraction expression formula in the application includes two parts:Region determines rule and information extraction rules.Characterization statistics mould
The Statistical Operator of type both can only be introduced into region and determine in rule, can also only be introduced into information extraction rules, can be with
Region is introduced into simultaneously to determine in rule and information extraction rules.Understand these three situations for the ease of clearly articulating, below
It will respectively be described by two embodiments:In one embodiment, region determines in rule to include Statistical Operator, and information is taken out
It takes and may include can not also including Statistical Operator in rule;In the second embodiment, it is calculated comprising statistics in information extraction rules
Son, region determine in rule may include can not also including Statistical Operator.
Referring to FIG. 1, in the first embodiment, an information extraction method is provided, including the following
steps S100 to S400.
S100: Obtain a text from which information is to be extracted and an extraction expression, where the extraction
expression includes a region determination rule and an information extraction rule, the region determination rule
includes a statistical operator, and the statistical operator represents a statistical model for identifying named
entities and/or dependency components in text.
In the present application, the text from which information is to be extracted may come from the Internet, from a
specific database, and so on; the application does not limit the source or form of the text.
A named entity is a person name, an organization name, a place name or any other entity identified by a name; in a
broader sense, entities may also include numbers, dates, currencies, addresses and the like.
A dependency component is a syntactic component contained in a sentence, for example a core component, a dependent
word, an agent component or a patient component. Within a sentence, words stand in directed relationships of
governing and being governed. The word in the governing position is called the governor, which is the core
component in the present application; the word in the governed position is called the dependent, which is the
dependent word in the present application. In general, the verb is the center of the sentence and governs the other
components; that is, those components are subordinate to the verb through various dependency relations, and the
relation is one-directional. Besides syntactic components such as the core component and the dependent word, the
semantic roles attached to a predicate (a verb or a noun) in a sentence can also be analyzed, such as the agent and
the patient. Each semantic role carries a particular meaning: the agent in a sentence is the agent component of the
present application, and the patient is the patient component of the present application.
A statistical operator represents a statistical model for identifying named entities and/or dependency components
in text; in other words, the statistical model is expressed in the form of a statistical operator so that it can be
used inside an extraction expression. The statistical model here is a model that has already been trained on an
annotated training corpus, that is, a statistical model whose parameters have been determined.
The extraction expression has two parts: a region determination rule and an information extraction rule. The region
determination rule is used to determine the effective extraction region in the text. In one embodiment, the region
determination rule may include a pre-locating rule and a post-locating rule; the pre-locating rule determines a
start position in the text, and the post-locating rule determines an end position in the text. Once the start
position and the end position have been determined, the text between them is the effective extraction region. In
this case, as long as at least one of the pre-locating rule and the post-locating rule includes a statistical
operator, the locating rules are regarded as including a statistical operator. In another embodiment, the region
determination rule may include a center-locating rule, which determines a center position in the text; a preset
area is then extended forward and backward from the center position to determine the effective extraction region.
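The two ways of determining the effective extraction region can be sketched as follows. This is only an
illustration under simplifying assumptions: the locating rules are plain regular expressions (no statistical
operator yet), the region excludes the strings matched by the pre- and post-locating rules, the preset area of the
center-locating variant is taken to be a fixed number of characters on each side and includes the center match
itself, and the function names are hypothetical.

```python
import re
from typing import Optional


def region_by_pre_post(text: str, pre: Optional[str], post: Optional[str]) -> Optional[str]:
    """Text between the first match of `pre` and the next match of `post`.

    An empty or missing rule falls back to the start or end of the whole text.
    Returns None when a non-empty rule does not match.
    """
    start = 0
    if pre:
        m = re.search(pre, text)
        if m is None:
            return None
        start = m.end()          # region begins right after the pre-locating match
    end = len(text)
    if post:
        m = re.compile(post).search(text, start)
        if m is None:
            return None
        end = m.start()          # region stops right before the post-locating match
    return text[start:end]


def region_by_center(text: str, center: str, window: int) -> Optional[str]:
    """Center match extended by `window` characters on each side (assumed preset area)."""
    m = re.search(center, text)
    if m is None:
        return None
    return text[max(0, m.start() - window): m.end() + window]


region_a = region_by_pre_post("Name: Alice. Height: tall. End.", r"Name:", r"End")
region_b = region_by_center("abcdefXYZghijkl", r"XYZ", 3)
```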
A region determination rule that includes a statistical operator may consist of the statistical operator alone, or
of a statistical operator together with a regular expression. When both are present, they have a sequential
relationship and/or a logical operation relationship. For example, the pre-locating rule may take the form
"PD is very beautiful", where PD is the entity recognition operator, which represents the statistical model for
identifying named entities, and "is very beautiful" is a regular expression. In this example the statistical
operator and the regular expression have a sequential relationship, so the rule can match strings such as
"Wang Erni is very beautiful". As another example, the pre-locating rule may take the form
"PD (is very beautiful|is extremely beautiful)", where "is very beautiful" and "is extremely beautiful" are both
regular expressions joined by the logical operation "or", and PD and the group
(is very beautiful|is extremely beautiful) as a whole have a sequential relationship. In other words, a sequential
relationship and/or a logical operation relationship may exist between one statistical operator and another,
between one regular expression and another, or between a statistical operator and a regular expression.
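One way to make the sequential relationship explicit is to split a rule into an ordered list of operator parts and
regular-expression parts. The sketch below is an assumption about how such a split could be done, not the
application's own parser: it recognizes only the two operator names PD and DC used in this description, written
either bare or in parentheses, and the function name `split_rule` is hypothetical.

```python
import re
from typing import List, Tuple

# "(PD)" / "(DC)" in parentheses, or the bare operator name as a whole word.
_OPERATOR = re.compile(r"\((PD|DC)\)|\b(PD|DC)\b")


def split_rule(rule: str) -> List[Tuple[str, str]]:
    """Split a rule into ("operator", name) and ("regex", pattern) parts, in order."""
    parts: List[Tuple[str, str]] = []
    pos = 0
    for m in _OPERATOR.finditer(rule):
        if m.start() > pos:
            # Anything between two operators is kept as a regular-expression fragment.
            parts.append(("regex", rule[pos:m.start()]))
        parts.append(("operator", m.group(1) or m.group(2)))
        pos = m.end()
    if pos < len(rule):
        parts.append(("regex", rule[pos:]))
    return parts


parts_1 = split_rule("PD (is very beautiful|is extremely beautiful)")
parts_2 = split_rule("(DC).{0,10}strong")
```

The order of the returned parts carries the sequential relationship, while logical operations such as "or" stay
inside the regular-expression fragments and are handled by the regular-expression engine.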
The information extraction rule is used to extract, within the effective extraction region, the information the
user expects. In this embodiment, the information extraction rule may consist of a regular expression alone, or of
a regular expression together with a statistical operator. When both are present, they have a sequential
relationship and/or a logical operation relationship, in the same way as described above, which is not repeated
here.
In one embodiment, the information extraction rule may be placed between the pre-locating rule and the
post-locating rule, with adjacent rules separated by "@"; that is, the extraction expression is
"pre-locating rule@information extraction rule@post-locating rule". Either locating rule may be empty. When the
pre-locating rule is empty, the first character of the whole text is taken as the start position by default; when
the post-locating rule is empty, the last character of the whole text is taken as the end position by default. An
empty pre-locating or post-locating rule can be regarded as a special case of a locating rule that contains only a
regular expression.
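The "@"-separated layout can be parsed with a few lines of Python. The sketch below assumes that "@" never occurs
inside any of the three rules (the description does not say how such a character would be escaped); the class and
function names are hypothetical.

```python
from typing import NamedTuple, Optional


class ExtractionExpression(NamedTuple):
    pre_rule: Optional[str]    # None means: start at the first character of the text
    info_rule: str
    post_rule: Optional[str]   # None means: end at the last character of the text


def parse_expression(expression: str) -> ExtractionExpression:
    """Split "pre@info@post" into its three rules; empty locating rules become None."""
    parts = expression.split("@")
    if len(parts) != 3:
        raise ValueError("expected exactly two '@' separators")
    pre, info, post = parts
    if not info:
        raise ValueError("the information extraction rule must not be empty")
    return ExtractionExpression(pre or None, info, post or None)


expr_1 = parse_expression("PD@(of medium height|average build)@")
```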
Optionally, the statistical model includes a first model for identifying named entities and a second model for
identifying dependency components. The statistical operators include an entity recognition operator PD and a
dependency component operator DC; the entity recognition operator PD represents the first model, and the dependency
component operator DC represents the second model.
The first model may be an HMM, a CRF model or the like. In the training stage, the model is trained on an annotated
corpus and its main parameters are determined, which yields the trained first model. In the use stage, the text to
be processed is input into the statistical model, which outputs the named entities in that text. Similarly, the
second model may be trained as an HMM, an MEMM, a CRF model or the like; the annotated corpus used for training
differs from the one used for the first model, so the main parameters obtained by training also differ, which
yields a different, second model. For both the first and the second model, if annotated corpora from different
application scenarios are used as training corpora, the trained model parameters will differ, so that the trained
statistical model is better suited to processing text from a specific application scenario. For example, if the
training corpus consists entirely of annotated financial news, the trained statistical model is better suited to
processing financial news, that is, to identifying named entities or dependency components in financial news. The
statistical model may be trained by methods known in the prior art, which are not described here.
For example, extraction expression 1 is "PD@(of medium height|average build)@". In this example, the pre-locating
rule consists only of the statistical operator PD. When the pre-locating rule is matched against the text, the
corresponding statistical model is invoked to identify named entities in the text; if a named entity is
recognized, its position is determined as the start position. The post-locating rule is empty, so the last
character of the text is taken as the end position. The information extraction rule is
"(of medium height|average build)"; that is, if the effective extraction region contains "of medium height" or
"average build", that string is extracted.
S200: Identify named entities and/or dependency components in the text using the statistical model, and mark each
identified named entity and/or dependency component with a corresponding identification label.
In step S200, named entities and/or dependency components in the text are identified using the statistical model:
the text is the input data of the statistical model, and the identified named entities and/or dependency components
are its output data. This can be done by methods known in the prior art, which are not described here. Each named
entity and/or dependency component identified in the text is marked with one corresponding identification label.
When the extraction expression contains several kinds of statistical operators, the named entities or dependency
components identified by each statistical model can be marked with a different corresponding identification label.
For example, in one embodiment (see Table 1), the statistical operators may include an entity recognition operator
and a dependency component operator, denoted "PD" and "DC" respectively. The statistical model may include a first
model for identifying named entities and a second model for identifying dependency components; the entity
recognition operator represents the first model, and the dependency component operator represents the second model.
The identification label includes a first label and a second label: everything identified by the first model is a
named entity and is marked with the first label, and everything identified by the second model is a dependency
component and is marked with the second label.
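Table 1: Correspondence among multiple statistical operators, multiple statistical models and multiple
identification labels (example 1)

Statistical operator                  Statistical model                                  Identification label
PD (entity recognition operator)      First model (identifies named entities)            First label
DC (dependency component operator)    Second model (identifies dependency components)    Second label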
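The labeling of step S200 can be pictured as producing a list of labeled spans. The following sketch is
illustrative only: real trained models are replaced by toy look-up functions, and the names `IdentificationLabel`,
`label_text`, `find_all` and the sample sentence are all assumptions introduced here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]

# Operator -> label, following the correspondence described above.
LABEL_FOR_OPERATOR = {"PD": "first label", "DC": "second label"}


@dataclass(frozen=True)
class IdentificationLabel:
    start: int       # span start in the text
    end: int         # span end (exclusive)
    operator: str    # operator whose model produced the span
    label: str       # identification label marked on the span


def find_all(word: str) -> Callable[[str], List[Span]]:
    """Toy stand-in for a trained model: it 'recognizes' one fixed word."""
    return lambda text: [m.span() for m in re.finditer(re.escape(word), text)]


def label_text(text: str, models: Dict[str, Callable[[str], List[Span]]]) -> List[IdentificationLabel]:
    """Run every model on the text and mark each identified span with its label."""
    labels = []
    for operator, model in models.items():
        for start, end in model(text):
            labels.append(IdentificationLabel(start, end, operator, LABEL_FOR_OPERATOR[operator]))
    return sorted(labels, key=lambda lab: (lab.start, lab.end))


TEXT = "Wang Erzhu kept up daily exercise."
MODELS = {"PD": find_all("Wang Erzhu"), "DC": find_all("exercise")}
LABELS = label_text(TEXT, MODELS)
```

Keeping the operator name on each label makes the later comparison simple: a statistical operator in a rule
matches a span exactly when the span was produced by the model that operator represents.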
S300: Compare the region determination rule with the text using the identification labels, to determine the
effective extraction region in the text.
Because the region determination rule can take several concrete forms, the specific steps for determining the
effective extraction region differ slightly. For example, in one implementation, the region determination rule
includes a pre-locating rule and a post-locating rule, and in this embodiment at least one of them includes a
statistical operator. The start position is determined by matching the pre-locating rule against the text, and the
end position is determined by matching the post-locating rule against the text. Once the start position and the end
position have been determined, the text between them is the effective extraction region. As another example, in
another implementation, the region determination rule includes a center-locating rule that includes a statistical
operator; the center position is determined by matching the center-locating rule against the text, and a preset
area is then extended forward and backward from the center position to determine the effective extraction region.
Whichever form is used, when the pre-locating rule and/or post-locating rule, or the center-locating rule, is
compared with the text, the corresponding identification labels are needed to judge whether the rule matches the
text.
Specifically, if the pre-locating rule, post-locating rule or center-locating rule consists of a statistical
operator alone, then as long as the statistical model corresponding to an identification label and the statistical
model represented by the statistical operator are the same model, the character string marked with that
identification label (a named entity or a dependency component) is considered to match the statistical operator in
the rule. Because the rule consists of the statistical operator alone, the rule as a whole matches the character
string marked with that identification label. If the pre-locating rule, post-locating rule or center-locating rule
includes both a statistical operator and a regular expression, the rule must match the text as a whole. That is, in
addition to a character string in the text matching the statistical operator, the character strings before and
after it must match the regular expressions before and after the statistical operator in the rule; in other words,
the statistical operator, the regular expression, and the sequential relationship and/or logical operation
relationship between them must all be matched in the text.
For example, suppose the pre-locating rule is "PD (is very beautiful|is extremely beautiful)". The rule matches
strings in the text such as "Wang Erni is very beautiful" and "Li Ermei is extremely beautiful", but it cannot
match strings such as "Wang Erni is quite beautiful" or "Wang Erni's younger sister is very beautiful".
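The matching behavior in this example can be reproduced with a small sketch. It handles only rules of the form
"one statistical operator followed by one regular expression", and it replaces the trained first model with a
fixed list of names; both are assumptions made for illustration, as are the function names. The leading space in
`TAIL` is an artifact of writing the example in English.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def toy_entity_model(text: str) -> List[Span]:
    """Hypothetical stand-in for the trained first model: looks names up in a fixed list."""
    names = ("Wang Erni", "Li Ermei")
    spans = [m.span() for name in names for m in re.finditer(re.escape(name), text)]
    return sorted(spans)


def match_pd_then_regex(text: str, tail_regex: str) -> List[str]:
    """Strings that match a rule of the form PD followed by `tail_regex`."""
    tail = re.compile(tail_regex)
    found = []
    for start, end in toy_entity_model(text):   # spans marked with the first label
        m = tail.match(text, end)               # the regex must begin where the entity ends
        if m:
            found.append(text[start:m.end()])
    return found


TAIL = r" (is very beautiful|is extremely beautiful)"
```

The last two strings of the example fail for different reasons: in one the regular expression does not match, in
the other the text that matches the regular expression does not directly follow the identified named entity.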
S400: Extract, from the effective extraction region, a character string that matches the information extraction
rule.
In one implementation, the information extraction rule contains only a regular expression and no statistical
operator. In that case the regular expression is matched against the text in the effective extraction region, and
the character string that matches the information extraction rule, that is, the information the user expects to
extract from the text, is extracted.
An example follows.
Text 1 from which information is to be extracted:
Wang Erzhu used to have an average build; later he kept up daily exercise and is strong now.
Extraction expression 1: PD@(of medium height|average build)@
Using the statistical operator PD in the region determination rule, that is, the statistical model represented by
the entity recognition operator, "Wang Erzhu" in text 1 is identified as a named entity and is marked with an
identification label, namely the first label. The pre-locating rule "PD" in the region determination rule is then
compared with text 1. Because the statistical model corresponding to the first label marked on "Wang Erzhu" and the
statistical model represented by PD in the pre-locating rule are the same model, "Wang Erzhu" matches the
pre-locating rule "PD", and the position of "Wang Erzhu" in text 1 is determined as the start position. The
post-locating rule is empty, so the last character of text 1 is determined as the end position. Effective
extraction region 1 in text 1 is thus determined to be "used to have an average build; later he kept up daily
exercise and is strong now.".
In this example, suppose the post-locating rule of extraction expression 1 is replaced with "(DC).{0,10}strong",
where DC is the dependency component operator, whose statistical model can identify dependency components in the
text, and ".{0,10}strong" is a regular expression. If the text contains a dependency component and "strong" begins
no more than 10 characters after that dependency component, then the character string from the dependency
component through "strong" matches the post-locating rule. Applied to text 1, the statistical model represented by
DC identifies the dependency component "exercise" and marks it with the corresponding identification label, namely
the second label. Because the statistical model corresponding to the second label and the statistical model
represented by DC are the same, the string "exercise" marked with the second label matches "(DC)" in the
post-locating rule. "strong" is in turn matched within the 10 characters after "exercise", so the string
"exercise and is strong" matches the post-locating rule. Effective extraction region 2 in text 1 is thus
determined to be "used to have an average build; later he kept up daily".
Information extraction rules are " of medium height | build is general ", therefore can extract " body in effectively extracting region 1
Type is general " this character string.
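The region determination and extraction walked through above can be sketched in Python. This is a minimal illustration only: the lexicon lookup `PD_LEXICON` is a hypothetical stand-in for the statistical model characterized by PD, the helper names are invented for this sketch, and the English rendering of text 1 is used in place of the original-language text.

```python
import re

# Hypothetical stand-in for the statistical model characterized by PD:
# a fixed lexicon lookup instead of a trained named-entity recognizer.
PD_LEXICON = ["Wang Erzhu"]


def tag_entities(text):
    """Return sorted (start, end, operator) spans for every lexicon hit."""
    spans = []
    for name in PD_LEXICON:
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), "PD"))
    return sorted(spans)


def effective_region(text, spans, prefix_operator):
    """Start after the first span tagged by the prefix operator.

    The postfix locating rule is assumed to be empty, so the region runs
    to the last character of the text.
    """
    for _start, end, operator in spans:
        if operator == prefix_operator:
            return text[end:]
    return ""  # the prefix rule did not match anywhere


text1 = ("Wang Erzhu used to have an average build; later he kept up "
         "daily exercise, and now he is very strong.")
spans = tag_entities(text1)
region = effective_region(text1, spans, "PD")
# The information extraction rule here is a plain regular expression.
hit = re.search(r"medium height|average build", region)
extracted = hit.group(0) if hit else None
print(extracted)
```

A real system would replace the lexicon lookup with the trained model; the span list plays the role of the identification labels.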
Optionally, referring to FIG. 2, in another implementation of S400 the information extraction rule includes a statistical operator, as shown in step S101 of FIG. 2. The step of extracting, from the effective extraction region, the character string that matches the information extraction rule then specifically includes:
S410: using the identification label, extracting from the effective extraction region the character string that matches the information extraction rule.
Here, when the information extraction rule includes a statistical operator, the process of matching it against character strings in the effective extraction region is similar to the process, described for step S300, of matching the prefix locating rule and the like against the text. If the information extraction rule contains only a statistical operator, then as long as the statistical model corresponding to an identification label and the statistical model characterized by the statistical operator are the same model, the character string marked with that identification label (a named entity or a dependency component) is considered to match the statistical operator. Because the rule contains nothing else, that character string matches the rule and is extracted. If the information extraction rule contains both a statistical operator and a regular expression, the rule must match the text as a whole. That is, a character string in the text must match the statistical operator, and the character strings before and after it must also match the regular expressions before and after the statistical operator in the rule. Only when the statistical operator, the regular expression, and the sequential relationship and/or logical operation relationship between them can all be matched in the text is the matched character string extracted as the information extracted from the text.
Continuing with text 1 above ("Wang Erzhu used to have an average build; later he kept up daily exercise, and now he is very strong."), extraction expression 2 is: PD@(medium height|average build).{0,10}DC@.
From the prefix locating rule and the postfix locating rule, effective extraction region 1 is determined to be "used to have an average build; later he kept up daily exercise, and now he is very strong.".
The statistical model characterized by the dependency component operator DC identifies "exercise" in text 1 as a dependency component and marks it with the second label. "exercise" in the effective extraction region is replaced with the second label. Because the statistical model corresponding to the second label is the same as the one characterized by DC, the character string "exercise" marked with the second label matches "DC" in the information extraction rule. "average build" is matched within the 0 to 10 characters before "exercise" (counted in the original-language text), so the character string "average build; later he kept up daily exercise" matches the information extraction rule and is extracted from the effective extraction region.
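A rule that mixes a regular expression with a statistical operator can be sketched by expanding the operator into an alternation over the strings the model labelled. This is one possible realization and an assumption of this sketch, not necessarily how the label substitution is implemented. The `<DC>` placeholder syntax, the function name, and the tagged output are hypothetical, and the gap bound is widened because the English rendering uses more characters than the original text.

```python
import re


def compile_rule(rule, tagged):
    """Expand operator placeholders such as <DC> into regex alternations.

    tagged maps an operator name to the strings its model labelled.
    """
    pattern = rule
    for operator, strings in tagged.items():
        if strings:
            longest_first = sorted(strings, key=len, reverse=True)
            alternation = "(?:" + "|".join(re.escape(s) for s in longest_first) + ")"
        else:
            alternation = "(?!)"  # nothing was labelled, so never match
        pattern = pattern.replace("<" + operator + ">", alternation)
    return re.compile(pattern)


region = (" used to have an average build; later he kept up daily exercise, "
          "and now he is very strong.")
# Assumed output of the dependency model (operator DC) on this region.
tagged = {"DC": ["exercise"]}
# The gap bound is 30 instead of 10 because the English text is longer.
rule = compile_rule(r"(?:medium height|average build).{0,30}<DC>", tagged)
hit = rule.search(region)
extracted = hit.group(0) if hit else None
print(extracted)
```

Because the whole rule is compiled into one pattern, the sequential relationship between the regular expression and the operator is enforced by the regex engine itself.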
With the above method, rules are combined with statistics: statistical models can be called flexibly in the region determination rule and/or the information extraction rule, and can also be combined with regular expressions, yielding extraction expressions of richer form. Compared with plain regular expressions, extracting information with such an extraction expression widens the range of vocabulary that can be recognized, so the information the user needs is extracted more comprehensively while avoiding the large amount of manpower and time otherwise spent building rules. Compared with a purely statistical-model-based method, it extracts the information the user needs more accurately.
When an extraction expression contains several kinds of statistical operators, the statistical models characterized by these operators may each be run over the whole text once, and the content each identifies marked with the corresponding identification label. Alternatively, referring to FIG. 3 and FIG. 4, if the region determination rule includes only one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, and the information extraction rule includes the other, then step S200 includes:
S201: identifying the named entities in the text using the first model, and marking each identified named entity with the first label;
S202: identifying the dependency components in the effective extraction region using the second model, and marking each identified dependency component with the second label.
Alternatively, step S200 includes:
S203: identifying the dependency components in the text using the second model, and marking each identified dependency component with the second label;
S204: identifying the named entities in the effective extraction region using the first model, and marking each identified named entity with the first label.
When the information extraction rule contains a statistical operator that the region determination rule does not, it is unnecessary to run the statistical models characterized by all statistical operators in the extraction expression over the whole text. Instead, the statistical model characterized by the statistical operator in the region determination rule is first run over the text once. After the effective extraction region is determined, the statistical model characterized by the statistical operator that appears in the information extraction rule but not in the region determination rule is run once over the effective extraction region only. This shortens the text that some of the statistical models have to process, which speeds up recognition and, in turn, information extraction.
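The staged recognition described above (S201 followed by S202) can be sketched as follows. The `recognize` function is a hypothetical lexicon-based stand-in for a statistical model, and the region rule is simplified to "everything after the first named entity". The log records how many characters each model scanned, which makes the saving visible.

```python
import re


def recognize(model_name, lexicon, text, log):
    """Hypothetical stand-in for a statistical model.

    Finds lexicon entries in text and logs how many characters it scanned.
    """
    log.append((model_name, len(text)))
    return sorted((m.start(), m.end())
                  for word in lexicon
                  for m in re.finditer(re.escape(word), text))


def staged_recognition(text, entity_lexicon, dependency_lexicon):
    log = []
    # S201: the first model runs over the whole text.
    entities = recognize("first_model", entity_lexicon, text, log)
    if not entities:
        return [], log
    # S300 (simplified): the region starts after the first named entity.
    region = text[entities[0][1]:]
    # S202: the second model runs over the effective extraction region only.
    components = recognize("second_model", dependency_lexicon, region, log)
    return [region[s:e] for s, e in components], log


text1 = ("Wang Erzhu used to have an average build; later he kept up "
         "daily exercise, and now he is very strong.")
found, log = staged_recognition(text1, ["Wang Erzhu"], ["exercise"])
print(found, log)
```

In this sketch the second model scans ten characters fewer than the first; with a long document and a short effective extraction region the difference would be correspondingly larger.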
It should be noted that, in this implementation, steps S202 and S204 logically come after step S300. In this application, step numbers are used only for ease of description and do not limit the order of the steps in the method; as long as it is logically reasonable, the order in which the steps are executed may change.
Optionally, as shown in Table 2, the types of the first label may include a person-name label, a place-name label, and an organization label, and the types of the second label include a core component label, a dependent word label, an agent component label, and a patient component label.
In step S201 and/or step S204, the step of marking each identified named entity with the first label includes:
if the named entity identified using the first model is a person name, a place name, or an organization, marking the identified named entity with the corresponding person-name label, place-name label, or organization label.
In step S202 and/or step S203, the step of marking each identified dependency component with the second label includes:
if the dependency component identified using the second model is a core component, a dependent word, an agent component, or a patient component, marking the identified dependency component with the corresponding core component label, dependent word label, agent component label, or patient component label.
If the region determination rule includes a statistical operator, then referring to FIG. 5, step S300 may include:
S301: comparing the region determination rule with the text, wherein, if the specified label carried by the statistical operator in the region determination rule matches the type of the first label or the second label, the statistical operator matches the character string marked with that first label or second label; the specified label characterizes the type of named entity or the type of dependency component that the user expects to identify in the text;
S302: determining the effective extraction region according to the position where the region determination rule matches the text.
If the information extraction rule includes a statistical operator, then referring to FIG. 6, step S410 may include:
S411: comparing the information extraction rule with the effective extraction region, wherein, if the specified label carried by the statistical operator in the information extraction rule matches the type of the first label or the second label, the statistical operator matches the character string marked with that first label or second label;
S412: extracting the character string that matches the information extraction rule.
Table 2: Example 2 of the correspondence among multiple statistical operators, multiple statistical models, and multiple identification labels
For a statistical operator in either the region determination rule or the information extraction rule, the named entities or dependency components identified by the statistical model it characterizes can be further classified and marked with labels of different types. The specified label of a statistical operator in the extraction expression can then define more precisely the information the user expects to extract from the text, which reduces extraction errors and improves the accuracy of the extracted information.
For example, information extraction rule a is "PD.{0,2}very beautiful", and information extraction rule b is "(PD_PER).{0,2}very beautiful".
The effective extraction region of text 2 is: Henan Province is very beautiful, and Henan Province-born Wang Erni is also very beautiful. (Character intervals below are counted in the original-language text.)
When rule a is used for extraction, the first model characterized by PD identifies the named entities "Henan Province", "Henan Province", and "Wang Erni" in the effective extraction region, and the first label is marked on each of these character strings. Rule a is then matched against the effective extraction region of text 2. The statistical model corresponding to the first label on the first "Henan Province" is the first model, the same as the one characterized by PD, so the first "Henan Province" matches PD. "very beautiful" follows the first "Henan Province" at an interval of 0 characters, which matches the regular expression ".{0,2}very beautiful" in rule a. Therefore "Henan Province is very beautiful" matches rule a and can be extracted.
Similarly, the second "Henan Province" also matches PD. Although the character string after it contains "very beautiful", the interval between the two exceeds 2 characters, so there is no match.
Similarly, "Wang Erni" also matches PD, and the interval between it and the following "very beautiful" is 1 character, which matches the regular expression ".{0,2}very beautiful" in rule a. Therefore "Wang Erni is also very beautiful" matches rule a and can be extracted.
When rule b is used for extraction, the first model characterized by PD identifies the named entities "Henan Province", "Henan Province", and "Wang Erni" in the effective extraction region. Because both occurrences of "Henan Province" are place names, each is marked with a place-name label; "Wang Erni" is a person name and is marked with a person-name label.
Rule b is then matched against the effective extraction region of text 2. The statistical model corresponding to the place-name label on the first and second "Henan Province" is also the first model, the same as the one characterized by PD. However, the specified label carried by the statistical operator in rule b is "_PER", meaning the user expects named entities of the person-name type to be identified in the text, and this type does not match the place-name label. The statistical operator "(PD_PER)" in rule b therefore matches neither occurrence of "Henan Province".
For "Wang Erni", the label marked on it is the person-name label, which matches the statistical operator "(PD_PER)" in rule b. Moreover, "also very beautiful" following "Wang Erni" matches the regular expression ".{0,2}very beautiful" in rule b, so "Wang Erni is also very beautiful" matches rule b and can be extracted.
As can be seen, extraction with rule a yields two character strings from the effective extraction region, "Henan Province is very beautiful" and "Wang Erni is also very beautiful", while extraction with rule b yields only "Wang Erni is also very beautiful" from the same region.
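The difference between an untyped operator (PD) and an operator carrying a specified label (PD_PER) can be sketched as below. The typed lexicon `ENTITY_TYPES` is a hypothetical stand-in for the first model, the type code POS for place names is an assumption, and the gap bound is 10 rather than 2 because the English rendering of text 2 has more characters between the entity and "very beautiful".

```python
import re

# Hypothetical typed output of the first model: entity string -> label type.
# PER (person name) follows the source; POS for place names is assumed.
ENTITY_TYPES = {"Henan Province": "POS", "Wang Erni": "PER"}


def tag(text):
    """Return sorted (start, end, label_type) spans for known entities."""
    spans = []
    for name, label_type in ENTITY_TYPES.items():
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), label_type))
    return sorted(spans)


def extract(text, operator, tail_regex):
    """Match 'operator' followed immediately by tail_regex.

    'PD' accepts any entity; 'PD_<TYPE>' accepts only that label type.
    """
    wanted = operator.split("_", 1)[1] if "_" in operator else None
    tail = re.compile(tail_regex)
    results = []
    for start, end, label_type in tag(text):
        if wanted is not None and label_type != wanted:
            continue  # the specified label does not match this label type
        m = tail.match(text, end)  # the tail must begin right after the entity
        if m:
            results.append(text[start:m.end()])
    return results


region = ("Henan Province is very beautiful, and Henan Province-born "
          "Wang Erni is also very beautiful.")
tail = r".{0,10}very beautiful"
rule_a = extract(region, "PD", tail)
rule_b = extract(region, "PD_PER", tail)
print(rule_a)
print(rule_b)
```

As in the walkthrough, the untyped rule returns two strings and the typed rule returns only the one that begins with a person name.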
Optionally, either the region determination rule or the information extraction rule may include a business element concept and/or a generic concept.
In this application, a generic concept refers to the word-sense information of vocabulary in the text that is unrelated to any specific business, and to the semantic relevance between such vocabulary. A generic concept may represent a group of words or a sentence. It is a description of an object and reflects an abstract expression of the essential attributes of the object it describes, such as time, place, mood, or evaluation. Generic concepts can usually be reused across different fields and application scenarios. A generic concept may be denoted by "c".
For example, the generic concept "negation", written "c_negation", may represent words such as "not", "have not", and "never". That is, if a text contains any of "not", "have not", and "never", that word in the text is considered to match the generic concept "c_negation".
As another example, the generic concept "dissatisfied", written "c_dissatisfied", may be expressed as "[^not]{0,5}dissatisfied", where "not" stands for a single negation character in the original-language text. This expression means that, when matching text, any text in which "dissatisfied" is preceded by 0 to 5 characters is matched, for example "very dissatisfied", while sentences with the opposite meaning, such as "is not dissatisfied" and "does not count as dissatisfied", are excluded. Therefore, if a text contains 0 to 5 characters before "dissatisfied" and none of these characters is "not", the character string is considered to match the generic concept "c_dissatisfied".
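A generic concept that stands for a group of words, as in the "c_negation" example, can be sketched as a lookup table expanded into a regular expression. The table name, the function name, and the use of English word boundaries are assumptions of this sketch; the original-language text would be matched on characters rather than on space-delimited words.

```python
import re

# Hypothetical concept table; the word group follows the "c_negation" example.
GENERIC_CONCEPTS = {"c_negation": ["not", "have not", "never"]}


def concept_hits(text, concept):
    """Return every word of the concept's group that occurs in text."""
    # Longest alternatives first so that "have not" wins over "not".
    words = sorted(GENERIC_CONCEPTS[concept], key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.findall(text)


hits = concept_hits("She never complained and did not leave.", "c_negation")
no_hits = concept_hits("All good here.", "c_negation")
print(hits, no_hits)
```

A text matches the concept whenever the returned list is non-empty.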
A business element concept refers to the semantic information of vocabulary related to a specific business and the semantic relevance between such vocabulary. Like a generic concept, a business element concept may represent a group of words or a sentence. It is a description of a business-related object or its attributes, is usually tied to a particular field or business, and generally cannot be reused in other fields or application scenarios. A business element concept may be denoted by "e".
For example, in the bank card customer service field, the business element concept "fraudulent information", written "e_fraudulent information", may represent words such as "fraudulent SMS", "fraudulent message", "fraudulent call", and "fraudulent mail". If a text contains any of "fraudulent SMS", "fraudulent message", "fraudulent call", and "fraudulent mail", that word in the text matches the business element concept "e_fraudulent information".
A semantic model is a textual form of expression, oriented to known concepts and summarized exhaustively from sample data, that describes the semantics of those concepts. Organizing multiple business-independent generic concepts in a tree structure forms a concept tree, which can be understood as a semantic model. Organizing multiple business-related business element concepts in a tree structure forms an element tree, which can likewise be understood as a semantic model. Such a semantic model can be used to recognize text and determine whether the text contains a character string that matches a generic concept or business element concept in the model.
In the solution of this application, such generic concepts or business element concepts can also be introduced into the rules to form extraction expressions of richer form, so that information is extracted accurately and comprehensively.
In this embodiment the region determination rule includes a statistical operator. Therefore, when the region determination rule also includes a business element concept and/or a generic concept, then, as in the case where it also includes a regular expression, a sequential relationship and/or logical operation relationship exists between the statistical operator and the business element concept and/or generic concept. In addition, the region determination rule may include a statistical operator, a regular expression, a business element concept, and/or a generic concept all at once.
The information extraction rule in this embodiment may or may not include a statistical operator. It may include any one or more of a business element concept, a generic concept, a statistical operator, and a regular expression; one or several of them are selected and combined according to the application scenario so that information is extracted more accurately. When several are included, a sequential relationship and/or logical operation relationship exists between the business element concept/generic concept and the statistical operator or the regular expression.
A further example is given below.
Extraction expression 3: @(PD_PER|PD_POS)@c_commendatory evaluation
The generic concept (c) is:
c_commendatory evaluation: very beautiful, rich in produce, very clever.
Text 3: Although Wang Erni never had any formal schooling, her son Zhang Fei is very clever.
The prefix locating rule is empty, so the first character of text 3 is taken as the start position. The postfix locating rule is "c_commendatory evaluation", which matches "very clever" in text 3, and this matched position is taken as the end position. From the start and end positions, the effective extraction region in text 3 is determined to be "Although Wang Erni never had any formal schooling, her son Zhang Fei".
Running the first model characterized by PD over the effective extraction region recognizes "Wang Erni" and "Zhang Fei", and each is marked with a person-name label. Because the type of both matches the specified label "_PER" carried by PD in the information extraction rule, "Wang Erni" and "Zhang Fei" match the information extraction rule "(PD_PER|PD_POS)", and the two character strings "Wang Erni" and "Zhang Fei" can be extracted from text 3.
Optionally, the distance between the region determination rule and the information extraction rule may be limited.
For example, extraction expression 4: @(PD_PER|PD_POS)@{0,2}c_commendatory evaluation
Here, "{0,2}" in the region determination rule indicates that the distance between the extracted person name or place name and the position matched by the generic concept c_commendatory evaluation is 0 to 2 characters.
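Using a generic concept as the postfix locating rule, with and without a distance limit, can be sketched as follows. The concept table and the person lexicon (standing in for the output of PD_PER) are hypothetical, and the limit is set to 5 instead of 2 because the English rendering of text 3 places " is " between the name and the concept match.

```python
import re

# Hypothetical concept table and person lexicon (stand-in for PD_PER output).
CONCEPTS = {"c_commendatory evaluation":
            ["very beautiful", "rich in produce", "very clever"]}
PERSON_LEXICON = ["Wang Erni", "Zhang Fei"]


def extract_before_concept(text, concept, max_gap=None):
    """Extract person names located before the first match of the concept."""
    words = sorted(CONCEPTS[concept], key=len, reverse=True)
    hit = re.search("|".join(re.escape(w) for w in words), text)
    if hit is None:
        return []
    # Empty prefix locating rule: the region starts at the first character.
    region = text[:hit.start()]
    found = []
    for name in PERSON_LEXICON:
        for m in re.finditer(re.escape(name), region):
            gap = len(region) - m.end()  # characters up to the concept match
            if max_gap is None or gap <= max_gap:
                found.append((m.start(), name))
    return [name for _start, name in sorted(found)]


text3 = ("Although Wang Erni never had any formal schooling, "
         "her son Zhang Fei is very clever.")
unbounded = extract_before_concept(text3, "c_commendatory evaluation")
bounded = extract_before_concept(text3, "c_commendatory evaluation", max_gap=5)
print(unbounded, bounded)
```

Without a limit both names are returned; with the limit only the name adjacent to the concept match remains.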
Continuing with text 3 above, "Wang Erni" matches the information extraction rule, but its distance from "very clever" exceeds 2 characters, whereas the distance between "Zhang Fei" and "very clever" is 0 characters (counted in the original-language text). Therefore only the character string "Zhang Fei" is extracted from text 3.
Referring to FIG. 7, a second embodiment provides an information extraction method that includes the following steps S500 to S800.
S500: obtaining a text of information to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the information extraction rule includes a statistical operator, and the statistical operator characterizes a statistical model for identifying named entities and/or dependency components in the text.
S600: identifying the named entities and/or dependency components in the text using the statistical model, and marking each identified named entity and/or dependency component with the corresponding identification label.
S700: determining the effective extraction region in the text using the region determination rule.
S800: using the identification label, extracting from the effective extraction region the character string that matches the information extraction rule.
For the text of information to be extracted, named entities, dependency components, statistical operators, the extraction expression, and so on in step S500, refer to the description of step S100 in the first embodiment; details are not repeated here. This step differs from step S100 in that, in the extraction expression obtained in this step, the information extraction rule contains a statistical operator, while the region determination rule may or may not contain one.
For step S600, refer to the description of step S200 in the first embodiment; details are not repeated here.
If the region determination rule does not include any statistical operator, step S600 should logically come after step S700 and may specifically include: identifying the named entities and/or dependency components in the effective extraction region using the statistical model, and marking each identified named entity and/or dependency component with the corresponding identification label.
In step S700, if the region determination rule does not include a statistical operator, the effective extraction region can be determined from the rule directly with prior-art methods such as regular expression matching. If the region determination rule also includes a statistical operator, step S700 may specifically include: comparing the region determination rule with the text using the identification label, and determining the effective extraction region in the text.
For details of this step, refer to the description of step S300 in the first embodiment; details are not repeated here.
For step S800, refer to the description of S410 in the first embodiment for the case where the information extraction rule includes a statistical operator; details are not repeated here.
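The order of steps in the second embodiment, for the case where the region determination rule is a plain regular expression, can be sketched as follows: the region is located first (S700), and the statistical model is then run on that region only (S600), before extraction (S800). The person lexicon is a hypothetical stand-in for the first model, and the two regular expressions used as locating rules are chosen only for illustration.

```python
import re

# Hypothetical stand-in for the first model (person names only).
PERSON_LEXICON = ["Wang Erni", "Zhang Fei"]


def region_by_regex(text, prefix_re, postfix_re):
    """S700 with plain regular expressions; an empty rule means text boundary."""
    start = 0
    if prefix_re:
        m = re.search(prefix_re, text)
        if m is None:
            return ""
        start = m.end()
    end = len(text)
    if postfix_re:
        m = re.compile(postfix_re).search(text, start)
        if m is None:
            return ""
        end = m.start()
    return text[start:end]


def recognize_persons(region):
    """S600 executed after S700: the model sees only the region."""
    hits = [(m.start(), name)
            for name in PERSON_LEXICON
            for m in re.finditer(re.escape(name), region)]
    return [name for _start, name in sorted(hits)]


text3 = ("Although Wang Erni never had any formal schooling, "
         "her son Zhang Fei is very clever.")
region = region_by_regex(text3, r"her son", r"very clever")
persons = recognize_persons(region)  # S800: every tagged person is extracted
print(repr(region), persons)
```

Because recognition is deferred until the region is known, the name outside the region is never scanned by the model.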
As in the first embodiment, if the information extraction rule includes one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, and the region determination rule includes only the other, then step S600 includes:
S601: identifying the named entities in the text using the first model, and marking each identified named entity with the first label;
S602: identifying the dependency components in the effective extraction region using the second model, and marking each identified dependency component with the second label.
Alternatively, step S600 includes:
S603: identifying the dependency components in the text using the second model, and marking each identified dependency component with the second label;
S604: identifying the named entities in the effective extraction region using the first model, and marking each identified named entity with the first label.
In this way, it is unnecessary to run the statistical models characterized by all statistical operators in the extraction expression over the whole text. Instead, the statistical model characterized by the statistical operator in the region determination rule is first run over the text once. After the effective extraction region is determined, the statistical model characterized by the statistical operator that appears in the information extraction rule but not in the region determination rule is run once over the effective extraction region only. This shortens the text that some of the statistical models have to process, which speeds up recognition and, in turn, information extraction.
Optionally, as in the first embodiment, the types of the first label may include a person-name label, a place-name label, and an organization label, and the types of the second label include a core component label, a dependent word label, an agent component label, and a patient component label. Either the region determination rule or the information extraction rule may include a regular expression, a business element concept, and/or a generic concept. When the region determination rule includes one or more of a regular expression, a business element concept, a generic concept, and a statistical operator, the different regular expressions, business element concepts, generic concepts, and/or statistical operators can be combined, that is, a sequential relationship and/or logical operation relationship exists between them. For details, refer to the description in the first embodiment; details are not repeated here.
A third embodiment of this application provides an information extraction apparatus corresponding to the foregoing information extraction method. Referring to FIG. 8, in a first implementation the apparatus includes:
a first acquisition unit 1, configured to obtain a text of information to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the region determination rule includes a statistical operator, and the statistical operator characterizes a statistical model for identifying named entities and/or dependency components in the text;
a first processing unit 2, configured to: identify the named entities and/or dependency components in the text using the statistical model, and mark each identified named entity and/or dependency component with the corresponding identification label; compare the region determination rule with the text using the identification label, and determine the effective extraction region in the text; and extract, from the effective extraction region, the character string that matches the information extraction rule.
Optionally, the first processing unit 2 is specifically configured to, when the information extraction rule includes a statistical operator, use the identification label to extract from the effective extraction region the character string that matches the information extraction rule.
Optionally, the statistical model includes a first model for identifying named entities and a second model for identifying dependency components, and the identification label includes a first label and a second label. The first processing unit 2 is further configured to, when the information extraction rule includes one of the statistical operator characterizing the first model and the statistical operator characterizing the second model and the region determination rule includes only the other: identify the named entities (or dependency components) in the text using the first model (or the second model), and mark each identified named entity (or dependency component) with the first label (or the second label); and identify the dependency components (or named entities) in the effective extraction region using the second model (or the first model), and mark each identified dependency component (or named entity) with the second label (or the first label).
Optionally, the types of the first label include a person-name label, a place-name label, and an organization label, and the types of the second label include a core component label, a dependent word label, an agent component label, and a patient component label.
The first processing unit 2 is further configured to: when the named entity identified using the first model is a person name, a place name, or an organization, mark the identified named entity with the corresponding person-name label, place-name label, or organization label; when the dependency component identified using the second model is a core component, a dependent word, an agent component, or a patient component, mark the identified dependency component with the corresponding core component label, dependent word label, agent component label, or patient component label; compare the region determination rule with the text; and determine the effective extraction region according to the position where the region determination rule matches the text. If the specified label carried by the statistical operator in the region determination rule matches the type of the first label or the second label, the statistical operator matches the character string marked with that first label or second label. The specified label characterizes the type of named entity or the type of dependency component that the user expects to identify in the text.
Optionally, the first processing unit 2 is further configured to compare the information extraction rule with the effective extraction region and to extract the character string that matches the information extraction rule. If the specified label carried by the statistical operator in the information extraction rule matches the type of the first label or the second label, the statistical operator matches the character string marked with that first label or second label.
Optionally, the information extraction rule or the region determination rule further includes a
regular expression, and the statistical operator and the regular expression have a sequential
relationship and/or a logical operation relationship. The region determination rule or the
information extraction rule may further include a business element concept/general concept, and
the business element concept/general concept has a sequential relationship and/or a logical
operation relationship with the statistical operator or with the regular expression.
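A sequential relationship between a statistical operator and a regular expression can be illustrated with the sketch below. The rule encoding (a list of `("label", name)` and `("regex", pattern)` elements), the function names and the sample labels are hypothetical; logical operation relationships and business element concepts are not covered by this example.

```python
import re

def match_sequence(text, spans, rule, pos):
    """Match rule elements one after another starting at pos; return the end offset or None.

    spans: list of (start, end, label) tuples, assumed to come from the statistical model.
    rule:  list of ("label", label_name) or ("regex", pattern) elements in sequence.
    """
    for kind, value in rule:
        if kind == "label":
            # The operator matches a labeled string that begins exactly at pos.
            ends = [e for (s, e, lab) in spans if s == pos and lab == value]
            if not ends:
                return None
            pos = max(ends)
        else:
            m = re.compile(value).match(text, pos)
            if m is None:
                return None
            pos = m.end()
    return pos

def find_all(text, spans, rule):
    """Try the rule at every offset and collect the non-empty matched strings."""
    results = []
    for start in range(len(text)):
        end = match_sequence(text, spans, rule, start)
        if end is not None and end > start:
            results.append(text[start:end])
    return results

text = "Li Si was born in 1990 in Shanghai"
spans = [(0, 5, "PERSON"), (26, 34, "PLACE")]
rule = [("label", "PERSON"), ("regex", r" was born in \d{4}")]
print(find_all(text, spans, rule))  # ['Li Si was born in 1990']
```

The rule reads as "a string labeled as a person name, followed immediately by text matching the regular expression", which is one way to read the sequential relationship.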
Referring to FIG. 9, in a second implementation, the information extraction apparatus includes:
a second acquisition unit 3, configured to obtain the text from which information is to be
extracted and an extraction expression, where the extraction expression includes a region
determination rule and an information extraction rule, the information extraction rule includes a
statistical operator, and the statistical operator characterizes a statistical model for
identifying named entities and/or dependency components in text; and
a second processing unit 4, configured to identify the named entities and/or dependency components
in the text using the statistical model, and mark each identified named entity and/or dependency
component with a corresponding identification label; determine the effective extraction region in
the text using the region determination rule; and, using the identification labels, extract from
the effective extraction region the character string that matches the information extraction rule.
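The division of work in this second implementation, where the statistical operator appears in the information extraction rule, can be sketched as below. The sketch assumes, purely for illustration, that the region determination rule is a plain regular expression and that the information extraction rule is a single specified label; the function name, the label `PERSON` and the sample text are hypothetical.

```python
import re

def extract(text, spans, region_pattern, specified_label):
    """Return labeled strings of the specified type that lie inside an effective region.

    spans: list of (start, end, label) tuples, assumed to come from the statistical model.
    region_pattern: regular expression standing in for the region determination rule.
    """
    extracted = []
    for region in re.finditer(region_pattern, text):
        for start, end, label in spans:
            inside = region.start() <= start and end <= region.end()
            if label == specified_label and inside:
                extracted.append(text[start:end])
    return extracted

text = "Plaintiff: Wang Wu. Defendant: Zhao Liu."
names = ["Wang Wu", "Zhao Liu"]
spans = [(text.index(n), text.index(n) + len(n), "PERSON") for n in names]
print(extract(text, spans, r"Defendant:[^.]*", "PERSON"))  # ['Zhao Liu']
```

Both names carry the same label, but only the one inside the region selected by the region determination rule is extracted.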
For details of the second processing unit 4, reference may be made to the first implementation;
they are not repeated here. The information extraction apparatus described above corresponds to
the information extraction method in the first embodiment and the second embodiment and has the
beneficial effects corresponding to that method, which are likewise not repeated here.
For the same or similar parts of the embodiments in this specification, the embodiments may be
referred to each other. The embodiments of the invention described above do not limit the scope of
protection of the present invention.