CN108647194A - information extraction method and device - Google Patents


Publication number
CN108647194A
Authority
CN
China
Prior art keywords
label
rule
text
region
ingredient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810401030.1A
Other languages
Chinese (zh)
Other versions
CN108647194B (en
Inventor
李德彦
晋耀红
吴相博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenzhou Taiyue Software Co Ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201810401030.1A
Publication of CN108647194A
Application granted
Publication of CN108647194B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

Embodiments of the present invention disclose an information extraction method and device. The method includes: obtaining a text from which information is to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the region determination rule contains a statistical operator, and the statistical operator represents a statistical model used to identify named entities and/or dependency components in text; identifying the named entities and/or dependency components in the text using the statistical model, and marking each identified named entity and/or dependency component with a corresponding identification label; comparing the region determination rule with the text using the identification labels to determine an effective extraction region in the text; and extracting, from the effective extraction region, the character strings that match the information extraction rule. The method invokes the statistical model in the form of a rule, which is convenient and flexible. It also widens the range of vocabulary that can be recognized and reduces the effort of building rules, so that the information the user needs is extracted more accurately.

Description

Information extraction method and device
Technical field
The present invention relates to the field of text processing and information extraction, and in particular to an information extraction method. The present invention further relates to an information extraction device.
Background
Information extraction is a text-processing technique that extracts factual information of specified types, such as entities, relations, and events, from natural-language text and outputs it as structured data. It can serve as a preliminary processing step for tasks such as intelligent question answering, deep mining of semantic information, and extraction of normalized information.
Information extraction mainly relies on rule-based methods, which generally involve two stages: building regular expressions, and applying them to obtain the information the user needs. Regular expressions are built by modeling personnel according to the extraction requirements and their own experience. Multiple regular expressions organized in a specified pattern are called a rule model. Matching the regular expressions in a rule model against a text extracts the required information from that text.
A good rule model can reach a high standard of accuracy and precision, but building it requires professional modeling personnel, who must also exhaustively enumerate the text elements to be matched, at great cost in manpower and time. For example, if place names are to be used as a building element of a regular expression, place names must be identified accurately in the text with as few omissions as possible, which requires the modeling personnel to enumerate all place names one by one. Building a rule model for information extraction therefore costs substantial manpower and time, which is a problem that those skilled in the art urgently need to solve.
Summary of the invention
To solve the above technical problem, the present application provides an information extraction method that reduces the manpower and time spent on building rules and extracts information from text more comprehensively and accurately.
In a first aspect, an information extraction method is provided, including: obtaining a text from which information is to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the region determination rule contains a statistical operator, and the statistical operator represents a statistical model used to identify named entities and/or dependency components in text;
identifying the named entities and/or dependency components in the text using the statistical model, and marking each identified named entity and/or dependency component with a corresponding identification label;
comparing the region determination rule with the text using the identification labels to determine an effective extraction region in the text; and
extracting, from the effective extraction region, the character strings that match the information extraction rule.
With reference to the first aspect, in a first possible implementation of the first aspect, the information extraction rule contains a statistical operator;
the step of extracting, from the effective extraction region, the character strings that match the information extraction rule specifically includes:
extracting, using the identification labels, the character strings that match the information extraction rule from the effective extraction region.
With reference to the first implementation of the first aspect, in a second possible implementation of the first aspect, the statistical model includes a first model for identifying named entities and a second model for identifying dependency components, and the identification labels include a first label and a second label;
if the region determination rule contains only one of the statistical operator representing the first model and the statistical operator representing the second model, and the information extraction rule contains the other of the two, then the step of identifying the named entities and/or dependency components in the text using the statistical model and marking each identified named entity and/or dependency component with a corresponding identification label specifically includes:
identifying the named entities (or dependency components) in the text using the first model (or the second model), and marking each identified named entity (or dependency component) with the first label (or the second label); and
identifying the dependency components (or named entities) in the effective extraction region using the second model (or the first model), and marking each identified dependency component (or named entity) with the second label (or the first label).
With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, the types of the first label include a person-name label, a place-name label, and an organization label, and the types of the second label include a core-component label, a dependent-word label, an agent-component label, and a patient-component label;
the step of marking each identified named entity with the first label includes:
if the named entity identified by the first model is a person name, a place name, or an organization, marking the identified named entity with the corresponding person-name label, place-name label, or organization label;
the step of marking each identified dependency component with the second label includes:
if the dependency component identified by the second model is a core component, a dependent word, an agent component, or a patient component, marking the identified dependency component with the corresponding core-component label, dependent-word label, agent-component label, or patient-component label;
the step of comparing the region determination rule with the text using the identification labels to determine the effective extraction region in the text includes:
comparing the region determination rule with the text, where, if the specified label carried by a statistical operator in the region determination rule matches the type of a first label or second label, the statistical operator matches the character string marked with that first label or second label, the specified label indicating the type of named entity or dependency component that the user expects to identify in the text; and
determining the effective extraction region according to the position at which the region determination rule matches the text.
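The type check described in this implementation can be sketched as follows. This is only an illustration: the `LabelledSpan` structure, the type names such as "person" and "place", and the function name are assumptions made for the example and do not come from the application.

```python
from typing import List, NamedTuple


class LabelledSpan(NamedTuple):
    start: int       # character offset where the labelled string begins
    end: int         # offset one past its last character
    label_type: str  # hypothetical type name, e.g. "person", "place", "organization"


def spans_for_operator(spans: List[LabelledSpan], specified_type: str) -> List[LabelledSpan]:
    """Keep only the labelled strings whose label type equals the operator's specified label."""
    return [s for s in spans if s.label_type == specified_type]


text = "Wang Erni moved to Beijing."
# Labels as a statistical model might have produced them (hand-written here).
labelled = [LabelledSpan(0, 9, "person"), LabelledSpan(19, 26, "place")]
for span in spans_for_operator(labelled, "place"):
    print(text[span.start:span.end])
```

A statistical operator that specifies a place-name label would therefore match only the second span, and the position of that span could then be used to fix the effective extraction region.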
With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of extracting, using the identification labels, the character strings that match the information extraction rule from the effective extraction region includes:
comparing the information extraction rule with the effective extraction region, where, if the specified label carried by a statistical operator in the information extraction rule matches the type of a first label or second label, the statistical operator matches the character string marked with that first label or second label; and
extracting the character strings that match the information extraction rule.
With reference to the first aspect and the above possible implementations, in a fifth possible implementation of the first aspect, the region determination rule further includes a regular expression, where a sequential relationship and/or a logical-operation relationship exists between the statistical operator and the regular expression.
With reference to the first aspect and the above possible implementations, in a sixth possible implementation of the first aspect, the region determination rule or the information extraction rule further includes a business-element concept or a generic concept, and a sequential relationship and/or a logical-operation relationship exists between the business-element concept or generic concept and the statistical operator or the regular expression.
In a second aspect, an information extraction method is provided, including:
obtaining a text from which information is to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the information extraction rule contains a statistical operator, and the statistical operator represents a statistical model used to identify named entities and/or dependency components in text;
identifying the named entities and/or dependency components in the text using the statistical model, and marking each identified named entity and/or dependency component with a corresponding identification label;
determining an effective extraction region in the text using the region determination rule; and
extracting, using the identification labels, the character strings that match the information extraction rule from the effective extraction region.
In a third aspect, an information extraction device is provided, including:
a first acquisition unit, configured to obtain a text from which information is to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the region determination rule contains a statistical operator, and the statistical operator represents a statistical model used to identify named entities and/or dependency components in text; and
a first processing unit, configured to identify the named entities and/or dependency components in the text using the statistical model and mark each identified named entity and/or dependency component with a corresponding identification label; to compare the region determination rule with the text using the identification labels and determine an effective extraction region in the text; and to extract, from the effective extraction region, the character strings that match the information extraction rule.
In a fourth aspect, an information extraction device is provided, including:
a second acquisition unit, configured to obtain a text from which information is to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the information extraction rule contains a statistical operator, and the statistical operator represents a statistical model used to identify named entities and/or dependency components in text; and
a second processing unit, configured to identify the named entities and/or dependency components in the text using the statistical model and mark each identified named entity and/or dependency component with a corresponding identification label; to determine an effective extraction region in the text using the region determination rule; and to extract, using the identification labels, the character strings that match the information extraction rule from the effective extraction region.
In the information extraction method of the present application, a text from which information is to be extracted and an extraction expression are first obtained. The extraction expression includes a region determination rule and an information extraction rule, and the region determination rule and/or the information extraction rule contains a statistical operator; that is, a statistical model for identifying named entities and/or dependency components is defined as a statistical operator and introduced into the expression, yielding the extraction expression. The named entities and/or dependency components in the text are then identified using the statistical model, and each is marked with a corresponding identification label. Next, either the region determination rule is compared with the text using the identification labels to determine the effective extraction region, and the character strings matching the information extraction rule are extracted from that region; or the effective extraction region is determined using the region determination rule, and the character strings matching the information extraction rule are extracted from that region using the identification labels. In this way, the statistical model for identifying named entities and/or dependency components is invoked in the form of a rule and takes part in matching the extraction expression against the text, which is convenient and flexible. Compared with an approach based purely on regular expressions, the range of recognizable vocabulary is widened, the information the user needs is extracted more comprehensively, and the manpower and time spent on building regular expressions are reduced. Compared with an approach based purely on statistical models, the information the user needs is extracted more accurately.
Brief description of the drawings
To explain the technical solutions of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. A person of ordinary skill in the art can evidently derive other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of the first embodiment of the information extraction method of the present application;
Fig. 2 is a flowchart of a first specific implementation of the first embodiment of the information extraction method of the present application;
Fig. 3 is a flowchart of a second specific implementation of the first embodiment of the information extraction method of the present application;
Fig. 4 is a flowchart of a third specific implementation of the first embodiment of the information extraction method of the present application;
Fig. 5 is a flowchart of one implementation of step S300 in the first embodiment of the information extraction method of the present application;
Fig. 6 is a flowchart of one implementation of step S410 in the first embodiment of the information extraction method of the present application;
Fig. 7 is a flowchart of the second embodiment of the information extraction method of the present application;
Fig. 8 is a structural schematic diagram of a first specific implementation of the information extraction device of the present application;
Fig. 9 is a structural schematic diagram of a second specific implementation of the information extraction device of the present application.
Detailed description
The embodiments of the present application are described in detail below.
In a rule-based extraction method, a regular expression contains an information extraction rule, which is used to extract from the text the information the user expects. For example, when the information extraction rule "of medium height | average build" is matched against a text, any occurrence of "of medium height" or "average build" in the text is extracted as information describing build. To extract information more comprehensively, modeling personnel must enumerate every possible form of expression one by one when building the regular expressions, which costs a great deal of manpower and time.
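The purely rule-based matching described here can be illustrated with a minimal sketch using Python's standard `re` module. The English wording of the two alternatives and the function name `extract` are choices made for this illustration only.

```python
import re

# Information extraction rule from the text: two alternatives joined by "|".
RULE = re.compile(r"of medium height|average build")


def extract(text: str) -> list:
    """Return every substring of `text` matched by the rule, in order."""
    return RULE.findall(text)


print(extract("The witness described a man of medium height."))
```

Any phrasing that is not listed as an alternative, such as "neither tall nor short", is missed, which is why a purely rule-based model needs exhaustive enumeration.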
Besides rule-based extraction, information can also be extracted with statistics-based methods. A statistical model, such as a hidden Markov model (HMM), a maximum-entropy Markov model (MEMM), a conditional random field (CRF), or a support vector machine (SVM), is first trained on a corpus in which the information the user wishes to extract has been annotated, and the trained model is then used to extract information. Statistics-based extraction does not require professional modeling personnel to build regular expressions, which saves manpower and time. Overall, however, it is less accurate and precise than rule-based extraction. There are two main reasons: first, a training corpus that is not comprehensive enough affects the accuracy of the statistical model; second, when the user's extraction requirement is more complex than simply extracting the named entities and similar items that statistical models handle well, the accuracy of statistics-based extraction in practice also suffers.
For this reason, the present application proposes a new information extraction method in which a statistical model for identifying named entities and/or dependency components is defined as a statistical operator and introduced into a regular expression, yielding an extraction expression. The statistical model is thus invoked in the form of a rule and takes part in matching against the text, which is convenient and flexible. Compared with processing text using regular expressions alone, the extraction expression widens the range of recognizable vocabulary, extracts the information the user needs more comprehensively, and reduces the manpower and time spent on building rules. Compared with a method based on statistical models alone, it extracts the information the user needs more accurately.
The extraction expression in the present application has two parts: a region determination rule and an information extraction rule. The statistical operator representing the statistical model may be introduced into the region determination rule only, into the information extraction rule only, or into both. To describe these three cases clearly, two embodiments are presented below: in the first embodiment, the region determination rule contains a statistical operator, and the information extraction rule may or may not contain one; in the second embodiment, the information extraction rule contains a statistical operator, and the region determination rule may or may not contain one.
Referring to Fig. 1, the first embodiment provides an information extraction method including the following steps S100 to S400.
S100: Obtain a text from which information is to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the region determination rule contains a statistical operator, and the statistical operator represents a statistical model used to identify named entities and/or dependency components in text.
In the present application, the text from which information is to be extracted may come from the Internet or from a specific database, among other sources; the present application does not limit the source or form of the text.
A named entity is a person name, an organization name, a place name, or any other entity identified by a name; in a broader sense, entities may also include numbers, dates, currencies, addresses, and so on.
Dependency components are the syntactic components of a sentence, such as core components, dependent words, agent components, and patient components. Within a sentence, words are linked by directed relations of governing and being governed: the word in the governing position is the governor, which is the core component in the present application, and the word in the governed position is the dependent, which is the dependent word in the present application. In general, the verb is the center of a sentence and governs its other components; that is, those components are subordinate to the verb through various dependency relations, and the relation is one-directional. Besides syntactic components such as core components and dependent words, the semantic roles attached to the predicate (a verb or noun) of a sentence, such as agent and patient, can also be analyzed. Each semantic role carries a specific meaning: the agent of a sentence is the agent component in the present application, and the patient is the patient component.
The statistical operator represents the statistical model used to identify named entities and/or dependency components in text; that is, the statistical model is expressed in the form of a statistical operator so that it can be used in an extraction expression. The statistical model here is a model that has already been trained on an annotated corpus, that is, a model whose parameters have been determined.
The extraction expression has two parts: a region determination rule and an information extraction rule. The region determination rule is used to determine the effective extraction region in the text. In one implementation, the region determination rule may include a front locating rule and a rear locating rule: the front locating rule determines a start position in the text, and the rear locating rule determines an end position. Once both positions are determined, the text between them is the effective extraction region. In this case, the region determination rule is considered to contain a statistical operator if at least one of the front and rear locating rules contains one. In another implementation, the region determination rule may include a center locating rule, which determines a center position in the text; the effective extraction region is then obtained by extending a preset range from the center position into the surrounding context.
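The two ways of fixing the effective extraction region can be sketched as follows. For simplicity the locating rules here are plain regular expressions with no statistical operator. The function names, the choice to exclude the locator matches from the region, and the character-based `window` parameter are all assumptions of this sketch, since the text does not specify them.

```python
import re
from typing import Optional, Tuple


def region_between(text: str, front: str, rear: str) -> Optional[Tuple[int, int]]:
    """Region bounded by a start-locating rule and an end-locating rule."""
    m_front = re.search(front, text)
    if m_front is None:
        return None
    # Look for the end locator only after the start locator.
    m_rear = re.compile(rear).search(text, m_front.end())
    if m_rear is None:
        return None
    # Assumption: the region excludes the locator matches themselves.
    return m_front.end(), m_rear.start()


def region_around(text: str, center: str, window: int) -> Optional[Tuple[int, int]]:
    """Region obtained by extending `window` characters on both sides of a center match."""
    m = re.search(center, text)
    if m is None:
        return None
    return max(0, m.start() - window), min(len(text), m.end() + window)


doc = "Name: Wang Erni. Build: average build. End of record."
bounds = region_between(doc, r"Name:", r"End")
if bounds is not None:
    print(doc[bounds[0]:bounds[1]])
```

Both functions return character offsets rather than substrings, so a later step can run the information extraction rule on exactly that slice of the text.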
A region determination rule that contains a statistical operator may consist of the statistical operator alone, or of a statistical operator together with a regular expression. In the latter case, a sequential relationship and/or a logical-operation relationship exists between them. For example, the front locating rule may take the form "PD very beautiful", where PD is the entity recognition operator, representing the statistical model used to identify named entities, and "very beautiful" is a regular expression. Here the statistical operator and the regular expression are in a sequential relationship, so the rule matches character strings such as "Wang Erni is very beautiful". As another example, the front locating rule may take the form "PD (very beautiful | very beautiful)", where the two alternatives (two distinct expressions in the original language, both rendered here as "very beautiful") are regular expressions joined by the logical operation OR, and PD is in a sequential relationship with the parenthesized group as a whole. In general, a sequential relationship and/or a logical-operation relationship may exist between two statistical operators, between two regular expressions, or between a statistical operator and a regular expression.
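The sequential relationship between a statistical operator and a regular expression can be sketched as below. The named-entity spans are supplied by hand as a stand-in for the output of a trained model, and the function name and the decision to allow whitespace between the entity and the regular expression are assumptions of this sketch. The English pattern includes the copula "is" only because the example sentence is rendered in English.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def match_pd_then_regex(text: str, entity_spans: List[Span], pattern: str) -> List[Span]:
    """Match 'PD <pattern>': a recognised entity followed directly by the regular expression."""
    # Assumption: optional whitespace may separate the entity from the pattern.
    tail = re.compile(r"\s*(?:" + pattern + r")")
    hits = []
    for start, end in entity_spans:
        m = tail.match(text, end)  # the pattern must begin where the entity ends
        if m:
            hits.append((start, m.end()))
    return hits


sentence = "Wang Erni is very beautiful, and Li Si is tall."
li_si = sentence.index("Li Si")
entities = [(0, 9), (li_si, li_si + 5)]  # hand-written stand-in for model output
for s, e in match_pd_then_regex(sentence, entities, r"is very beautiful"):
    print(sentence[s:e])
```

A logical OR between alternatives needs no extra machinery in this sketch, because it can be written inside `pattern` with the usual `|` syntax.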
The information extraction rule is used to extract, from the effective extraction region, the information the user expects. In this embodiment, the information extraction rule may contain only a regular expression, or a regular expression together with a statistical operator. In the latter case, a sequential relationship and/or a logical-operation relationship exists between them, as described above, and the details are not repeated here.
In one implementation, the information extraction rule is placed between the front locating rule and the rear locating rule, with "@" separating adjacent parts; that is, the extraction expression has the form "front locating rule@information extraction rule@rear locating rule". Either the front locating rule or the rear locating rule may be empty. When the front locating rule is empty, the first character of the whole text is taken as the start position by default; when the rear locating rule is empty, the last character of the whole text is taken as the end position by default. An empty front or rear locating rule can be regarded as a special case of a locating rule that contains only a regular expression.
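Splitting an extraction expression at the "@" separators and applying the defaults for empty locating rules might look like the sketch below. The class and function names are hypothetical, and the sketch assumes that "@" never occurs inside a rule, which the text does not address.

```python
from typing import NamedTuple, Optional, Tuple


class ExtractionExpression(NamedTuple):
    front: str    # rule that locates the start position (may be empty)
    extract: str  # information extraction rule
    rear: str     # rule that locates the end position (may be empty)


def parse_expression(expr: str) -> ExtractionExpression:
    """Split 'front@extract@rear' into its three parts."""
    parts = [p.strip() for p in expr.split("@")]
    if len(parts) != 3:
        raise ValueError("expected exactly two '@' separators")
    return ExtractionExpression(*parts)


def resolve_bounds(expr: ExtractionExpression, text: str,
                   front_pos: Optional[int] = None,
                   rear_pos: Optional[int] = None) -> Tuple[Optional[int], Optional[int]]:
    """Apply the defaults: empty front rule -> start of text, empty rear rule -> end of text."""
    start = 0 if expr.front == "" else front_pos
    end = len(text) if expr.rear == "" else rear_pos  # end offset is exclusive
    return start, end


parsed = parse_expression("PD@(of medium height|average build)@")
print(parsed)
```

Here `front_pos` and `rear_pos` stand for positions found by whatever matcher evaluates the non-empty locating rules; that matcher is outside the scope of this sketch.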
Optionally, the statistical model includes a first model for identifying named entities and a second model for identifying dependency components. The statistical operators include an entity recognition operator PD, which represents the first model, and a dependency component operator DC, which represents the second model.
The first model may be an HMM, a CRF, or a similar model. In the training stage, the model is trained on an annotated corpus to determine its main parameters, yielding the trained first model. In the use stage, the text to be analyzed is input to the model, which outputs the named entities in that text. Similarly, the second model may be trained as an HMM, MEMM, CRF, or similar model; because its annotated training corpus differs from that of the first model, the resulting parameters differ as well, giving a different, second model. For either model, training on annotated corpora from different application scenarios yields different model parameters, so the trained model is better suited to text from its own scenario. For example, a model trained entirely on annotated financial news is better suited to processing financial news, that is, to identifying named entities or dependency components in it. The statistical models may be trained with methods known in the prior art, which are not described here.
For example, extraction expression 1 is "PD@(of medium height | average build)@". In this example, the front locating rule contains only the statistical operator PD; when it is matched against the text, the corresponding statistical model is invoked to identify the named entities in the text, and if a named entity is recognized, its position is taken as the start position. The rear locating rule is empty, so the last character of the text is the end position. The information extraction rule is "(of medium height | average build)", meaning that any occurrence of "of medium height" or "average build" in the effective extraction region is extracted.
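Extraction expression 1 can be walked through end to end with the sketch below. The trained first model is replaced by a toy gazetteer lookup (`toy_ner`), which is purely a placeholder, and two behaviors are assumptions of the sketch rather than statements from the text: the region starts at the first recognized entity, and no region exists when no entity is found.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def toy_ner(text: str) -> List[Span]:
    """Placeholder for the trained first model: looks names up in a tiny gazetteer."""
    spans = []
    for name in ("Wang Erni", "Li Si"):
        spans.extend(m.span() for m in re.finditer(re.escape(name), text))
    return sorted(spans)


def run_expression_1(text: str) -> List[str]:
    """Evaluate 'PD@(of medium height|average build)@' against `text`."""
    entities = toy_ner(text)
    if not entities:
        return []               # assumption: the front rule 'PD' fails, so no region exists
    start = entities[0][0]      # assumption: the region starts at the first entity
    end = len(text)             # empty rear rule: the region runs to the end of the text
    region = text[start:end]
    return re.findall(r"of medium height|average build", region)


print(run_expression_1("Witness report: Li Si is of medium height."))
```

A description that appears before the first recognized entity falls outside the effective extraction region and is not extracted, which is the filtering effect the region determination rule is meant to provide.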
S200: Identify the named entities and/or dependency components in the text using the statistical model, and mark each identified named entity and/or dependency component with a corresponding identification label.
In step S200, the named entities and/or dependency components in the text are identified using the statistical model: the text is the input data of the statistical model, and the identified named entities and/or dependency components are its output data. This can be done with methods in the prior art and is not described again here. Each named entity and/or dependency component identified in the text is marked with a corresponding identification label.
When an extraction expression contains several kinds of statistical operators, the named entities or dependency components identified by each statistical model can be marked with correspondingly different identification labels. For example, in one embodiment, referring to Table 1, the statistical operators may include an entity recognition operator and a dependency component operator, denoted "PD" and "DC" respectively. The statistical models may include a first model for identifying named entities and a second model for identifying dependency components; the entity recognition operator characterizes the first model, and the dependency component operator characterizes the second model. The identification labels include a first label and a second label: everything identified by the first model is a named entity and is marked with the first label, and everything identified by the second model is a dependency component and is marked with the second label.
Table 1: Correspondence example one among multiple statistical operators, multiple statistical models, and multiple identification labels
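The correspondence among operators, models, and labels described above can be sketched as a lookup table. In this sketch the two "models" are toy word-list lookups standing in for trained HMM/CRF models, and `lexicon_model`, `OPERATORS`, `Tagged`, and `tag` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Tagged(NamedTuple):
    start: int
    end: int
    label: str  # "first" = named entity, "second" = dependency component


def lexicon_model(words: List[str]) -> Callable[[str], List[Tuple[int, int]]]:
    """Toy stand-in for a trained statistical model: looks words up in a list."""
    def recognize(text: str) -> List[Tuple[int, int]]:
        spans = []
        for w in words:
            i = text.find(w)
            while i != -1:
                spans.append((i, i + len(w)))
                i = text.find(w, i + 1)
        return sorted(spans)
    return recognize


# operator -> (model it characterizes, label applied to that model's output)
OPERATORS: Dict[str, Tuple[Callable[[str], List[Tuple[int, int]]], str]] = {
    "PD": (lexicon_model(["Wang Erzhu"]), "first"),
    "DC": (lexicon_model(["exercise"]), "second"),
}


def tag(text: str, operators: List[str]) -> List[Tagged]:
    """Run the model behind each operator and label what it identifies."""
    out: List[Tagged] = []
    for op in operators:
        model, label = OPERATORS[op]
        out.extend(Tagged(s, e, label) for s, e in model(text))
    return sorted(out)


text = "Wang Erzhu takes daily exercise."
tags = tag(text, ["PD", "DC"])
print([(text[t.start:t.end], t.label) for t in tags])
```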
S300: Compare the region determination rule with the text using the identification labels, and determine the effective extraction region in the text.
Because the region determination rule has several specific implementation forms, the specific steps for determining the effective extraction region differ slightly. For example, in one implementation, the region determination rule includes a preposition locating rule and a postposition locating rule, and in the present embodiment at least one of the two includes a statistical operator. Matching the preposition locating rule against the text determines the start position, and matching the postposition locating rule against the text determines the end position. Once the start position and end position are determined, the text between them is the effective extraction region. As another example, in another implementation, the region determination rule includes a center locating rule, and the center locating rule includes a statistical operator. Matching the center locating rule against the text determines the center position, and a preset region is then extended from the center position into the preceding and following context to determine the effective extraction region. In either implementation, when the preposition locating rule and/or postposition locating rule, or the center locating rule, is compared with the text, the corresponding identification labels are needed to judge whether the rule matches the text.
Specifically, if the preposition locating rule, postposition locating rule, or center locating rule includes only a statistical operator, then as long as the statistical model corresponding to an identification label is the same model as the one characterized by the statistical operator, the character string marked by that identification label (a named entity or dependency component) is considered to match the statistical operator in the rule. Because the rule includes only the statistical operator, the rule as a whole matches the character string marked by that identification label. If the rule includes both a statistical operator and a regular expression, the rule must match the text as a whole. That is, besides a character string in the text matching the statistical operator, the character strings before and after it must also match the regular expressions before and after the statistical operator in the rule; the statistical operator, the regular expressions, and the ordering and/or logical operation relationships between them must all be matched in the text.
For example, assume the preposition locating rule is "PD(very beautiful|very pretty)". The rule can match character strings in the text such as "Wang Erni is very beautiful" and "Li Ermei is very pretty", but it cannot match a character string in which the named entity is followed by different wording, or one such as "Wang Erni's younger sister is very beautiful", in which other characters separate the named entity from the phrase in the rule.
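The whole-rule matching described above, where a labelled named entity must be followed directly by text that matches the regular expression, can be sketched as below. This is only an illustration under stated assumptions: `spans_of` is a toy stand-in for the first model, `match_locating_rule` is a hypothetical helper, and the pattern includes " is " because the example is rendered in English.

```python
import re


def spans_of(text, names):
    """Toy stand-in for the first model: find known names in the text."""
    return sorted((m.start(), m.end())
                  for n in names for m in re.finditer(re.escape(n), text))


def match_locating_rule(text, entity_spans, tail_pattern):
    """Match a rule of the form 'PD<regex>': a labelled entity immediately
    followed by text matching tail_pattern. Returns (start, end) or None."""
    tail = re.compile(tail_pattern)
    for start, end in entity_spans:
        m = tail.match(text, end)  # anchored right after the entity
        if m:
            return (start, m.end())
    return None


NAMES = ["Wang Erni", "Li Ermei"]
TAIL = r" is (?:very beautiful|very pretty)"  # English adaptation of the rule

ok = "Wang Erni is very beautiful"
bad = "Wang Erni's younger sister is very beautiful"
hit = match_locating_rule(ok, spans_of(ok, NAMES), TAIL)
miss = match_locating_rule(bad, spans_of(bad, NAMES), TAIL)
print(hit, miss)
```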
S400: Extract the character strings that match the information extraction rule from the effective extraction region.
In one implementation, the information extraction rule contains only a regular expression and no statistical operator. The text in the effective extraction region is then matched against the regular expression, and the character strings that match the information extraction rule, that is, the information the user wants to extract from the text, are extracted.
An example is given below.
Text 1, from which information is to be extracted:
Wang Erzhu's build is general at first; later he takes daily exercise, and now he is very strong.
Extraction expression 1: PD@(of medium height|build is general)@
Using the statistical operator PD in the region determination rule, that is, the statistical model characterized by the entity recognition operator, "Wang Erzhu" in text 1 is identified as a named entity and is marked with an identification label, namely the first label. The preposition locating rule "PD" in the region determination rule is then compared with text 1. Because the statistical model corresponding to the first label marked on "Wang Erzhu" and the statistical model characterized by PD in the preposition locating rule are the same model, "Wang Erzhu" matches the preposition locating rule "PD", and the position of "Wang Erzhu" in text 1 is determined as the start position. The postposition locating rule is empty, so the last character of text 1 is determined as the end position. Effective extraction region 1 in text 1 is therefore the text that follows "Wang Erzhu": "build is general at first; later he takes daily exercise, and now he is very strong.".
In this example, suppose the postposition locating rule of extraction expression 1 is replaced with "(DC).{0,10}strong", where DC is the dependency component operator, whose statistical model can identify the dependency components in the text, and ".{0,10}strong" is a regular expression. If the text contains a dependency component and "strong" appears within the 0 to 10 characters after it, the character string from that dependency component through "strong" matches the postposition locating rule. Applied to text 1, the statistical model characterized by DC identifies the dependency component "exercise" and marks it with the corresponding identification label, namely the second label. Because the statistical model corresponding to the second label and the one characterized by DC are the same, the character string "exercise" marked by the second label matches "(DC)" in the postposition locating rule. "strong" is then matched within the 10 characters after "exercise" (character counts refer to the original Chinese text), so the character string from "exercise" through "strong" matches the postposition locating rule. Effective extraction region 2 in text 1 is therefore the text between "Wang Erzhu" and "exercise": "build is general at first; later he takes daily".
The information extraction rule is "of medium height|build is general", so the character string "build is general" can be extracted from effective extraction region 1.
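The walk-through of extraction expression 1 on text 1 can be sketched end to end as follows. This is a minimal sketch under assumptions: `recognize_entities` is a toy stand-in for the first model (PD), `effective_region` is a hypothetical helper, and the region is taken to start right after the matched named entity, as in effective extraction region 1 above.

```python
import re

TEXT1 = ("Wang Erzhu's build is general at first; later he takes daily "
         "exercise, and now he is very strong.")


def recognize_entities(text):
    """Toy stand-in for the first model (PD): finds one known name."""
    return [(m.start(), m.end()) for m in re.finditer("Wang Erzhu", text)]


def effective_region(text, pre_spans, post_pattern=""):
    """Region runs from the end of the first pre-rule match to the start of
    the post-rule match, or to the end of the text if the post rule is empty."""
    if not pre_spans:
        return None
    start = pre_spans[0][1]
    if not post_pattern:
        return text[start:]
    m = re.compile(post_pattern).search(text, start)
    return text[start:m.start()] if m else None


region1 = effective_region(TEXT1, recognize_entities(TEXT1))
extracted = re.findall(r"of medium height|build is general", region1)
print(extracted)
```

Passing a non-empty `post_pattern` gives the variant in which a postposition locating rule bounds the region on the right.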
Optionally, referring to Fig. 2, in another implementation of S400 the information extraction rule includes a statistical operator, as shown in step S101 in Fig. 2. The step of extracting the character strings that match the information extraction rule from the effective extraction region then specifically includes:
S410: Using the identification labels, extract the character strings that match the information extraction rule from the effective extraction region.
When the information extraction rule includes a statistical operator, matching it against the character strings in the effective extraction region is similar to matching the preposition locating rule and the other rules against the text in step S300 above. If the information extraction rule includes only a statistical operator, then as long as the statistical model corresponding to an identification label is the same model as the one characterized by the statistical operator, the character string marked by that identification label (a named entity or dependency component) is considered to match the statistical operator. Because the rule includes only the statistical operator, the character string marked by the identification label matches the rule and is extracted. If the information extraction rule includes both a statistical operator and a regular expression, the rule must match the text as a whole. That is, besides a character string in the text matching the statistical operator, the character strings before and after it must also match the regular expressions before and after the statistical operator in the rule; only when the statistical operator, the regular expressions, and the ordering and/or logical operation relationships between them are all matched in the text is the matched character string extracted as the information obtained from the text.
Continuing with text 1 above, extraction expression 2 is: PD@(of medium height|build is general).{0,10}DC@.
From the preposition locating rule and the postposition locating rule, the effective extraction region is determined to be effective extraction region 1: "build is general at first; later he takes daily exercise, and now he is very strong.".
The statistical model characterized by the dependency component operator DC identifies "exercise" in text 1 as a dependency component and marks it with the second label. "exercise" in the effective extraction region is replaced with the second label. Because the statistical model corresponding to the second label and the one characterized by DC are the same, the character string "exercise" marked by the second label matches "DC" in the information extraction rule. "build is general" is matched within the 0 to 10 characters before "exercise" (character counts refer to the original Chinese text), so the character string "build is general at first; later he takes daily exercise" matches the information extraction rule and is extracted from the effective extraction region.
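An information extraction rule of the form "regular expression, bounded gap, DC" can be sketched as below. The helper `extract_regex_then_operator` is hypothetical, the DC spans come from a toy lookup rather than a trained model, and the gap limit is widened to 40 because the English rendering of text 1 is longer than the original.

```python
import re

REGION1 = ("'s build is general at first; later he takes daily exercise, "
           "and now he is very strong.")


def dependency_spans(region):
    """Toy stand-in for the second model (DC)."""
    return [(m.start(), m.end()) for m in re.finditer("exercise", region)]


def extract_regex_then_operator(region, dc_spans, head_pattern, max_gap):
    """Match '<head_pattern>.{0,max_gap}DC': a regex hit followed, within
    max_gap characters, by a span labelled by the DC model."""
    results = []
    for m in re.finditer(head_pattern, region):
        for start, end in dc_spans:
            if 0 <= start - m.end() <= max_gap:
                results.append(region[m.start():end])
                break
    return results


hits = extract_regex_then_operator(
    REGION1, dependency_spans(REGION1),
    r"of medium height|build is general", max_gap=40)
print(hits)
```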
With the above method, rules are combined with statistics: statistical models can be called flexibly in the region determination rule and/or the information extraction rule, and can also be combined with regular expressions, giving extraction expressions in richer forms. Compared with plain regular expressions, extracting information with such an extraction expression widens the range of vocabulary that can be recognized, so the information the user needs can be extracted more completely, while avoiding the large cost in labor and time of building rules. Compared with a method based purely on statistical models, the information the user needs can be extracted more accurately.
When an extraction expression contains several kinds of statistical operators, the statistical models characterized by these operators may each be run once over the whole text, and the content each one identifies is marked with the corresponding identification label. Alternatively, referring to Fig. 3 and Fig. 4, if the region determination rule includes only one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, and the information extraction rule includes the other, then step S200 includes:
S201: Identify the named entities in the text using the first model, and mark each identified named entity with the first label;
S202: Identify the dependency components in the effective extraction region using the second model, and mark each identified dependency component with the second label.
Alternatively, step S200 includes:
S203: Identify the dependency components in the text using the second model, and mark each identified dependency component with the second label;
S204: Identify the named entities in the effective extraction region using the first model, and mark each identified named entity with the first label.
When the information extraction rule contains statistical operators that are not included in the region determination rule, it is not necessary to run the statistical models characterized by all statistical operators in the extraction expression over the whole text. Instead, the statistical model characterized by the statistical operator in the region determination rule is first run over the text. After the effective extraction region has been determined, the statistical model characterized by the statistical operator that is included in the information extraction rule but not in the region determination rule is run over the effective extraction region only. This shortens the text that some statistical models need to process, speeds up recognition, and in turn speeds up information extraction.
It should be noted that, in this implementation, steps S202 and S204 logically come after step S300. In this application, step numbers are used only for ease of description and do not limit the order of the steps in the method; the order in which the steps are executed may change as long as it is logically reasonable.
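The staged recognition order (S201, then S300, then S202) can be sketched as follows. The models here are toy word-list lookups that also count how many characters they were asked to scan, purely to make the saving visible; `lexicon_model` and `two_stage` are hypothetical names, and the region is assumed to start right after the first named entity with an empty postposition locating rule.

```python
import re


def lexicon_model(words):
    """Toy stand-in for a trained model; records how many characters it scanned."""
    def recognize(text):
        recognize.chars_scanned += len(text)
        return sorted((m.start(), m.end())
                      for w in words for m in re.finditer(re.escape(w), text))
    recognize.chars_scanned = 0
    return recognize


first_model = lexicon_model(["Wang Erzhu"])   # named entities (PD)
second_model = lexicon_model(["exercise"])    # dependency components (DC)


def two_stage(text):
    entities = first_model(text)                # S201: whole text
    start = entities[0][1] if entities else 0   # S300: region starts after the entity
    region = text[start:]
    components = [(s + start, e + start)        # S202: effective region only
                  for s, e in second_model(region)]
    return start, components


TEXT1 = ("Wang Erzhu's build is general at first; later he takes daily "
         "exercise, and now he is very strong.")
start, components = two_stage(TEXT1)
print(start, [TEXT1[s:e] for s, e in components])
```

The offsets returned by the second model are shifted by `start` so that they refer to positions in the full text.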
Optionally, as shown in Table 2, the types of the first label may include a person name label, a place name label, and an organization label, and the types of the second label include a core component label, a dependent word label, an agent component label, and a patient component label.
In step S201 and/or step S204, the step of marking each identified named entity with the first label includes:
If the named entity identified by the first model is a person name, a place name, or an organization, marking the identified named entity with the corresponding person name label, place name label, or organization label.
In step S202 and/or step S203, the step of marking each identified dependency component with the second label includes:
If the dependency component identified by the second model is a core component, a dependent word, an agent component, or a patient component, marking the identified dependency component with the corresponding core component label, dependent word label, agent component label, or patient component label.
If the region determination rule includes a statistical operator, then, referring to Fig. 5, step S300 may include:
S301: Compare the region determination rule with the text, wherein if the specified label carried by a statistical operator in the region determination rule matches the type of a first label/second label, the statistical operator matches the character string marked with that first label/second label; the specified label characterizes the type of named entity or the type of dependency component that the user wants to identify in the text;
S302: Determine the effective extraction region according to the position at which the region determination rule matches the text.
If the information extraction rule includes a statistical operator, then, referring to Fig. 6, step S410 may include:
S411: Compare the information extraction rule with the effective extraction region, wherein if the specified label carried by a statistical operator in the information extraction rule matches the type of a first label/second label, the statistical operator matches the character string marked with that first label/second label;
S412: Extract the character strings that match the information extraction rule.
Table 2: Correspondence example two among multiple statistical operators, multiple statistical models, and multiple identification labels
Whether the statistical operator is in the region determination rule or in the information extraction rule, the named entities/dependency components identified by the statistical model it characterizes can be further classified and marked with labels of different types. The specified label of a statistical operator in the extraction expression then limits more precisely the information the user wants to extract from the text, which reduces extraction errors and improves the accuracy of the extracted information.
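The check between an operator's specified label and the type of an identification label can be sketched as a small predicate. The token format "operator_type" follows the "PD_PER" and "PD_POS" examples in this description; the function names are hypothetical, and the type codes for dependency components are not given in this section, so none are shown.

```python
def parse_operator(token):
    """Split an operator token such as 'PD_PER' into ('PD', 'PER').
    A bare operator such as 'PD' carries no specified label."""
    op, _, wanted = token.partition("_")
    return op, (wanted or None)


def operator_matches(token, label_operator, label_type):
    """True if the identification label was produced by the model that the
    operator characterizes and, when a specified label is present, the
    label's type agrees with it."""
    op, wanted = parse_operator(token)
    return op == label_operator and (wanted is None or wanted == label_type)


print(operator_matches("PD_PER", "PD", "PER"))  # person name label
print(operator_matches("PD_PER", "PD", "POS"))  # place name label
print(operator_matches("PD", "PD", "POS"))      # no specified label
```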
For example, information extraction rule a is "PD.{0,2}very beautiful", and information extraction rule b is "(PD_PER).{0,2}very beautiful".
The effective extraction region of text 2 is: Henan Province is very beautiful, and Henan Province-born Wang Erni is also very beautiful.
When extracting with information extraction rule a, the first model characterized by PD identifies the named entities "Henan Province", "Henan Province", and "Wang Erni" in the effective extraction region, and each of these character strings is marked with the first label. Information extraction rule a is then matched against the effective extraction region of text 2. The statistical model corresponding to the first label on the first "Henan Province" is the first model, the same as the one characterized by PD, so the first "Henan Province" matches PD. "very beautiful" follows the first "Henan Province" at an interval of 0 characters (character counts refer to the original Chinese text), which matches the regular expression ".{0,2}very beautiful" in information extraction rule a. "Henan Province is very beautiful" therefore matches rule a and can be extracted.
Similarly, the second "Henan Province" also matches PD. Although the character string after it contains "very beautiful", the interval between the two exceeds 2 characters, so there is no match.
Similarly, "Wang Erni" also matches PD, and the interval between "Wang Erni" and the following "very beautiful" is 1 character, which matches the regular expression ".{0,2}very beautiful" in information extraction rule a. "Wang Erni is also very beautiful" therefore matches rule a and can be extracted.
When extracting with information extraction rule b, the first model characterized by PD identifies the named entities "Henan Province", "Henan Province", and "Wang Erni" in the effective extraction region. Because the two occurrences of "Henan Province" are place names, each is marked with the place name label; "Wang Erni" is a person name and is marked with the person name label.
Information extraction rule b is then matched against the effective extraction region of text 2. Although the statistical model corresponding to the place name labels on the first and second "Henan Province" is also the first model, the same as the one characterized by PD, the specified label carried by the statistical operator in rule b is "_PER"; that is, the type of named entity the user wants to identify in the text is person name, which does not match the place name label. The statistical operator "(PD_PER)" in rule b therefore matches neither "Henan Province".
Similarly, "Wang Erni" is marked with the person name label, which matches the statistical operator "(PD_PER)" in rule b. In addition, "also very beautiful" after "Wang Erni" matches the regular expression ".{0,2}very beautiful" in rule b, so "Wang Erni is also very beautiful" matches rule b and can be extracted.
As can be seen, extracting with information extraction rule a yields two character strings from the effective extraction region, "Henan Province is very beautiful" and "Wang Erni is also very beautiful", whereas extracting with information extraction rule b yields only "Wang Erni is also very beautiful" from the same effective extraction region.
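The difference between rule a and rule b can be reproduced with a small sketch. Assumptions: the typed entities come from a toy lookup instead of the first model, `extract` is a hypothetical helper, the type codes "PER" and "POS" follow the operator examples in this description, and the gap limit is widened from 2 to 9 because the English rendering inserts " is " and " is also " between the entity and "very beautiful".

```python
import re

TEXT2 = ("Henan Province is very beautiful, and Henan Province-born "
         "Wang Erni is also very beautiful.")


def find_all(text, word, etype):
    """Toy stand-in for the first model with typed labels."""
    return [(m.start(), m.end(), etype)
            for m in re.finditer(re.escape(word), text)]


def extract(text, entities, tail_pattern, wanted_type=None):
    """Match '<PD or PD_type><tail_pattern>' at each labelled entity."""
    tail = re.compile(tail_pattern)
    out = []
    for start, end, etype in entities:
        if wanted_type is not None and etype != wanted_type:
            continue  # specified label does not match the label type
        m = tail.match(text, end)
        if m:
            out.append(text[start:m.end()])
    return out


entities = sorted(find_all(TEXT2, "Henan Province", "POS")
                  + find_all(TEXT2, "Wang Erni", "PER"))
TAIL = r".{0,9}very beautiful"  # English adaptation of ".{0,2}very beautiful"

rule_a = extract(TEXT2, entities, TAIL)                     # PD
rule_b = extract(TEXT2, entities, TAIL, wanted_type="PER")  # PD_PER
print(rule_a)
print(rule_b)
```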
Optionally, both the region determination rule and the information extraction rule may include business factor concepts and/or generic concepts.
In this application, a generic concept refers to the word sense information of vocabulary in the text that is unrelated to any specific business, and to the semantic associations between such vocabulary. A generic concept can represent a group of words or a sentence. It is a description of an object that reflects, in abstract form, the essential attributes of the object described, such as time, place, mood, or evaluation. Generic concepts can usually be reused across different fields and application scenarios. A generic concept can be denoted by "c".
For example, the generic concept "negative", that is, "c_negative", can represent words such as "not", "have not", and "never". If a text includes any of "not", "have not", and "never", that word in the text is considered to match the generic concept "c_negative".
As another example, the generic concept "discontented", that is, "c_discontented", can be expressed as "[^not]{0,5}discontented". This expression matches "discontented" together with the 0 to 5 characters before it, provided none of those characters is "not"; it matches, for example, "very discontented", while excluding sentences with the opposite meaning such as "not discontented" or "not really discontented". Therefore, if "discontented" in a text is preceded by 0 to 5 characters that do not include "not", that character string is considered to match the generic concept "c_discontented".
A business factor concept refers to the semantic information of vocabulary related to a specific business and the semantic associations between such vocabulary. Like a generic concept, a business factor concept can represent a group of words or a sentence. It is a description of a business-related object or its attributes; it is usually tied to a particular field or business and cannot be reused across different fields or application scenarios. A business factor concept can be denoted by "e".
For example, in the bank card customer service field, the business factor concept "counterfeit information", that is, "e_counterfeit information", can represent words such as "counterfeit short message", "counterfeit message", "counterfeit call", and "counterfeit mail". When a text includes any of these, that word in the text matches the business factor concept "e_counterfeit information".
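Concepts of both kinds can be pictured as named patterns. The sketch below assumes a simple dictionary from concept name to a regular expression; the dictionary, the English word lists, and `concept_hits` are illustrative only and are not the concept tree or factor tree structures of the application.

```python
import re

# Hypothetical concept dictionary: concept name -> regular-expression body.
CONCEPTS = {
    "c_negative": r"\b(?:have not|not|never)\b",
    "e_counterfeit information":
        r"counterfeit (?:short message|message|call|mail)",
}


def concept_hits(concept, text):
    """Return every substring of text that matches the named concept."""
    return [m.group(0) for m in re.finditer(CONCEPTS[concept], text)]


print(concept_hits("c_negative", "I have never used this card"))
print(concept_hits("e_counterfeit information",
                   "I received a counterfeit call and a counterfeit mail"))
```

The word boundaries in `c_negative` keep the pattern from firing inside longer English words; they are an adaptation for the English rendering.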
A semantic model is the set of textual forms, generalized and enumerated from sample data, that describe the semantics of a known concept. Multiple generic concepts unrelated to business, organized in a tree structure, form a concept tree; a concept tree can be understood as a semantic model. Multiple business-related business factor concepts, organized in a tree structure, form a factor tree; a factor tree can likewise be understood as a semantic model. Such a semantic model can be used to recognize text and determine whether the text contains character strings that match the generic concepts or business factor concepts in the semantic model.
In the scheme of this application, such generic concepts or business factor concepts can also be introduced into rules to form extraction expressions in richer forms, so that information is extracted accurately and completely.
Because the region determination rule in the present embodiment includes a statistical operator, when the region determination rule also includes business factor concepts and/or generic concepts, the situation is similar to the case where it also includes regular expressions: there are ordering and/or logical operation relationships between the statistical operator and the business factor concepts and/or generic concepts. In addition, the region determination rule may include statistical operators, regular expressions, business factor concepts, and generic concepts all at once.
The information extraction rule in the present embodiment may or may not include a statistical operator. It may include any one or more of business factor concepts, generic concepts, statistical operators, and regular expressions, chosen and combined according to the application scenario so as to extract information more accurately. When it includes several of these, there are ordering and/or logical operation relationships among the business factor concepts/generic concepts, the statistical operators, and the regular expressions.
A further example is given below.
Extraction expression 3: @(PD_PER|PD_POS)@c_commendatory evaluation
The generic concept (c) is:
c_commendatory evaluation: very beautiful, rich in products, very clever.
Text 3: Although Wang Erni never had any formal schooling, her son Zhang Fei is very clever.
The preposition locating rule is empty, so the first character of text 3 is used as the start position. The postposition locating rule is "c_commendatory evaluation", which matches "very clever" in text 3; the matched position is used as the end position. From the start position and end position, the effective extraction region in text 3 is determined to be "Although Wang Erni never had any formal schooling, her son Zhang Fei".
The first model characterized by PD is used to identify named entities in the effective extraction region; it recognizes "Wang Erni" and "Zhang Fei" and marks each with the person name label. Because the type of both matches the specified label "_PER" carried by PD in the information extraction rule, "Wang Erni" and "Zhang Fei" match the information extraction rule "(PD_PER|PD_POS)", and the two character strings "Wang Erni" and "Zhang Fei" can be extracted from text 3.
Optionally, the spacing distance between the region determination rule and the information extraction rule can be limited.
For example, extraction expression 4: @(PD_PER|PD_POS)@.{0,2}c_commendatory evaluation
Here, ".{0,2}" in the region determination rule indicates that the spacing distance between the extracted person name or place name and the position matched by the generic concept c_commendatory evaluation is 0 to 2 characters.
Continuing with text 3 above, although "Wang Erni" can match the information extraction rule, the spacing distance between it and "very clever" exceeds 2 characters, whereas the spacing distance between "Zhang Fei" and "very clever" is 0 characters (character counts refer to the original Chinese text). Therefore only the character string "Zhang Fei" is extracted from text 3.
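Extraction expressions 3 and 4 differ only in the spacing limit, which the following sketch makes explicit. Assumptions: `person_spans` is a toy stand-in for the first model with the person name label, `extract_before_concept` is a hypothetical helper, the concept is written as a plain alternation, and the gap limit is 4 rather than 2 because the English rendering places " is " between "Zhang Fei" and "very clever".

```python
import re

TEXT3 = ("Although Wang Erni never had any formal schooling, "
         "her son Zhang Fei is very clever.")
# c_commendatory evaluation, written out as an alternation
CONCEPT = r"very beautiful|rich in products|very clever"


def person_spans(text):
    """Toy stand-in for the first model with the person name label (PD_PER)."""
    return sorted((m.start(), m.end())
                  for n in ("Wang Erni", "Zhang Fei")
                  for m in re.finditer(n, text))


def extract_before_concept(text, entities, concept_pattern, max_gap=None):
    """Keep entities that lie before the concept match and, if max_gap is
    given, end no more than max_gap characters before it."""
    hit = re.search(concept_pattern, text)
    if hit is None:
        return []
    return [text[s:e] for s, e in entities
            if e <= hit.start()
            and (max_gap is None or hit.start() - e <= max_gap)]


expr3 = extract_before_concept(TEXT3, person_spans(TEXT3), CONCEPT)
expr4 = extract_before_concept(TEXT3, person_spans(TEXT3), CONCEPT, max_gap=4)
print(expr3)
print(expr4)
```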
Referring to Fig. 7, the second embodiment provides an information extraction method that includes the following steps S500 to S800.
S500: Obtain the text from which information is to be extracted and an extraction expression. The extraction expression includes a region determination rule and an information extraction rule; the information extraction rule includes a statistical operator, and the statistical operator characterizes a statistical model for identifying the named entities and/or dependency components in the text.
S600: Identify the named entities and/or dependency components in the text using the statistical model, and mark each identified named entity and/or dependency component with a corresponding identification label.
S700: Determine the effective extraction region in the text using the region determination rule.
S800: Using the identification labels, extract the character strings that match the information extraction rule from the effective extraction region.
For the text from which information is to be extracted, the named entities, dependency components, statistical operators, extraction expression, and so on in step S500, refer to the description of step S100 in the first embodiment; details are not repeated here. This step differs from step S100 in that, in the extraction expression obtained in this step, the information extraction rule includes a statistical operator, while the region determination rule may or may not include one.
For step S600, refer to the description of step S200 in the first embodiment; details are not repeated here.
If the region determination rule does not include any statistical operator, step S600 should logically come after step S700 and may specifically include: identifying the named entities and/or dependency components in the effective extraction region using the statistical model, and marking each identified named entity and/or dependency component with a corresponding identification label.
In step S700, if the region determination rule does not include a statistical operator, the effective extraction region can be determined directly with prior-art methods such as regular expression matching. If the region determination rule also includes a statistical operator, step S700 may specifically include: comparing the region determination rule with the text using the identification labels, and determining the effective extraction region in the text.
For this step, refer to the description of step S300 in the first embodiment; details are not repeated here.
For step S800, refer to the description of S410 in the first embodiment for the case where the information extraction rule includes a statistical operator; details are not repeated here.
As in the first embodiment, if the information extraction rule contains one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, and the region determination rule contains only the other of the two, then step S600 includes:
S601: Identify the named entities in the text using the first model, and mark each identified named entity with a first label;
S602: Identify the dependency components in the effective extraction region using the second model, and mark each identified dependency component with a second label.
Alternatively, step S600 includes:
S603: Identify the dependency components in the text using the second model, and mark each identified dependency component with a second label;
S604: Identify the named entities in the effective extraction region using the first model, and mark each identified named entity with a first label.
In this way, it is not necessary to run the statistical models characterized by all the statistical operators in the extraction expression over the whole text. Instead, the statistical model characterized by the statistical operator in the region determination rule is first run once over the text. After the effective extraction region is determined, the statistical model characterized by the statistical operator that appears in the information extraction rule but not in the region determination rule is run once over the effective extraction region only. This shortens the text that some of the statistical models have to process, which speeds up recognition and, in turn, information extraction.
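The staged recognition of S601 and S602 can be sketched as follows. This is only an illustration under stated assumptions: both "models" are toy vocabulary lookups, and the region policy (from the first person name up to `window` characters after it) is invented here, since the actual region is whatever the region determination rule matches.

```python
import re

PERSONS = {"Zhang San", "Li Si"}         # toy stand-in for the first model
CORE_WORDS = {"appointed", "resigned"}   # toy stand-in for the second model


def run_model(vocab, label, text, offset=0):
    """Return sorted (start, end, label) spans, shifted by offset."""
    spans = []
    for word in vocab:
        for m in re.finditer(re.escape(word), text):
            spans.append((m.start() + offset, m.end() + offset, label))
    return sorted(spans)


def staged_tagging(text, window=30):
    # S601: the first model runs over the whole text.
    first = run_model(PERSONS, "PERSON", text)
    if not first:
        return [], [], None
    # Hypothetical region policy: first person name plus `window` characters.
    start = first[0][0]
    end = min(len(text), first[0][1] + window)
    # S602: the second model runs over the effective extraction region only.
    second = run_model(CORE_WORDS, "CORE", text[start:end], offset=start)
    return first, second, (start, end)


text = "Notice. Zhang San resigned yesterday. Later the board appointed a successor."
first, second, region = staged_tagging(text)
print([text[s:e] for s, e, _ in second])
print(region[1] - region[0], "of", len(text), "characters seen by the second model")
```

The second model processes only the region, so "appointed", which lies outside it, is never examined; with a real dependency parser this is where the time saving would come from.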
Optionally, as in the first embodiment, the types of the first label may include a person-name label, a place-name label and an organization label, and the types of the second label include a core component label, a dependent word label, an agent component label and a patient component label. Both the region determination rule and the information extraction rule may contain regular expressions, business factor concepts and/or generic concepts. When the region determination rule contains one or more of regular expressions, business factor concepts, generic concepts and statistical operators, different regular expressions, business factor concepts, generic concepts and/or statistical operators can be combined, that is, they can have sequential relationships and/or logical operation relationships with one another. For details, refer to the related description in the first embodiment; details are not repeated here.
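One possible way to combine statistical operators with regular expressions in sequence and by logical OR is sketched below. It is not the patent's rule syntax: the sketch assumes that labelled spans do not overlap, that the text contains no `<` character, and that spans are serialized as `<LABEL>surface</LABEL>` so that a whole rule compiles to one regular expression. The helper names `annotate`, `stat`, `seq`, `any_of` and `span_of` are hypothetical.

```python
import re


def annotate(text, spans):
    """Wrap each labelled span as <LABEL>surface</LABEL>."""
    out, pos = [], 0
    for start, end, label in sorted(spans):
        out.append(text[pos:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


def stat(label):
    """Statistical operator: matches any string carrying the given label."""
    return rf"<{label}>[^<]*</{label}>"


def seq(*parts):
    """Sequential relationship: the parts must match one after another."""
    return "".join(parts)


def any_of(*parts):
    """Logical OR relationship between rule elements."""
    return "(?:" + "|".join(parts) + ")"


def span_of(text, surface, label):
    """Build a (start, end, label) span for the first occurrence of surface."""
    start = text.index(surface)
    return (start, start + len(surface), label)


text = "Zhang San signed the lease; Acme Ltd signed later; Bob signed too."
spans = [span_of(text, "Zhang San", "PERSON"), span_of(text, "Acme Ltd", "ORG")]

# Rule: (PERSON or ORG) followed by the literal word "signed".
rule = seq(any_of(stat("PERSON"), stat("ORG")), r"\s+signed\b")
matches = [re.sub(r"</?\w+>", "", m.group(0))
           for m in re.finditer(rule, annotate(text, spans))]
print(matches)
```

"Bob signed" is not matched because "Bob" carries no label in this toy input, which shows how the statistical operator constrains the regular expression next to it.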
The third embodiment of the present application provides an information extraction device corresponding to the foregoing information extraction method. Referring to FIG. 8, in a first implementation, the device includes:
A first acquisition unit 1, configured to obtain a text from which information is to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the region determination rule contains a statistical operator, and the statistical operator characterizes a statistical model for identifying named entities and/or dependency components in text;
A first processing unit 2, configured to: identify the named entities and/or dependency components in the text using the statistical model, and mark each identified named entity and/or dependency component with its corresponding identification label; use the identification labels to compare the region determination rule against the text and determine the effective extraction region in the text; and extract from the effective extraction region the character strings that match the information extraction rule.
Optionally, the first processing unit 2 is specifically configured, in the case where the information extraction rule contains a statistical operator, to use the identification labels to extract from the effective extraction region the character strings that match the information extraction rule.
Optionally, the statistical model includes a first model for identifying named entities and a second model for identifying dependency components, and the identification labels include a first label and a second label. The first processing unit 2 is further configured, in the case where the information extraction rule contains one of the statistical operator characterizing the first model and the statistical operator characterizing the second model and the region determination rule contains only the other of the two, to: identify the named entities (or dependency components) in the text using the first model (or second model), and mark each identified named entity (or dependency component) with the first label (or second label); and identify the dependency components (or named entities) in the effective extraction region using the second model (or first model), and mark each identified dependency component (or named entity) with the second label (or first label).
Optionally, the types of the first label include a person-name label, a place-name label and an organization label, and the types of the second label include a core component label, a dependent word label, an agent component label and a patient component label.
The first processing unit 2 is further configured to: where a named entity identified by the first model is a person name, place name or organization, mark the identified named entity with the corresponding person-name label, place-name label or organization label; where a dependency component identified by the second model is a core component, dependent word, agent component or patient component, mark the identified dependency component with the corresponding core component label, dependent word label, agent component label or patient component label; compare the region determination rule against the text; and determine the effective extraction region according to the position at which the region determination rule matches the text. Here, if the specified label carried by a statistical operator in the region determination rule matches the type of a first label or second label, the statistical operator matches the character string marked with that first label or second label. The specified label characterizes the type of named entity or dependency component that the user expects to identify in the text.
Optionally, the first processing unit 2 is further configured to compare the information extraction rule against the effective extraction region and to extract the character strings that match the information extraction rule. Here, if the specified label carried by a statistical operator in the information extraction rule matches the type of a first label or second label, the statistical operator matches the character string marked with that first label or second label.
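The matching condition described above, in which a statistical operator matches a labelled string only when its specified label equals the label's type, can be sketched in Python as follows. The class names, the enum values and the region policy (the sentence containing the matched string, delimited by periods) are assumptions for illustration; the text does not fix how the matched position is expanded into a region.

```python
from dataclasses import dataclass
from enum import Enum


class FirstLabel(Enum):       # label types for named entities
    PERSON = "person"
    PLACE = "place"
    ORG = "organization"


class SecondLabel(Enum):      # label types for dependency components
    CORE = "core"
    DEPENDENT = "dependent"
    AGENT = "agent"
    PATIENT = "patient"


@dataclass(frozen=True)
class Tag:
    start: int
    end: int
    label: Enum


@dataclass(frozen=True)
class StatOperator:
    specified_label: Enum     # the type the user expects to find

    def matches(self, tag):
        # The operator matches a string only if the label types agree.
        return tag.label is self.specified_label


def tag_of(text, surface, label):
    """Build a Tag for the first occurrence of surface in text."""
    start = text.index(surface)
    return Tag(start, start + len(surface), label)


def effective_regions(text, tags, operator):
    """Assumed policy: the region is the sentence containing each match."""
    regions = []
    for tag in tags:
        if operator.matches(tag):
            start = text.rfind(".", 0, tag.start) + 1
            end = text.find(".", tag.end)
            end = len(text) if end == -1 else end + 1
            regions.append(text[start:end].strip())
    return regions


text = "The meeting was short. Wang Wu moved to Shanghai. Nothing else happened."
tags = [
    tag_of(text, "Wang Wu", FirstLabel.PERSON),
    tag_of(text, "Shanghai", FirstLabel.PLACE),
    tag_of(text, "moved", SecondLabel.CORE),
]
place_regions = effective_regions(text, tags, StatOperator(FirstLabel.PLACE))
print(place_regions)
```

An operator whose specified label has no counterpart among the tags, such as `StatOperator(SecondLabel.AGENT)` here, yields no region at all.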
Optionally, the information extraction rule or the region determination rule further includes a regular expression, where the statistical operator and the regular expression have a sequential relationship and/or a logical operation relationship. The region determination rule or the information extraction rule may further include a business factor concept or generic concept, which has a sequential relationship and/or a logical operation relationship with the statistical operator or with the regular expression.
Referring to FIG. 9, in a second implementation, the device includes:
A second acquisition unit 3, configured to obtain a text from which information is to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the information extraction rule contains a statistical operator, and the statistical operator characterizes a statistical model for identifying named entities and/or dependency components in text;
A second processing unit 4, configured to: identify the named entities and/or dependency components in the text using the statistical model, and mark each identified named entity and/or dependency component with its corresponding identification label; determine the effective extraction region in the text using the region determination rule; and, using the identification labels, extract from the effective extraction region the character strings that match the information extraction rule.
For details of the second processing unit 4, refer to the first implementation; details are not repeated here. The above information extraction device corresponds to the information extraction methods of the first and second embodiments and has the corresponding beneficial effects, which are likewise not repeated here.
For parts that are the same or similar across the embodiments in this specification, the embodiments may be referred to one another. The embodiments of the invention described above do not limit the scope of protection of the invention.

Claims (10)

1. An information extraction method, characterized by comprising:
obtaining a text from which information is to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the region determination rule contains a statistical operator, and the statistical operator characterizes a statistical model for identifying named entities and/or dependency components in text;
identifying the named entities and/or dependency components in the text using the statistical model, and marking each identified named entity and/or dependency component with its corresponding identification label;
using the identification labels to compare the region determination rule against the text, and determining an effective extraction region in the text;
extracting from the effective extraction region the character strings that match the information extraction rule.
2. The method according to claim 1, characterized in that the information extraction rule contains a statistical operator;
the step of extracting from the effective extraction region the character strings that match the information extraction rule specifically comprises:
using the identification labels, extracting from the effective extraction region the character strings that match the information extraction rule.
3. The method according to claim 2, characterized in that the statistical model comprises a first model for identifying named entities and a second model for identifying dependency components, and the identification labels comprise a first label and a second label;
if the region determination rule contains only one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, and the information extraction rule contains the other of the two, then the step of identifying the named entities and/or dependency components in the text using the statistical model and marking each identified named entity and/or dependency component with its corresponding identification label specifically comprises:
identifying the named entities (or dependency components) in the text using the first model (or second model), and marking each identified named entity (or dependency component) with the first label (or second label);
identifying the dependency components (or named entities) in the effective extraction region using the second model (or first model), and marking each identified dependency component (or named entity) with the second label (or first label).
4. The method according to claim 3, characterized in that the types of the first label comprise a person-name label, a place-name label and an organization label, and the types of the second label comprise a core component label, a dependent word label, an agent component label and a patient component label;
the step of marking each identified named entity with the first label comprises:
if the named entity identified by the first model is a person name, place name or organization, marking the identified named entity with the corresponding person-name label, place-name label or organization label;
the step of marking each identified dependency component with the second label comprises:
if the dependency component identified by the second model is a core component, dependent word, agent component or patient component, marking the identified dependency component with the corresponding core component label, dependent word label, agent component label or patient component label;
the step of using the identification labels to compare the region determination rule against the text and determining the effective extraction region in the text comprises:
comparing the region determination rule against the text, wherein, if the specified label carried by a statistical operator in the region determination rule matches the type of a first label or second label, the statistical operator matches the character string marked with that first label or second label, the specified label characterizing the type of named entity or dependency component that the user expects to identify in the text;
determining the effective extraction region according to the position at which the region determination rule matches the text.
5. The method according to claim 4, characterized in that the step of using the identification labels to extract from the effective extraction region the character strings that match the information extraction rule comprises:
comparing the information extraction rule against the effective extraction region, wherein, if the specified label carried by a statistical operator in the information extraction rule matches the type of a first label or second label, the statistical operator matches the character string marked with that first label or second label;
extracting the character strings that match the information extraction rule.
6. The method according to claim 1, characterized in that the region determination rule further comprises a regular expression, wherein the statistical operator and the regular expression have a sequential relationship and/or a logical operation relationship.
7. The method according to claim 6, characterized in that the region determination rule or the information extraction rule further comprises a business factor concept or generic concept, which has a sequential relationship and/or a logical operation relationship with the statistical operator or with the regular expression.
8. An information extraction method, characterized by comprising:
obtaining a text from which information is to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the information extraction rule contains a statistical operator, and the statistical operator characterizes a statistical model for identifying named entities and/or dependency components in text;
identifying the named entities and/or dependency components in the text using the statistical model, and marking each identified named entity and/or dependency component with its corresponding identification label;
determining an effective extraction region in the text using the region determination rule;
using the identification labels, extracting from the effective extraction region the character strings that match the information extraction rule.
9. An information extraction device, characterized by comprising:
a first acquisition unit, configured to obtain a text from which information is to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the region determination rule contains a statistical operator, and the statistical operator characterizes a statistical model for identifying named entities and/or dependency components in text;
a first processing unit, configured to: identify the named entities and/or dependency components in the text using the statistical model, and mark each identified named entity and/or dependency component with its corresponding identification label; use the identification labels to compare the region determination rule against the text and determine an effective extraction region in the text; and extract from the effective extraction region the character strings that match the information extraction rule.
10. An information extraction device, characterized by comprising:
a second acquisition unit, configured to obtain a text from which information is to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the information extraction rule contains a statistical operator, and the statistical operator characterizes a statistical model for identifying named entities and/or dependency components in text;
a second processing unit, configured to: identify the named entities and/or dependency components in the text using the statistical model, and mark each identified named entity and/or dependency component with its corresponding identification label; determine an effective extraction region in the text using the region determination rule; and, using the identification labels, extract from the effective extraction region the character strings that match the information extraction rule.
CN201810401030.1A 2018-04-28 2018-04-28 Information extraction method and device Active CN108647194B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810401030.1A CN108647194B (en) 2018-04-28 2018-04-28 Information extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810401030.1A CN108647194B (en) 2018-04-28 2018-04-28 Information extraction method and device

Publications (2)

Publication Number Publication Date
CN108647194A true CN108647194A (en) 2018-10-12
CN108647194B CN108647194B (en) 2022-04-19

Family

ID=63748759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810401030.1A Active CN108647194B (en) 2018-04-28 2018-04-28 Information extraction method and device

Country Status (1)

Country Link
CN (1) CN108647194B (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310718A1 (en) * 2007-06-18 2008-12-18 International Business Machines Corporation Information Extraction in a Natural Language Understanding System
CA3094442A1 (en) * 2008-01-30 2009-08-06 Thomson Reuters Enterprise Centre Gmbh Financial event and relationship extraction
CN101853292A (en) * 2010-05-18 2010-10-06 深圳市北科瑞讯信息技术有限公司 Method and system for constructing business social network
US20120303661A1 (en) * 2011-05-27 2012-11-29 International Business Machines Corporation Systems and methods for information extraction using contextual pattern discovery
CN102750316A (en) * 2012-04-25 2012-10-24 北京航空航天大学 Concept relation label drawing method based on semantic co-occurrence model
US8682906B1 (en) * 2013-01-23 2014-03-25 Splunk Inc. Real time display of data field values based on manual editing of regular expressions
US20150081279A1 (en) * 2013-09-19 2015-03-19 Maluuba Inc. Hybrid natural language processor
CN103823859A (en) * 2014-02-21 2014-05-28 安徽博约信息科技有限责任公司 Name recognition algorithm based on combination of decision-making tree rules and multiple statistic models
CN103838870A (en) * 2014-03-21 2014-06-04 武汉科技大学 News atomic event extraction method based on information unit fusion
CN104572628A (en) * 2015-02-05 2015-04-29 《中国学术期刊(光盘版)》电子杂志社有限公司 System and method for automatically extracting academic definition based on syntax characteristics
CN106156083A (en) * 2015-03-31 2016-11-23 联想(北京)有限公司 A kind of domain knowledge processing method and processing device
CN104933027A (en) * 2015-06-12 2015-09-23 华东师范大学 Open Chinese entity relation extraction method using dependency analysis
CN105938495A (en) * 2016-04-29 2016-09-14 乐视控股(北京)有限公司 Entity relationship recognition method and apparatus
CN105930509A (en) * 2016-05-11 2016-09-07 华东师范大学 Method and system for automatic extraction and refinement of domain concept based on statistics and template matching
CN106776866A (en) * 2016-11-29 2017-05-31 首都师范大学 A kind of method that meeting original text on University Websites carries out Knowledge Extraction
CN107423279A (en) * 2017-04-11 2017-12-01 美林数据技术股份有限公司 A kind of information extraction and analysis method of credit financing short message
CN107368470A (en) * 2017-06-27 2017-11-21 北京神州泰岳软件股份有限公司 A kind of method and apparatus for extracting enterprises organizational structure information
CN107608949A (en) * 2017-10-16 2018-01-19 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model
CN107729480A (en) * 2017-10-16 2018-02-23 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device of limited area
CN107943786A (en) * 2017-11-16 2018-04-20 广州市万隆证券咨询顾问有限公司 A kind of Chinese name entity recognition method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HALIL KILICOGLU: "EFFECTIVE BIO-EVENT EXTRACTION USING TRIGGER WORDS AND SYNTACTIC DEPENDENCIES", 《COMPUTATIONAL INTELLIGENCE》 *
HASSAN H. MALIK: "Accurate information extraction for quantitative financial events", 《PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT》 *
陈倩: "基于特征模型的跨领域信息抽取方法研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
魏小梅: "生物事件抽取联合模型研究", 《中国优秀博硕士学位论文全文数据库(博士)信息科技辑》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684631A (en) * 2018-12-12 2019-04-26 北京神州泰岳软件股份有限公司 Name entity abstracting method, device and medium
CN109918639A (en) * 2018-12-13 2019-06-21 北京海致星图科技有限公司 A kind of bank's credit text resolution method based on depth learning technology and rule base
CN109918639B (en) * 2018-12-13 2024-02-13 北京海致星图科技有限公司 Bank credit text analysis method based on deep learning technology and rule base
CN109858040A (en) * 2019-03-05 2019-06-07 腾讯科技(深圳)有限公司 Name entity recognition method, device and computer equipment
CN110046349A (en) * 2019-03-26 2019-07-23 平安科技(深圳)有限公司 Information identifying method, device, equipment and storage medium based on Chinese case history
CN110188203A (en) * 2019-06-10 2019-08-30 北京百度网讯科技有限公司 Text polymerization, device, equipment and storage medium
CN111459973A (en) * 2020-06-16 2020-07-28 四川大学 Case type retrieval method and system based on case situation triple information
CN111459973B (en) * 2020-06-16 2020-10-23 四川大学 Case type retrieval method and system based on case situation triple information
CN113822013A (en) * 2021-03-08 2021-12-21 京东科技控股股份有限公司 Labeling method and device for text data, computer equipment and storage medium
CN113822013B (en) * 2021-03-08 2024-04-05 京东科技控股股份有限公司 Labeling method and device for text data, computer equipment and storage medium
CN113158677A (en) * 2021-05-13 2021-07-23 竹间智能科技(上海)有限公司 Named entity identification method and system
CN113158677B (en) * 2021-05-13 2023-04-07 竹间智能科技(上海)有限公司 Named entity identification method and system

Also Published As

Publication number Publication date
CN108647194B (en) 2022-04-19

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181012

Assignee: Zhongke Dingfu (Beijing) Science and Technology Development Co., Ltd.

Assignor: Beijing Shenzhou Taiyue Software Co., Ltd.

Contract record no.: X2019990000214

Denomination of invention: Method and device for extracting webpage information

License type: Exclusive License

Record date: 20191127

EE01 Entry into force of recordation of patent licensing contract
CB02 Change of applicant information

Address after: Room 818, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080

Applicant after: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601

Applicant before: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant