CN108647194B - Information extraction method and device

Info

Publication number: CN108647194B
Application number: CN201810401030.1A
Authority: CN
Inventors: 李德彦; 晋耀红; 吴相博
Original assignee: Ultrapower Software Co ltd
Current assignee: Ultrapower Software Co ltd
Priority date: 2018-04-28
Filing date: 2018-04-28
Publication date: 2022-04-19
Anticipated expiration: 2038-04-28
Also published as: CN108647194A

Abstract

The embodiment of the invention discloses an information extraction method and device. The method comprises the following steps: acquiring a text of information to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the region determination rule comprises a statistical operator, and the statistical operator represents a statistical model for identifying named entities and/or dependency components in the text; identifying the named entities and/or dependency components in the text by using the statistical model, and marking each identified named entity and/or dependency component with a corresponding identification tag; comparing the region determination rule with the text by using the identification tags, and determining an effective extraction region in the text; and extracting the character strings that match the information extraction rule from the effective extraction region. The method invokes the statistical model in the manner of a rule, which is convenient and flexible, widens the range of words that can be recognized, reduces the effort of rule construction, and extracts the information required by the user more accurately.

Description

Information extraction method and device

Technical Field

The invention relates to the field of text processing and information extraction, in particular to an information extraction method. In addition, the invention also relates to an information extraction device.

Background

Information extraction is a text processing technique that extracts factual information of specified types, such as entities, relationships, and events, from natural language text and outputs it as structured data. It can serve as a preprocessing step for operations such as intelligent question answering, deep mining of semantic information, and normalized information extraction.

The main approach to information extraction is rule-based extraction, which generally comprises two stages: constructing rule expressions, and applying the rule expressions to obtain the information required by the user. Rule expressions are constructed mainly by modeling personnel according to the extraction requirements and their own experience. A plurality of rule expressions organized in a particular form may be referred to as a rule model. The rule expressions in the rule model are matched against the text to extract the information required by the user.

A good rule model can achieve high accuracy and precision, but it must be constructed by professional modeling personnel who exhaustively enumerate the text elements to be matched, which consumes a large amount of labor and time. For example, if place names are to be used as building elements of a rule expression, that is, if place names must be accurately identified in the text with as few omissions as possible, the modeling personnel must enumerate all place names. How to reduce the large amount of labor and time needed to construct a rule model for information extraction is therefore a problem to be solved urgently by those skilled in the art.

Disclosure of Invention

In order to solve the technical problem, the application provides an information extraction method to reduce a large amount of manpower and time consumed by rule construction, and extract information from a text more comprehensively and accurately.

In a first aspect, an information extraction method is provided, including: acquiring a text of information to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the region determination rule comprises a statistical operator, and the statistical operator represents a statistical model for identifying a named entity and/or a dependency component in the text;

identifying the named entities and/or the dependency components in the text by using a statistical model, and respectively marking corresponding identification tags for the identified named entities and/or dependency components;

comparing the area determination rule with the text by using the identification tag, and determining an effective extraction area in the text;

and extracting the character strings matched with the information extraction rule from the effective extraction area.

With reference to the first aspect, in a first possible implementation manner of the first aspect, the information extraction rule includes a statistical operator;

the step of extracting the character string matched with the information extraction rule from the effective extraction area specifically comprises the following steps:

and extracting the character strings matched with the information extraction rule from the effective extraction area by using the identification tags.

With reference to the first implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the statistical model includes a first model for identifying a named entity and a second model for identifying a dependency component, and the identification tag includes a first tag and a second tag;

if the region determination rule includes only one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, and the information extraction rule includes the other one, the step of identifying the named entities and/or the dependency components in the text by using the statistical model and marking the identified named entities and/or dependency components with the corresponding identification tags specifically includes:

identifying named entities/dependency components in the text by using the first model/the second model, and marking a first label/a second label for each identified named entity/dependency component;

and identifying the dependent components/named entities in the effective extraction area by using the second model/the first model, and marking a second label/a first label for each identified dependent component/named entity.

With reference to the first aspect and the foregoing possible implementation manners, in a third possible implementation manner of the first aspect, the categories of the first tag include a person name tag, a place name tag, and an organization tag, and the categories of the second tag include a core component tag, a dependency word tag, an agent (actor) component tag, and a patient (receiver) component tag;

the step of tagging each identified named entity with a first tag, comprising:

if the named entity identified by the first model is a person name, a place name, or an organization, marking the identified named entity with the corresponding person name tag, place name tag, or organization tag;

the step of tagging a second tag for each identified dependent component includes:

if the dependency component identified by the second model is a core component, a dependency word, an agent (actor) component, or a patient (receiver) component, marking the identified dependency component with the corresponding core component tag, dependency word tag, agent component tag, or patient component tag;

comparing the area determination rule with the text by using the identification tag, and determining an effective extraction area in the text, wherein the step comprises the following steps:

comparing the region determination rule with the text, wherein if a specified label carried by a statistical operator in the region determination rule is matched with the type of the first label/the second label, the statistical operator is matched with the character string marking the first label/the second label, and the specified label is used for representing the type of the named entity or the type of the dependent component which is expected to be identified from the text by the user;

and determining an effective extraction area according to the position of the area determination rule matched with the text.

With reference to the first aspect and the foregoing possible implementation manners, in a fourth possible implementation manner of the first aspect, the step of extracting, by using the identification tag, a character string that matches the information extraction rule from the effective extraction area includes:

comparing the information extraction rule with the effective extraction area, wherein if the specified label carried by a statistical operator in the information extraction rule is matched with the type of the first label/the second label, the statistical operator is matched with the character string marking the first label/the second label;

and extracting the character strings matched with the information extraction rules.

With reference to the first aspect and the foregoing possible implementation manners, in a fifth possible implementation manner of the first aspect, the region determination rule further includes a regular expression, where the statistical operator and the regular expression have a sequential relationship and/or a logical operation relationship.

With reference to the first aspect and the foregoing possible implementation manners, in a sixth possible implementation manner of the first aspect, the region determination rule or the information extraction rule further includes a service element concept/a general concept, and the service element concept/general concept has a sequential relationship and/or a logical operation relationship with the statistical operator or the regular expression.

In a second aspect, an information extraction method is provided, including:

acquiring a text of information to be extracted and an extraction expression, wherein the extraction expression comprises an area determination rule and an information extraction rule, the information extraction rule comprises a statistical operator, and the statistical operator represents a statistical model for identifying a named entity and/or a dependency component in the text;

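identifying the named entities and/or the dependency components in the text by using a statistical model, and respectively marking corresponding identification tags for the identified named entities and/or dependency components;

determining an effective extraction area in the text by using the area determination rule;

and extracting the character strings matched with the information extraction rule from the effective extraction area by using the identification tags.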

In a third aspect, an information extraction apparatus is provided, including:

a first acquisition unit, configured to acquire a text of information to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the region determination rule comprises a statistical operator, and the statistical operator represents a statistical model used for identifying named entities and/or dependency components in the text;

the first processing unit is used for identifying the named entities and/or the dependency components in the text by using a statistical model, and marking the identified named entities and/or the dependency components with corresponding identification tags respectively; comparing the area determination rule with the text by using the identification tag, and determining an effective extraction area in the text; and extracting the character string matched with the information extraction rule from the effective extraction area.

In a fourth aspect, there is provided an information extraction apparatus comprising:

the second obtaining unit is used for obtaining a text of the information to be extracted and an extraction expression, wherein the extraction expression comprises an area determination rule and an information extraction rule, the information extraction rule comprises a statistical operator, and the statistical operator represents a statistical model used for identifying a named entity and/or a dependency component in the text;

the second processing unit is used for identifying the named entities and/or the dependency components in the text by using a statistical model, and marking the identified named entities and/or the dependency components with corresponding identification tags respectively; determining an effective extraction area in the text by using the area determination rule; and extracting the character string matched with the information extraction rule from the effective extraction area by using the identification label.

In the information extraction method, a text of information to be extracted and an extraction expression are first acquired. The extraction expression comprises a region determination rule and an information extraction rule, and the region determination rule and/or the information extraction rule comprise a statistical operator; in other words, a statistical model for identifying named entities and/or dependency components is defined as a statistical operator and introduced into a rule expression to obtain the extraction expression. Then, the named entities and/or dependency components in the text are identified by using the statistical model, and each identified named entity and/or dependency component is marked with a corresponding identification tag. Next, either the region determination rule is compared with the text by using the identification tags to determine an effective extraction region, and the character strings matching the information extraction rule are extracted from that region; or the effective extraction region is determined by using the region determination rule, and the character strings matching the information extraction rule are extracted from that region by using the identification tags. In this way, the statistical model for identifying named entities and/or dependency components is invoked in the manner of a rule, so that it participates in matching the extraction expression against the text, which makes the statistical model convenient and flexible to use. Compared with a pure rule expression, the method widens the range of words that can be recognized, so that the information required by the user can be extracted more comprehensively while avoiding the large amount of labor and time otherwise spent on constructing rule expressions; compared with a method based on a statistical model alone, the method extracts the information required by the user more accurately.

Drawings

In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.

FIG. 1 is a flow chart of a first embodiment of an information extraction method of the present application;

FIG. 2 is a flowchart of a first specific implementation manner of the first embodiment of the information extraction method of the present application;

FIG. 3 is a flowchart of a second specific implementation manner of the first embodiment of the information extraction method of the present application;

FIG. 4 is a flowchart of a third specific implementation manner of the first embodiment of the information extraction method of the present application;

FIG. 5 is a flowchart of one implementation manner of the step S300 in the first embodiment of the information extraction method of the present application;

FIG. 6 is a flowchart of one implementation manner of the step S410 in the first embodiment of the information extraction method of the present application;

FIG. 7 is a flow chart of a second embodiment of the information extraction method of the present application;

FIG. 8 is a schematic structural diagram of an embodiment of an information extraction device according to the present application;

FIG. 9 is a schematic structural diagram of a second embodiment of the information extraction device according to the present application.

Detailed Description

The following provides a detailed description of the embodiments of the present application.

In the rule-based extraction method, the rule expression includes an information extraction rule for extracting information in text that a user desires to extract. For example, the information extraction rule "medium stature | body type general" is matched with the text, and when the text shows "medium stature" or "body type general", the information of the body type in the text is extracted. In order to extract information more comprehensively, modeling personnel need to exhaust all possible expression forms one by one to construct a regular expression, and a great deal of labor and time are consumed.
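The enumeration burden described above can be illustrated with a minimal sketch. It assumes that Python's `re` module stands in for the rule engine; the function name `extract_body_type` is hypothetical and does not come from the application.

```python
import re

# Illustrative rule from the text: two enumerated phrasings for body type.
BODY_TYPE_RULE = re.compile(r"medium stature|body type general")


def extract_body_type(text: str) -> list:
    """Return every substring of `text` that matches the enumerated rule."""
    return BODY_TYPE_RULE.findall(text)


print(extract_body_type("The witness described a man of medium stature."))
print(extract_body_type("The witness described a man of average build."))
```

The second sentence expresses the same kind of information, but because its phrasing was never enumerated in the rule, nothing is extracted from it.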

In addition to rule-based extraction methods, information may also be extracted by statistics-based extraction methods. That is, a corpus in which the information that the user wishes to extract has been labeled is used to train a statistical model, such as a hidden Markov model (HMM), a maximum entropy Markov model (MEMM), a conditional random field (CRF) model, or a support vector machine (SVM) model, and the trained statistical model is then used to extract information. With a statistics-based extraction method, no professional modeling personnel are needed to construct rule expressions, which saves labor and time. However, statistics-based extraction is generally inferior to rule-based extraction in accuracy and precision. On the one hand, the accuracy of the statistical model suffers when the training corpus is not comprehensive enough; on the other hand, accuracy also suffers when the user's extraction requirement is complex and goes beyond simply extracting the named entities that statistical models handle well.

Therefore, the application provides a new information extraction method, in which a statistical model for identifying named entities and/or dependency components is defined as a statistical operator and introduced into a rule expression, so as to obtain an extraction expression. The statistical model is thus invoked in the manner of a rule and participates in the matching against the text, which makes it convenient and flexible to use. When the extraction expression is used to process text, compared with a pure rule expression, the range of words that can be recognized is widened, so that the information required by the user can be extracted more comprehensively while avoiding the large amount of labor and time otherwise spent on rule construction; compared with a method based on a statistical model alone, the information required by the user can be extracted more accurately.

The extraction expression in this application includes two parts: a region determination rule and an information extraction rule. The statistical operator that represents the statistical model can be introduced into the region determination rule only, into the information extraction rule only, or into both. For clarity, these three cases are described below through two embodiments: in the first embodiment, the region determination rule includes a statistical operator, and the information extraction rule may or may not include one; in the second embodiment, the information extraction rule includes a statistical operator, and the region determination rule may or may not include one.

Referring to fig. 1, in a first embodiment, an information extraction method is provided, which includes the following steps S100-S400.

S100: the method comprises the steps of obtaining a text of information to be extracted and an extraction expression, wherein the extraction expression comprises an area determination rule and an information extraction rule, the area determination rule comprises a statistical operator, and the statistical operator represents a statistical model used for identifying named entities and/or dependency components in the text.

In the present application, the text of the information to be extracted may be a text from the internet, or a text from a specific database, and the source and form of the text of the information to be extracted are not limited in the present application.

A named entity is an entity identified by a name, such as a person, an organization, or a place; in a broader sense, named entities can also include numbers, dates, currencies, addresses, and the like.

A dependency component is a syntactic component of a sentence, such as a core component, a dependency word, an agent component, or a patient component. In a sentence, words stand in directed governing relationships: the word that governs is called the governing word, i.e., the core component in the present application, and the word that is governed is called the dependent word, i.e., the dependency word in the present application. Generally, the verb serves as the center of the sentence and governs the other components; that is, the other components depend on the verb through various dependency relationships, and each relationship is unidirectional. Besides syntactic components such as core components and dependency words, the semantic roles attached to a predicate (a verb or a noun), such as the agent (actor) and the patient (receiver), can also be analyzed in a sentence, and each semantic role carries a certain meaning. The agent in the sentence is the agent component in the present application, and the patient is the patient component.

The statistical operators characterize the statistical model used to identify the named entities and/or dependency components in the text, that is, the statistical model is represented in the form of statistical operators, thereby facilitating application to the extraction expressions. The statistical model is a model that has been trained with labeled corpus, i.e., a statistical model with determined parameters.

The extraction expression includes two parts: a region determination rule and an information extraction rule. The region determination rule is used to determine an effective extraction region in the text. In one embodiment, the region determination rule may include a pre-positioning rule for determining a start position in the text and a post-positioning rule for determining an end position in the text. After the start position and the end position are determined, the text between the two positions is the effective extraction region. In this case, if at least one of the pre-positioning rule and the post-positioning rule contains a statistical operator, the region determination rule is considered to contain the statistical operator. In another embodiment, the region determination rule may include a center positioning rule, which is used to determine a center position in the text; the effective extraction region is then obtained by expanding from the center position into the surrounding context by a preset range.
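The two ways of delimiting the effective extraction region can be sketched as follows. This is a minimal illustration under stated assumptions: positions are character offsets, the end position is exclusive, and the preset range is a fixed number of characters on each side. The function names and the `window` parameter are hypothetical.

```python
def region_between(text: str, start: int, end: int) -> str:
    """Region bounded by a start position and an (exclusive) end position."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("positions must satisfy 0 <= start <= end <= len(text)")
    return text[start:end]


def region_around(text: str, center: int, window: int) -> str:
    """Region obtained by expanding `window` characters on each side of
    `center`, clipped to the bounds of the text."""
    low = max(0, center - window)
    high = min(len(text), center + window + 1)
    return text[low:high]


sample = "abcdefghij"
print(region_between(sample, 2, 5))
print(region_around(sample, 8, 3))
```

In the first function the positions would come from the pre-positioning and post-positioning rules; in the second, the center would come from the center positioning rule.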

A region determination rule that includes a statistical operator may include only the statistical operator, or may include both the statistical operator and a regular expression. When both are included, the statistical operator and the regular expression have a sequential relationship and/or a logical operation relationship. For example, the pre-positioning rule may take the form "PD is very beautiful", where PD is an entity identification operator representing a statistical model for recognizing named entities, and the text after PD ("is very beautiful") is a regular expression. In this example, the statistical operator and the regular expression have a sequential relationship: the rule matches a character string in which a recognized named entity, such as a person name, is immediately followed by "is very beautiful". For another example, the pre-positioning rule may take the form "PD (very beautiful | very beautiful)", where the two alternatives in parentheses are regular expressions for two different wordings of "very beautiful" joined by the logical operation "or", and PD has a sequential relationship with the parenthesized group as a whole. That is, a sequential relationship and/or a logical operation relationship may exist between two statistical operators, between two regular expressions, or between a statistical operator and a regular expression.

The information extraction rule is used for extracting information which is expected to be extracted by a user in the effective extraction area. In this embodiment, the information extraction rule may only include a regular expression, or may include a regular expression and a statistical operator. When the regular expression and the statistical operator are included, the regular expression and the statistical operator have a precedence relationship and/or a logical operation relationship, which are similar to the foregoing, and are not described herein again.

In one embodiment, the information extraction rule is placed between the pre-positioning rule and the post-positioning rule, with adjacent rules separated by "@"; that is, the extraction expression has the form "pre-positioning rule @ information extraction rule @ post-positioning rule". Here, the pre-positioning rule or the post-positioning rule may be empty. When the pre-positioning rule is empty, the first character of the whole text is taken as the start position by default; when the post-positioning rule is empty, the last character of the whole text is taken as the end position by default. The case where the pre-positioning rule or the post-positioning rule is empty can be regarded as a special case in which that positioning rule contains only a regular expression.
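The "@"-separated layout can be sketched as a small parser. This is a minimal sketch that assumes "@" never occurs inside a rule; the type `ExtractionExpression` and the function `parse_expression` are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Optional


class ExtractionExpression(NamedTuple):
    pre_rule: Optional[str]   # None: start at the first character of the text
    extract_rule: str
    post_rule: Optional[str]  # None: end at the last character of the text


def parse_expression(expression: str) -> ExtractionExpression:
    """Split 'pre @ extract @ post' into its three rules."""
    parts = [part.strip() for part in expression.split("@")]
    if len(parts) != 3:
        raise ValueError("expected exactly two '@' separators")
    pre, extract, post = parts
    # An empty positioning rule is recorded as None so the caller can
    # fall back to the default start or end position.
    return ExtractionExpression(pre or None, extract, post or None)


print(parse_expression("PD @ (medium stature | body type general) @"))
```

Representing an empty positioning rule as `None` keeps the defaulting behavior (first or last character of the text) in the code that computes the region rather than in the parser.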

Optionally, the statistical model includes a first model for identifying the named entity and a second model for identifying the dependency component. The statistical operators comprise an entity identification operator PD and a dependency component operator DC, the entity identification operator PD characterizes the first model, and the dependency component operator DC characterizes the second model.

The first model may employ an HMM model, a CRF model, or the like. In the training stage, the corpus with the labels is used for training, and the main parameters of the model are determined, so that a trained first model is obtained. In the using stage, the text to be recognized is input into the statistical model, and the named entity in the text to be recognized can be output. Similarly, the second model may also be trained by using an HMM model, an MEMM model, a CRF model, etc., except that the corpus with labels used for training is different from that used for training the first model, so that the main parameters of the trained model are also different, i.e., different second models are obtained. For the training of the first model and the second model, if the corpus with labels under different application scenes is used as the training corpus, the model parameters obtained by training are different, so that the trained statistical model can be more suitable for processing the corpus under the specific application scene. For example, if the training corpora are labeled financial news, the trained statistical model may be more suitable for processing the financial news, i.e., identifying named entities or dependency components from the financial news. The specific method for training the statistical model may be a method in the prior art, and is not described herein again.

For example, extraction expression 1 is "PD @ (medium stature | body type general) @". In this example, the pre-positioning rule contains only the statistical operator PD. When the pre-positioning rule is matched against the text, the corresponding statistical model is invoked to identify named entities in the text, and if a named entity is identified, its position is determined as the start position. The post-positioning rule is empty, so the last character of the text is taken as the end position. The information extraction rule is "(medium stature | body type general)"; that is, if the effective extraction region contains "medium stature" or "body type general", that string is extracted.
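Extraction expression 1 can be walked through with a short sketch. Several points are assumptions rather than statements from the application: the named entity offsets are supplied by the caller in place of a trained model, the region is taken to begin right after the first recognized entity, and nothing is extracted when no entity is found. The function name is hypothetical.

```python
import re
from typing import List, Tuple

# Information extraction rule of extraction expression 1.
EXTRACT_RULE = re.compile(r"medium stature|body type general")


def apply_expression_1(text: str, entity_spans: List[Tuple[int, int]]) -> List[str]:
    """Apply "PD @ (medium stature | body type general) @" to `text`.

    `entity_spans` holds the (start, end) offsets of the named entities
    that the model behind PD is assumed to have recognized already.
    """
    if not entity_spans:
        return []  # assumption: PD matched nothing, so there is no region
    _, first_end = min(entity_spans)
    region = text[first_end:]  # empty post-positioning rule: run to the end
    return EXTRACT_RULE.findall(region)


text = "Zhang San is a man of medium stature."
print(apply_expression_1(text, [(0, 9)]))
```

If "medium stature" appeared only before the recognized entity, it would fall outside the effective extraction region and would not be extracted.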

S200: and identifying the named entities and/or the dependent components in the text by using a statistical model, and marking the identified named entities and/or the dependent components with corresponding identification tags respectively.

In step S200, the text is used as the input data of the statistical model, and the identified named entities and/or dependency components are the output data of the statistical model. This identification may be performed by methods in the prior art and is not described here again. Each named entity and/or dependency component identified in the text is marked with a corresponding identification tag.
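The tagging in step S200 can be sketched as follows. The sketch is illustrative only: `toy_entity_model` is a dictionary lookup standing in for a trained statistical model, and the `Tag` structure, the `tag_text` function, and the use of the operator name as the tag label are all assumptions, not details given in the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Tag:
    start: int   # offset of the first character of the tagged string
    end: int     # offset one past the last character
    label: str   # identification tag, here the operator name


def toy_entity_model(text: str) -> List[Tuple[int, int]]:
    """Stand-in for a trained named entity model: looks up known names."""
    spans = []
    for name in ("Zhang San", "Beijing"):
        pos = text.find(name)
        while pos != -1:
            spans.append((pos, pos + len(name)))
            pos = text.find(name, pos + 1)
    return sorted(spans)


def tag_text(text: str, models: Dict[str, Callable]) -> List[Tag]:
    """Run every model on the text and tag each span it returns."""
    tags = []
    for label, model in models.items():
        tags.extend(Tag(start, end, label) for start, end in model(text))
    return sorted(tags, key=lambda tag: (tag.start, tag.end))


sentence = "Zhang San lives in Beijing."
for tag in tag_text(sentence, {"PD": toy_entity_model}):
    print(tag.label, sentence[tag.start:tag.end])
```

Keeping the offsets together with the label lets later matching steps check both which model produced a string and where that string sits in the text.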

When the extraction expression includes a plurality of statistical operators, the named entities or dependency components identified by each statistical model may be marked with different corresponding identification tags. For example, in one embodiment, referring to Table 1, the statistical operators may include an entity identification operator and a dependency component operator, denoted "PD" and "DC", respectively. The statistical model may include a first model for identifying named entities and a second model for identifying dependency components; the entity identification operator characterizes the first model, and the dependency component operator characterizes the second model. The identification tags comprise a first tag and a second tag: the contents identified by the first model are all named entities and are marked with the first tag, and the contents identified by the second model are all dependency components and are marked with the second tag.

TABLE 1 example of correspondence between statistical operators, statistical models, and identification tags

S300: and comparing the area determination rule with the text by using the identification tag, and determining an effective extraction area in the text.

Because the region determination rule can take several specific forms, the specific steps for determining the effective extraction region differ somewhat. For example, in one implementation, the region determination rule includes a pre-positioning rule and a post-positioning rule, at least one of which includes a statistical operator in this embodiment. The start position is determined by matching the pre-positioning rule against the text, and the end position is determined by matching the post-positioning rule against the text. After the start position and the end position are determined, the text between the two positions is the effective extraction region. In another implementation, the region determination rule includes a center positioning rule that includes a statistical operator; the center position is determined by matching the center positioning rule against the text, and the effective extraction region is then obtained by expanding from the center position into the surrounding context by a preset range. In either implementation, when the pre-positioning rule and/or the post-positioning rule, or the center positioning rule, is compared with the text, the corresponding identification tags are needed to judge whether the rule matches the text.

Specifically, if the pre-positioning rule, the post-positioning rule or the center positioning rule contains only a statistical operator, then as long as the statistical model corresponding to an identification tag is the same model as the one characterized by the statistical operator, the character string (named entity or dependency component) marked by that identification tag is considered to match the statistical operator in the rule. Because the rule contains only the statistical operator, the rule as a whole matches the character string marked by the identification tag. If the pre-positioning rule, the post-positioning rule or the center positioning rule contains both a statistical operator and a regular expression, the rule must match the text as a whole. That is, in addition to a character string in the text matching the statistical operator, the character strings before and after it must also match the regular expressions before and after the statistical operator in the rule; in other words, the statistical operator, the regular expression, and the sequential relationship and/or logical operation relationship between them must all be matched by the text.

For example, assuming that the prepositioning rule is "PD (very beautiful | very beautiful)", the rule can match the character strings "prince very beautiful", "lie very beautiful", etc. in the text, but cannot match the character strings like "prince very beautiful", "lie very beautiful", etc.
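The whole-rule matching described above can be sketched as follows. This is only an illustration under stated assumptions: the statistical model is replaced by a plain name lookup (`tag_entities` is a hypothetical stand-in, not the model of the application), the rule is limited to one operator followed by one regular expression, and the English sample sentence is invented for the sketch.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int
    end: int
    operator: str  # operator whose model produced this span, e.g. "PD"


def tag_entities(text: str, names: List[str], operator: str = "PD") -> List[Span]:
    """Hypothetical stand-in for a statistical model: look up known names."""
    spans = []
    for name in names:
        for m in re.finditer(re.escape(name), text):
            spans.append(Span(m.start(), m.end(), operator))
    return sorted(spans)


def match_operator_then_regex(text: str, spans: List[Span], operator: str,
                              tail_regex: str) -> List[Tuple[int, int]]:
    """Match a rule of the form <operator><regex> against tagged text.

    A tagged span matches the operator only if its tag comes from the same
    model; the regular expression must then match immediately after the span.
    """
    tail = re.compile(tail_regex)
    hits = []
    for span in spans:
        if span.operator != operator:
            continue
        m = tail.match(text, span.end)  # anchor the regex at the span end
        if m:
            hits.append((span.start, m.end()))
    return hits


text = "Wang Er Ni is very beautiful, and Zhang Fei is tall."
spans = tag_entities(text, ["Wang Er Ni", "Zhang Fei"])
hits = match_operator_then_regex(
    text, spans, "PD", r" is (very beautiful|quite beautiful)")
matched = [text[s:e] for s, e in hits]
print(matched)
```

Both names are tagged, but only the first is followed by text that satisfies the regular expression, so only that string matches the rule as a whole.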

S400: Extract the character strings matched with the information extraction rule from the effective extraction area.

In one implementation, the information extraction rule includes only a regular expression and no statistical operator. In that case, the regular expression is matched against the characters in the effective extraction area, and the character strings that match the information extraction rule, that is, the information the user wishes to extract from the text, are extracted.

The following is an example.

Text 1 of information to be extracted:

Wang Er Zhu used to have a general body type; later he kept up his daily exercise, and is now very robust.

Extraction expression 1: PD @ (medium stature | general body type) @

Using the statistical model characterized by the statistical operator PD (the entity identification operator) in the region determination rule, "Wang Er Zhu" in text 1 is identified as a named entity and marked with an identification tag, namely the first tag. The pre-positioning rule "PD" in the region determination rule is then compared with text 1. Since the statistical model corresponding to the first tag marked on "Wang Er Zhu" is the same as the statistical model characterized by PD in the pre-positioning rule, "Wang Er Zhu" matches the pre-positioning rule "PD", and the position of "Wang Er Zhu" in text 1 is determined as the start position. The post-positioning rule is null, so the last character of text 1 is determined to be the end position. Thus, effective extraction area 1 in text 1 is determined as "used to have a general body type; later he kept up his daily exercise, and is now very robust.".

In this example, suppose the post-positioning rule of extraction expression 1 is replaced by "(DC).{0,10}robust", where DC is the dependency component operator, whose statistical model can identify dependency components in the text, and ".{0,10}" is a regular expression. If a dependency component exists in the text and the word "robust" appears within 0 to 10 characters after that dependency component, the character string from the dependency component through "robust" matches the post-positioning rule. Applied to text 1, the statistical model characterized by DC identifies the dependency component "exercise" and marks it with the corresponding identification tag, the second tag. Since the statistical model corresponding to the second tag is the same as the statistical model characterized by DC, the character string "exercise" marked with the second tag matches "(DC)" in the post-positioning rule. "Robust" is in turn matched within 10 characters after "exercise", so the character string from "exercise" through "robust" matches the post-positioning rule. Effective extraction area 2 in text 1 is thereby determined as the text between "Wang Er Zhu" and "exercise", that is, "used to have a general body type; later he kept up his daily".

The information extraction rule is "(medium stature | general body type)", so the character string "general body type" can be extracted from effective extraction area 1.
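The worked example for extraction expression 1 can be traced with a short sketch. It assumes, based only on the examples in this description, that an extraction expression is laid out as pre-positioning rule, "@", information extraction rule, "@", post-positioning rule. The entity lookup `KNOWN_ENTITIES` is a hypothetical stand-in for the first model, the sketch supports only a pre-positioning rule that is exactly "PD" or empty, and the post-positioning rule is treated as a plain regular expression.

```python
import re

# Hypothetical stand-in for the first model (entity recognition).
KNOWN_ENTITIES = ["Wang Er Zhu"]


def split_expression(expression):
    """Split 'pre @ extract @ post' into its three rules."""
    pre, extract, post = (part.strip() for part in expression.split("@"))
    return pre, extract, post


def run_expression(text, expression):
    pre, extract, post = split_expression(expression)
    start, end = 0, len(text)
    if pre == "PD":
        hits = [m for name in KNOWN_ENTITIES
                for m in re.finditer(re.escape(name), text)]
        if not hits:
            return None, None
        first = min(hits, key=lambda m: m.start())
        start = first.end()  # region begins after the matched entity
    if post:
        m = re.search(post, text[start:])
        if m is None:
            return None, None
        end = start + m.start()  # region ends before the post match
    region = text[start:end]
    found = re.search(extract, region)
    return region, (found.group(0) if found else None)


text_1 = ("Wang Er Zhu used to have a general body type; later he kept up "
          "his daily exercise, and is now very robust.")
region, extracted = run_expression(
    text_1, "PD @ (medium stature|general body type) @")
print(repr(region))
print(extracted)
```

With an empty post-positioning rule the region runs to the end of the text, which mirrors effective extraction area 1 above.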

Optionally, referring to fig. 2, in another implementation manner of S400, the information extraction rule includes a statistical operator, that is, as shown in step S101 in fig. 2, the step of extracting a character string matched with the information extraction rule from the effective extraction area specifically includes:

S410: Extract the character strings matched with the information extraction rule from the effective extraction area by using the identification tags.

Here, when the information extraction rule includes a statistical operator, matching it against the character strings in the effective extraction area is similar to matching the pre-positioning rule and the like against the text in the aforementioned step S300. If the information extraction rule contains only a statistical operator, then as long as the statistical model corresponding to an identification tag is the same model as the one characterized by the statistical operator, the character string (named entity or dependency component) marked by that identification tag is considered to match the statistical operator. Because the information extraction rule contains only the statistical operator, the character string marked by the identification tag matches the rule and is extracted. If the information extraction rule contains both a statistical operator and a regular expression, the rule must match the text as a whole. That is, in addition to a character string in the text matching the statistical operator, the character strings before and after it must also match the regular expressions before and after the statistical operator in the rule; in other words, the statistical operator, the regular expression, and the sequential relationship and/or logical operation relationship between them must all be matched by the text, and the matched character string is extracted as the information extracted from the text.

Continuing the foregoing example of text 1, extraction expression 2 is: PD @ (medium stature | general body type).{0,10}DC @

By the pre-positioning rule and the post-positioning rule, effective extraction area 1 is determined to be "used to have a general body type; later he kept up his daily exercise, and is now very robust.".

The statistical model characterized by the dependency component operator DC identifies "exercise" in text 1 as a dependency component and marks it with the second tag. The "exercise" in the effective extraction area is replaced with the second tag; since the statistical model corresponding to the second tag is the same as the statistical model characterized by DC, the character string "exercise" marked with the second tag matches "DC" in the information extraction rule. The character string "general body type" is matched within 0 to 10 characters before "exercise", so "general body type; later he kept up his daily exercise" matches the information extraction rule and is extracted from the effective extraction area.
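One way to realize the "replace the tagged string with its tag" step is to substitute a sentinel character for each tagged span and then run an ordinary regular expression. The sketch below is an assumption about how this could be done, not the implementation of the application: the sentinel characters, the helper names, and the hard-coded span for "exercise" are all hypothetical, and the character window is widened to 30 because the English rendering of text 1 is longer than the original.

```python
import re

# Private-use characters used as tag sentinels (assumed absent from the text).
SENTINEL = {"DC": "\uE000", "PD": "\uE001"}


def substitute_tags(text, spans):
    """Replace each tagged span by its operator's sentinel character."""
    out, removed, cursor = [], [], 0
    for start, end, operator in sorted(spans):
        out.append(text[cursor:start])
        out.append(SENTINEL[operator])
        removed.append(text[start:end])
        cursor = end
    out.append(text[cursor:])
    return "".join(out), removed


def compile_rule(rule):
    """Turn operator names in the rule into sentinels, then compile."""
    for operator, sentinel in SENTINEL.items():
        rule = rule.replace(operator, sentinel)
    return re.compile(rule)


def restore(transformed, start, end, removed):
    """Map a match on the transformed text back to the original wording."""
    sentinels = set(SENTINEL.values())
    index = sum(ch in sentinels for ch in transformed[:start])
    pieces = []
    for ch in transformed[start:end]:
        if ch in sentinels:
            pieces.append(removed[index])
            index += 1
        else:
            pieces.append(ch)
    return "".join(pieces)


region = (" used to have a general body type; later he kept up his daily "
          "exercise, and is now very robust.")
pos = region.index("exercise")
spans = [(pos, pos + len("exercise"), "DC")]  # what the second model would tag
transformed, removed = substitute_tags(region, spans)
m = compile_rule(r"(medium stature|general body type).{0,30}DC").search(transformed)
extracted = restore(transformed, m.start(), m.end(), removed)
print(extracted)
```

Because only tagged positions become sentinels, an untagged occurrence of the same word elsewhere in the text would not satisfy the operator.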

By this method, rules and statistics are combined: the statistical model can be flexibly invoked in the region determination rule and/or the information extraction rule, and can be combined with regular expressions to obtain extraction expressions of richer forms. Compared with a purely rule-based expression, extracting information with such an extraction expression enlarges the range of recognized words, extracts the information required by the user more comprehensively, and avoids spending a large amount of labor and time on constructing rules. Compared with a method based on a statistical model alone, it extracts the information required by the user more accurately.

When an extraction expression includes a plurality of statistical operators, the statistical model characterized by each statistical operator may be used to recognize the text once, and the identified contents are marked with the corresponding identification tags. As a further alternative, referring to fig. 3 and 4, if the region determination rule includes only one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, and the information extraction rule includes the other, the step of S200 includes:

S201: identifying named entities in the text by using the first model, and marking a first tag for each identified named entity;

S202: identifying dependency components in the effective extraction area by using the second model, and marking a second tag for each identified dependency component.

Alternatively, the step of S200 includes:

S203: identifying dependency components in the text by using the second model, and marking a second tag for each identified dependency component;

S204: identifying named entities in the effective extraction area by using the first model, and marking a first tag for each identified named entity.

When the information extraction rule includes a statistical operator that is not contained in the region determination rule, it is not necessary to recognize the whole text once with the statistical model of every statistical operator in the extraction expression. Instead, the statistical model characterized by the statistical operator in the region determination rule recognizes the text once; after the effective extraction area is determined, the statistical model characterized by the statistical operator that is contained in the information extraction rule but not in the region determination rule recognizes only the effective extraction area. This shortens the text that some of the statistical models need to recognize, which speeds up recognition and therefore information extraction.

It should be noted that in this implementation, steps S202 and S204 logically come after step S300. In the present application, the step numbers are used merely for convenience of description and do not limit the order of the steps in the method; the order in which the steps are executed may be changed as long as the logic remains reasonable.
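The staged recognition of S201 and S202 can be illustrated with stub models that count how many characters each one has to process. Everything here is an assumption made for the sketch: `StubModel` is a vocabulary lookup rather than a statistical model, and the effective extraction area is simplified to "everything after the first entity match".

```python
import re


class StubModel:
    """Hypothetical stand-in for a statistical model; counts its workload."""

    def __init__(self, operator, vocabulary):
        self.operator = operator
        self.vocabulary = vocabulary
        self.chars_seen = 0

    def recognize(self, text, offset=0):
        self.chars_seen += len(text)
        spans = []
        for word in self.vocabulary:
            for m in re.finditer(re.escape(word), text):
                spans.append((offset + m.start(), offset + m.end(), self.operator))
        return sorted(spans)


def staged_recognition(text, region_model, extraction_model):
    # S201: the model used by the region determination rule sees the full text.
    region_spans = region_model.recognize(text)
    # S300 (simplified): the region starts after the first tagged span.
    start = region_spans[0][1] if region_spans else 0
    # S202: the other model only sees the effective extraction area.
    extraction_spans = extraction_model.recognize(text[start:], offset=start)
    return region_spans, extraction_spans


text_1 = ("Wang Er Zhu used to have a general body type; later he kept up "
          "his daily exercise, and is now very robust.")
first_model = StubModel("PD", ["Wang Er Zhu"])
second_model = StubModel("DC", ["exercise"])
region_spans, extraction_spans = staged_recognition(text_1, first_model, second_model)
print(first_model.chars_seen, second_model.chars_seen)
```

The second model processes fewer characters than the first, which is the saving the paragraph above describes; the offsets it returns are still positions in the full text.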

Alternatively, as shown in table 2, the categories of the first tag may include a person name tag, a place name tag, and an organization tag, and the categories of the second tag include a core component tag, a dependency word tag, an action component tag, and a subject component tag.

In step S201 and/or step S204, the step of marking a first label for each identified named entity includes:

if the named entity identified by the first model is a person name, a place name or an organization, marking the identified named entity with the corresponding person name tag, place name tag or organization tag.

In step S202 and/or step S203, the step of labeling each identified dependent component with a second label includes:

if the dependency component identified by the second model is a core component, a dependency word, an action component or a subject component, marking the identified dependency component with the corresponding core component tag, dependency word tag, action component tag or subject component tag.

If the region determination rule includes a statistical operator, referring to fig. 5, the step of S300 may include:

S301: comparing the region determination rule with the text, wherein if the specified tag carried by a statistical operator in the region determination rule matches the category of a first tag/second tag, the statistical operator matches the character string marked with that first tag/second tag, the specified tag being used to represent the category of named entity or dependency component that the user expects to identify from the text;

S302: determining the effective extraction area according to the position at which the region determination rule matches the text.

If the information extraction rule includes a statistical operator, referring to fig. 6, the step of S410 may include:

S411: comparing the information extraction rule with the effective extraction area, wherein if the specified tag carried by a statistical operator in the information extraction rule matches the category of a first tag/second tag, the statistical operator matches the character string marked with that first tag/second tag;

S412: extracting the character strings matched with the information extraction rule.

TABLE 2 example of correspondence between multiple statistical operators, multiple statistical models, and multiple identification tags

For a statistical operator in either the region determination rule or the information extraction rule, the named entities/dependency components identified by the statistical model that the operator characterizes can be further classified and marked with tags of different categories. The specified tag carried by a statistical operator in the extraction expression then defines more precisely the information the user expects to extract from the text, which reduces extraction errors and improves the accuracy of the extracted information.

For example, the information extraction rule a is "PD.{0,2}very beautiful", and the information extraction rule b is "(PD_PER).{0,2}very beautiful".

The effective extraction area of text 2 is: henan province is very beautiful, and Wang Er Ni who came from Henan province is also very beautiful.

When information extraction rule a is used, the first model characterized by PD identifies the named entities "Henan province", "Henan province" and "Wang Er Ni" in the effective extraction area, and a first tag is marked on each of the three character strings. Information extraction rule a is then matched against the effective extraction area of text 2. The statistical model corresponding to the first tag on the first "Henan province" is the first model, the same model that PD characterizes, so the first "Henan province" matches PD. The first "Henan province" is followed by "very beautiful" at an interval of 0 characters, which matches the regular expression ".{0,2}very beautiful" in information extraction rule a. Therefore, "Henan province is very beautiful" matches rule a and can be extracted.

Similarly, the second "Henan province" matches PD; however, although the subsequent character string contains "very beautiful", it is more than 2 characters away from the second "Henan province", so this character string does not match rule a.

Similarly, "Wang Er Ni" matches PD, and the subsequent character string "very beautiful" is 1 character away from "Wang Er Ni", which matches the regular expression ".{0,2}very beautiful" in information extraction rule a. Therefore, "Wang Er Ni is also very beautiful" matches rule a and can be extracted.

When information extraction rule b is used, the first model characterized by PD identifies the named entities "Henan province", "Henan province" and "Wang Er Ni" in the effective extraction area. Since the two occurrences of "Henan province" are place names, each is marked with a place name tag; "Wang Er Ni" is a person name and is marked with a person name tag.

Information extraction rule b is then matched against the effective extraction area of text 2. Although the statistical model corresponding to the place name tags on the first and second "Henan province" is the first model, the same model that PD characterizes, the specified tag carried by the statistical operator in rule b is "PER"; that is, the category of named entity the user expects to identify from the text is a person name, which does not match the place name tag. The statistical operator "(PD_PER)" in rule b therefore matches neither occurrence of "Henan province".

For "Wang Er Ni", the tag is a person name tag, which matches the statistical operator "(PD_PER)" in rule b. The "very beautiful" after "Wang Er Ni" matches the regular expression ".{0,2}very beautiful" in rule b, so "Wang Er Ni is also very beautiful" matches rule b and can be extracted.

It can be seen that extraction with information extraction rule a yields two character strings from the effective extraction area, "Henan province is very beautiful" and "Wang Er Ni is also very beautiful", whereas extraction with information extraction rule b yields only "Wang Er Ni is also very beautiful" from the same effective extraction area.
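The difference between rule a and rule b can be reproduced with a small sketch in which each tagged span carries a category. The gazetteer standing in for the first model, the category code "LOC" for place names, and the compact English sample sentence (written so that the 0 to 2 character window still applies) are all assumptions of this sketch; only "PD" and "PER" come from the rules above.

```python
import re
from typing import List, NamedTuple


class Tagged(NamedTuple):
    start: int
    end: int
    operator: str  # "PD": produced by the entity model
    category: str  # "PER" for person names, "LOC" (hypothetical) for places


# Hypothetical gazetteer standing in for the first model.
GAZETTEER = {"Henan province": "LOC", "Wang Er Ni": "PER"}


def tag(text: str) -> List[Tagged]:
    spans = []
    for name, category in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            spans.append(Tagged(m.start(), m.end(), "PD", category))
    return sorted(spans)


def extract(text: str, spans: List[Tagged], operator: str, tail_regex: str) -> List[str]:
    """Apply <operator><regex>; 'PD_PER' restricts matches to category PER."""
    name, _, wanted = operator.partition("_")
    tail = re.compile(tail_regex)
    results = []
    for span in spans:
        if span.operator != name:
            continue
        if wanted and span.category != wanted:
            continue  # specified tag does not match the tag category
        m = tail.match(text, span.end)
        if m:
            results.append(text[span.start:m.end()])
    return results


text_2 = ("Henan province: very beautiful. "
          "From Henan province, Wang Er Ni: very beautiful too.")
spans = tag(text_2)
rule_a = extract(text_2, spans, "PD", r".{0,2}very beautiful")
rule_b = extract(text_2, spans, "PD_PER", r".{0,2}very beautiful")
print(rule_a)
print(rule_b)
```

Rule a returns a place and a person, while rule b keeps only the person; the second "Henan province" is rejected by both because "very beautiful" lies too far after it.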

Alternatively, either the region determination rule or the information extraction rule may include a business element concept and/or a general concept.

In the present application, the general concept refers to word meaning information of vocabularies irrelevant to specific services in the text and semantic relevance among the vocabularies. A general concept may represent a group of words or may represent a sentence. A general concept is a description of an object that reflects an abstract representation of the essential attributes of the object it describes, such as time, place, mood, rating, etc. The general concept can be reused in different fields and different application scenes. The general concept may be denoted by "c".

For example, the general concept of "negate", i.e., "c_negate", may represent negation words such as "none" and the like. That is, when a text includes any of these words, that word in the text is considered to match the general concept "c_negate".

As another example, the general concept of "unsatisfied", i.e., "c_unsatisfied", may represent "[^not]{0,5}unsatisfied". When matched against a text, "[^not]{0,5}unsatisfied" matches "unsatisfied" together with the 0 to 5 characters before it, provided that none of those characters is "not"; for example, it matches "very unsatisfied" while excluding expressions with the opposite meaning, such as "not unsatisfied". Therefore, if a text contains "unsatisfied" preceded by 0 to 5 characters that do not include "not", that character string is considered to match the general concept "c_unsatisfied".

The service element concept refers to semantic information of vocabularies related to specific services and semantic relevance among the vocabularies. Similar to the general concept, the business element concept may also represent a set of words or may represent a sentence. The service element concept is a description of an object or an attribute thereof related to a service, and is often related to a field and different services, and cannot be reused in different fields or different application scenes. The business element concept may be denoted by "e".

For example, in the field of bank credit card customer service, the business element concept of "fake information", i.e., "e_fake information", may represent words such as "fake short message", "fake incoming call", "fake mail", and the like. When a text includes any one of "fake short message", "fake incoming call" and "fake mail", that word in the text is considered to match the business element concept "e_fake information".

The semantic model is a text expression form which is oriented to known concepts and used for describing the semantics of the known concepts in a short way from sample data. A plurality of general concepts irrelevant to the service are organized in a tree structure to form a concept tree. A concept tree can be understood as a semantic model. A plurality of service element concepts related to the service are organized in a tree structure to form an element tree. An element tree can also be understood as a semantic model. The text can be identified by utilizing the semantic model, and whether character strings matched with the general concepts or the business element concepts in the semantic model exist in the text or not can be determined.

In the scheme of the application, the general concept or the business element concept can be introduced into the rule to form the extraction expression with richer forms, so that the information can be accurately and comprehensively extracted.
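Introducing a concept into a rule can be pictured as expanding the concept name into the set of wordings it represents. The sketch below assumes a flat dictionary in place of the concept tree or element tree, and the `expand` helper is hypothetical; the concept names and word lists are taken from the examples in this description.

```python
import re

# Flat stand-in for the concept tree / element tree (an assumption of this sketch).
CONCEPTS = {
    "c_positive evaluation": ["very beautiful", "rich in products", "smart"],
    "e_fake information": ["fake short message", "fake incoming call", "fake mail"],
}


def expand(rule: str) -> str:
    """Replace each concept name in a rule by an alternation of its words."""
    for name, words in CONCEPTS.items():
        alternation = "(?:" + "|".join(re.escape(w) for w in words) + ")"
        rule = rule.replace(name, alternation)
    return rule


fake_hit = re.search(expand("e_fake information"),
                     "The customer reported a fake incoming call yesterday.")
no_hit = re.search(expand("c_positive evaluation"), "nothing to report")
print(fake_hit.group(0) if fake_hit else None)
print(no_hit)
```

Because the expansion yields an ordinary regular expression, a concept can sit next to other regular expressions in the same rule, in the same way that a statistical operator does.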

In this embodiment, the region determination rule includes the statistical operator, so that when the region determination rule also includes the business element concept and/or the general concept, the statistical operator and the business element concept and/or the general concept also have a precedence relationship and/or a logical operation relationship, similar to the case of further including the regular expression. Furthermore, the region determination rule may also include statistical operators, regular expressions, business element concepts, and/or general concepts.

The information extraction rule in this embodiment may or may not include a statistical operator. It may include one or more of a business element concept, a general concept, a statistical operator and a regular expression, selected and combined according to the application scenario so that information is extracted more accurately. When the information extraction rule contains several of these elements, the business element concept/general concept and the statistical operator or the regular expression have a sequential relationship and/or a logical operation relationship.

This is further illustrated below by way of an example.

Extraction expression 3: @ (PD_PER | PD_POS) @ c_positive evaluation

Among them, the general concept (c) is:

c_positive evaluation: very beautiful, rich in products, smart.

Text 3: wang Ernie, although not reading a book, her son is clever to fly.

The pre-positioning rule is null, so the effective extraction area starts from the first character of text 3. The post-positioning rule is "c_positive evaluation", which matches "smart" in text 3, and the matched position is taken as the end position. From the start position and the end position, the effective extraction area in text 3 is determined to be "Although Wang Er Ni has never read books, her son Zhang Fei".

The first model characterized by PD performs recognition in the effective extraction area and identifies "Wang Er Ni" and "Zhang Fei", each of which is marked with a person name tag. Since the category of both matches the specified tag "PER" carried by PD in the information extraction rule, "Wang Er Ni" and "Zhang Fei" match the information extraction rule "(PD_PER | PD_POS)", and the two character strings "Wang Er Ni" and "Zhang Fei" can be extracted from text 3.

Optionally, a separation distance between the region determination rule and the information extraction rule may also be defined.

For example, extraction expression 4: @ (PD_PER | PD_POS) @ {0,2} c_positive evaluation

Here, "{0,2}" in the region determination rule means that the distance between the extracted person name or place name and the position matched by the general concept "c_positive evaluation" is 0 to 2 characters.

Continuing with text 3, although "Wang Er Ni" matches the information extraction rule, the distance between "Wang Er Ni" and "smart" exceeds 2 characters, whereas the distance between "Zhang Fei" and "smart" is 0 characters. Therefore, only the character string "Zhang Fei" is extracted from text 3.
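The effect of the separation distance can be sketched by comparing an unrestricted window with a narrow one. The name list standing in for the first model, the helper `extract_near_concept`, and the English rendering of text 3 are assumptions of this sketch. The narrow window is set to 4 rather than 2 only because the English rendering places " is " between the name and "smart".

```python
import re

PERSON_NAMES = ["Wang Er Ni", "Zhang Fei"]  # hypothetical stand-in for the first model
POSITIVE_EVALUATION = ["very beautiful", "rich in products", "smart"]  # c_positive evaluation


def extract_near_concept(text, max_gap):
    """Extract person names that end within max_gap characters of the concept match."""
    concept = re.search("|".join(map(re.escape, POSITIVE_EVALUATION)), text)
    if concept is None:
        return []
    region_end = concept.start()  # post-positioning rule fixes the end position
    results = []
    for name in PERSON_NAMES:
        for m in re.finditer(re.escape(name), text[:region_end]):
            if region_end - m.end() <= max_gap:
                results.append(name)
    return results


text_3 = "Although Wang Er Ni has never read books, her son Zhang Fei is smart."
unrestricted = extract_near_concept(text_3, max_gap=len(text_3))
restricted = extract_near_concept(text_3, max_gap=4)
print(unrestricted)
print(restricted)
```

Without a distance limit both names are extracted, as with extraction expression 3; with the limit only the name adjacent to the evaluation remains, as with extraction expression 4.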

Referring to fig. 7, in a second embodiment, an information extraction method is provided, which includes the following steps S500-S800.

S500: the method comprises the steps of obtaining a text of information to be extracted and an extraction expression, wherein the extraction expression comprises an area determination rule and an information extraction rule, the information extraction rule comprises a statistical operator, and the statistical operator represents a statistical model used for identifying named entities and/or dependency components in the text.

S600: and identifying the named entities and/or the dependent components in the text by using a statistical model, and marking the identified named entities and/or the dependent components with corresponding identification tags respectively.

S700: and determining an effective extraction area in the text by using the area determination rule.

S800: and extracting the character strings matched with the information extraction rule from the effective extraction area by using the identification tags.

For the text of information to be extracted, the named entities, the dependency components, the statistical operator, the extraction expression, and the like in step S500, reference may be made to the related description of step S100 in the first embodiment, which is not repeated here. This step differs from step S100 in that, in the extraction expression acquired here, the information extraction rule includes a statistical operator, while the region determination rule may or may not include a statistical operator.

For step S600, reference may be made to the related description of step S200 in the first embodiment, which is not repeated here.

If the region determination rule does not include any statistical operator, step S600 may logically come after step S700, and specifically includes: identifying the named entities and/or the dependency components in the effective extraction area by using the statistical model, and marking the identified named entities and/or dependency components with corresponding identification tags respectively.

In step S700, if the region determination rule does not include a statistical operator, the effective extraction area may be determined directly by a method in the prior art, such as regular expression matching. If the region determination rule also includes a statistical operator, step S700 may specifically include: comparing the region determination rule with the text by using the identification tags, and determining the effective extraction area in the text.

For this step, reference may be specifically made to the related description of the step S300 in the first embodiment, and details are not described herein.

The step of step S800 may refer to the related description of step S410 in the case that the information extraction rule includes the statistical operator in the first embodiment, and details are not repeated here.
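For the case in which the region determination rule contains no statistical operator, steps S700, S600 and S800 can be arranged as in the following sketch, where recognition runs only on the effective extraction area. The function name, the vocabulary lookup standing in for the statistical model, and the use of plain regular expressions for the pre-positioning and post-positioning rules are assumptions made for illustration.

```python
import re


def extract_second_embodiment(text, pre_regex, post_regex, vocabulary):
    # S700: determine the effective extraction area with plain regular expressions.
    pre = re.search(pre_regex, text) if pre_regex else None
    start = pre.end() if pre else 0
    post = re.search(post_regex, text[start:]) if post_regex else None
    end = start + post.start() if post else len(text)
    region = text[start:end]

    # S600 (run after S700): recognize entities only inside the region.
    # The vocabulary lookup is a hypothetical stand-in for the statistical model.
    spans = []
    for word in vocabulary:
        for m in re.finditer(re.escape(word), region):
            spans.append((start + m.start(), start + m.end()))

    # S800: the rule here is a bare operator, so every tagged string is extracted.
    return [text[s:e] for s, e in sorted(spans)]


names = ["Wang Er Ni", "Zhang Fei"]
text_3 = "Although Wang Er Ni has never read books, her son Zhang Fei is smart."
both = extract_second_embodiment(text_3, "", "smart", names)
only_first = extract_second_embodiment(
    "Zhang Fei is smart, says Wang Er Ni.", "", "smart", names)
print(both)
print(only_first)
```

In the second call the name after "smart" lies outside the effective extraction area, so the stand-in model never sees it.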

Similarly to the first embodiment, if the information extraction rule includes any one of a statistical operator characterizing the first model and a statistical operator characterizing the second model, and the region determination rule includes only the other one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, the step of S600 includes:

S601: identifying named entities in the text by using the first model, and marking a first tag for each identified named entity;

S602: identifying dependency components in the effective extraction area by using the second model, and marking a second tag for each identified dependency component.

Alternatively, the step of S600 includes:

S603: identifying dependency components in the text by using the second model, and marking a second tag for each identified dependency component;

S604: identifying named entities in the effective extraction area by using the first model, and marking a first tag for each identified named entity.

In this way, it is not necessary to recognize the whole text once with the statistical model of every statistical operator in the extraction expression. Instead, the statistical model characterized by the statistical operator in the region determination rule recognizes the text once; after the effective extraction area is determined, the statistical model characterized by the statistical operator that is contained in the information extraction rule but not in the region determination rule recognizes only the effective extraction area. This shortens the text that some of the statistical models need to recognize, which speeds up recognition and therefore information extraction.

Alternatively, the categories of the first tag may include a person name tag, a place name tag, and an organization tag, and the categories of the second tag include a core component tag, a dependent word tag, an action component tag, and a subject component tag, similarly to the first embodiment. Either the region determination rule or the information extraction rule may include a regular expression, a business element concept, and/or a general concept. When the region determination rule includes one or more of the regular expression, the service element concept, the general concept, and the statistical operator, different regular expressions, different service element concepts, different general concepts, and/or different statistical operators may be combined, that is, they have a sequential relationship and/or a logical operation relationship. Reference may be made to the related description in the first embodiment, and details are not repeated herein.

In a third embodiment of the present application, an information extraction apparatus corresponding to the foregoing information extraction method is provided. Referring to fig. 8, in a first implementation manner, the information extraction apparatus includes:

a first acquisition unit 1, configured to acquire a text of information to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the region determination rule comprises a statistical operator, and the statistical operator characterizes a statistical model used for identifying named entities and/or dependency components in the text;

a first processing unit 2, configured to identify the named entities and/or the dependency components in the text by using the statistical model, and mark corresponding identification tags for the identified named entities and/or dependency components respectively; compare the region determination rule with the text by using the identification tags, and determine an effective extraction area in the text; and extract the character string matched with the information extraction rule from the effective extraction area.
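The flow carried out by the two units above can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the operator spelling `@PERSON`, the inline tag format `<PERSON>...</PERSON>`, the name lexicon standing in for a statistical model, and the choice of "one line of text" as the candidate area are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

@dataclass
class ExtractionExpression:
    region_rule: str      # may contain the hypothetical operator @PERSON
    extraction_rule: str  # a plain regular expression in this sketch

def toy_model(text: str) -> List[Span]:
    """Stand-in for a statistical NER model: a fixed name lexicon."""
    return [(m.start(), m.end(), "PERSON")
            for m in re.finditer(r"Li Ming|Wang Fang", text)]

def mark(text: str, spans: List[Span]) -> str:
    """Wrap every identified string in an inline identification tag."""
    out, last = [], 0
    for start, end, label in sorted(spans):
        out.append(text[last:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        last = end
    out.append(text[last:])
    return "".join(out)

def compile_rule(rule: str) -> "re.Pattern[str]":
    """Translate each @LABEL operator into a regex over the marked text."""
    return re.compile(re.sub(r"@([A-Z]+)", r"<\1>[^<]*</\1>", rule))

def extract(text: str, expr: ExtractionExpression) -> List[str]:
    marked = mark(text, toy_model(text))          # identify and tag
    region_re = compile_rule(expr.region_rule)
    for line in marked.split("\n"):               # assumed: one candidate area per line
        if region_re.search(line):                # area determined via the tags
            plain = re.sub(r"</?[A-Z]+>", "", line)
            return re.findall(expr.extraction_rule, plain)
    return []

text = ("Li Ming called about roaming.\n"
        "Wang Fang reported a charge of 30 yuan.\n"
        "A charge of 99 yuan was refunded.")
expr = ExtractionExpression(region_rule=r"@PERSON reported", extraction_rule=r"\d+ yuan")
result = extract(text, expr)
```

Only the second line satisfies the region rule, so the amount on the third line is not extracted even though it matches the extraction rule.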

Optionally, the first processing unit 2 is specifically configured to, when the information extraction rule includes a statistical operator, extract, by using the identification tag, a character string matching the information extraction rule from the effective extraction area.

Optionally, the statistical model comprises a first model for identifying the named entity and a second model for identifying the dependency component, and the identification tag comprises a first tag and a second tag. The first processing unit 2 is further specifically configured to, in a case where the information extraction rule includes one of the statistical operator characterizing the first model and the statistical operator characterizing the second model, and the region determination rule includes only the other one of the two, identify named entities/dependency components in the text by using the first model/second model, and mark a first tag/second tag for each identified named entity/dependency component; and identify the dependency components/named entities in the effective extraction area by using the second model/first model, and mark a second tag/first tag for each identified dependency component/named entity.

Optionally, the categories of the first tag include a person name tag, a place name tag, and an organization tag, and the categories of the second tag include a core component tag, a dependency word tag, an agent component tag, and a patient component tag.

The first processing unit 2 is further specifically configured to: in a case that the named entity identified by using the first model is a person name, a place name, or an organization, mark the identified named entity with the corresponding person name tag, place name tag, or organization tag; in a case that the dependency component identified by using the second model is a core component, a dependency word, an agent component, or a patient component, mark the identified dependency component with the corresponding core component tag, dependency word tag, agent component tag, or patient component tag; compare the region determination rule with the text; and determine the effective extraction area according to the position at which the region determination rule matches the text. If the specified label carried by a statistical operator in the region determination rule matches the category of a first tag/second tag, the statistical operator matches the character string marked with that first tag/second tag. The specified label represents the type of named entity or the type of dependency component that the user expects to identify from the text.
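The matching of a specified label against tag categories described above can be sketched as follows. The raw type codes (PER, LOC, ORG), the category names, and the fixed character window used to derive the area from the match position are assumptions introduced for illustration; only first-tag categories are shown, and second-tag categories would be handled the same way.

```python
from typing import List, NamedTuple, Optional, Tuple

class Tagged(NamedTuple):
    start: int
    end: int
    category: str

# Raw model entity types mapped to first-tag categories (names are illustrative).
FIRST_TAGS = {"PER": "person_name", "LOC": "place_name", "ORG": "organization"}

def label_entities(raw: List[Tuple[int, int, str]]) -> List[Tagged]:
    """Mark each recognized entity with its tag; unknown types get no tag."""
    return [Tagged(s, e, FIRST_TAGS[t]) for s, e, t in raw if t in FIRST_TAGS]

def match_operator(specified: str, tags: List[Tagged]) -> Optional[Tagged]:
    """Return the first tagged string whose category equals the specified label."""
    for tag in sorted(tags):
        if tag.category == specified:
            return tag
    return None

def effective_area(text: str, tags: List[Tagged], specified: str,
                   window: int = 40) -> Optional[Tuple[int, int]]:
    """Assumed policy: the area runs from the match to `window` characters after it."""
    hit = match_operator(specified, tags)
    if hit is None:
        return None
    return (hit.start, min(len(text), hit.end + window))

text = "Staff at China Mobile told Li Ming that the fee was waived."
org, per = text.index("China Mobile"), text.index("Li Ming")
raw = [(org, org + len("China Mobile"), "ORG"),
       (per, per + len("Li Ming"), "PER"),
       (0, 5, "MISC")]                       # unknown type, dropped
tags = label_entities(raw)
area = effective_area(text, tags, "person_name", window=10)
```

A statistical operator whose specified label has no matching tag in the text, such as `place_name` here, yields no area.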

Optionally, the first processing unit 2 is further specifically configured to compare the information extraction rule with the effective extraction area, and extract a character string matching the information extraction rule. If the specified label carried by a statistical operator in the information extraction rule matches the category of a first tag/second tag, the statistical operator matches the character string marked with that first tag/second tag.

Optionally, the information extraction rule or the region determination rule further includes a regular expression, where the statistical operator and the regular expression have a sequential relationship and/or a logical operation relationship. The region determination rule or the information extraction rule further includes a business element concept/general concept, and the business element concept/general concept has a sequential relationship and/or a logical operation relationship with the statistical operator or the regular expression.
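One way to picture a rule in which statistical operators and regular expressions stand in a sequential relationship and a logical (here, OR) relationship is with small matcher combinators. The names `stat`, `regex`, `seq`, `either`, and `search` are hypothetical helpers for this sketch, and business element concepts and general concepts are not modeled.

```python
import re
from typing import Callable, List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label) produced by a statistical model
# A matcher maps (text, tags, pos) to the set of end positions reachable from pos.
Matcher = Callable[[str, List[Span], int], Set[int]]

def regex(pattern: str) -> Matcher:
    compiled = re.compile(pattern)
    def run(text: str, tags: List[Span], pos: int) -> Set[int]:
        m = compiled.match(text, pos)
        return {m.end()} if m else set()
    return run

def stat(label: str) -> Matcher:
    """Statistical operator: matches a tagged string with this label starting at pos."""
    def run(text: str, tags: List[Span], pos: int) -> Set[int]:
        return {e for s, e, lab in tags if s == pos and lab == label}
    return run

def seq(*parts: Matcher) -> Matcher:
    """Sequential relationship: each part must match where the previous one ended."""
    def run(text: str, tags: List[Span], pos: int) -> Set[int]:
        positions = {pos}
        for part in parts:
            positions = {end for p in positions for end in part(text, tags, p)}
        return positions
    return run

def either(*options: Matcher) -> Matcher:
    """Logical OR relationship: any one of the options may match."""
    def run(text: str, tags: List[Span], pos: int) -> Set[int]:
        return set().union(*(opt(text, tags, pos) for opt in options))
    return run

def search(rule: Matcher, text: str, tags: List[Span]) -> Optional[Tuple[int, int]]:
    """Return the leftmost (start, end) at which the rule matches, if any."""
    for pos in range(len(text) + 1):
        ends = rule(text, tags, pos)
        if ends:
            return pos, max(ends)
    return None

text = "Customer Li Ming from Beijing called."
p, b = text.index("Li Ming"), text.index("Beijing")
tags = [(p, p + len("Li Ming"), "PERSON"), (b, b + len("Beijing"), "PLACE")]
rule = seq(stat("PERSON"), regex(r"\s+from\s+"), either(stat("PLACE"), regex(r"\d{6}")))
hit = search(rule, text, tags)
```

Here `hit` spans the string "Li Ming from Beijing": a tagged person name, then a regular expression, then either a tagged place name or a six-digit code.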

Referring to fig. 9, in a second implementation manner, the apparatus includes:

a second acquisition unit 3, configured to acquire a text of information to be extracted and an extraction expression, where the extraction expression includes a region determination rule and an information extraction rule, the information extraction rule includes a statistical operator, and the statistical operator represents a statistical model used for identifying a named entity and/or a dependency component in the text;

a second processing unit 4, configured to identify the named entities and/or the dependency components in the text by using the statistical model, and mark corresponding identification tags for the identified named entities and/or dependency components respectively; determine an effective extraction area in the text by using the region determination rule; and extract the character string matched with the information extraction rule from the effective extraction area by using the identification tags.
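In this second implementation the statistical operator sits in the information extraction rule rather than in the region determination rule. A minimal sketch, assuming the region determination rule is a plain regular expression, the information extraction rule is a single operator naming one label, and a place-name lexicon stands in for the statistical model:

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

def toy_model(text: str) -> List[Span]:
    """Stand-in for a statistical model: tags two fixed place names."""
    return [(m.start(), m.end(), "PLACE")
            for m in re.finditer(r"Beijing|Shanghai", text)]

def extract_with_tags(text: str, region_pattern: str, wanted_label: str) -> List[str]:
    tags = toy_model(text)                     # identify and tag the text
    region = re.search(region_pattern, text)   # area found without using the tags
    if region is None:
        return []
    lo, hi = region.span()
    # Extraction uses the tags: keep tagged strings of the wanted label inside the area.
    return [text[s:e] for s, e, lab in tags
            if lab == wanted_label and lo <= s and e <= hi]

text = ("Shipped from Shanghai last week. "
        "Delivery address: Beijing, Haidian District. Thanks.")
places = extract_with_tags(text, r"Delivery address:[^.]*\.", "PLACE")
```

The place name in the first sentence is tagged but lies outside the effective extraction area, so it is not returned.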

For details of the second processing unit 4, reference may be made to the first implementation manner, and they are not described herein again. The information extraction apparatus corresponds to the information extraction methods in the first and second embodiments and has the corresponding advantages of those methods, which are not described herein again.

The same and similar parts in the various embodiments in this specification may be referred to each other. The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims

1. An information extraction method, comprising:

acquiring a text of information to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the region determination rule comprises a statistical operator, and the statistical operator represents a statistical model for identifying a named entity and/or a dependency component in the text; the region determination rule further comprises a regular expression, wherein the statistical operator and the regular expression have a sequential relation and/or a logical operation relation;

2. The method according to claim 1, wherein the information extraction rule comprises a statistical operator;

3. The method of claim 2, wherein the statistical model includes a first model for identifying named entities and a second model for identifying dependency components, the identification tag including a first tag and a second tag;

identifying named entities in the text by using a first model, and marking a first label for each identified named entity;

identifying dependency components in the effective extraction area by using a second model, and marking a second label for each identified dependency component;

or, identifying the dependent components in the text by using a second model, and marking a second label for each identified dependent component;

and identifying the named entities in the effective extraction area by using a first model, and marking a first label for each identified named entity.

4. The method of claim 3, wherein the categories of the first tag include a person name tag, a place name tag, and an organization tag, and the categories of the second tag include a core component tag, a dependency word tag, an agent component tag, and a patient component tag;

the step of tagging each identified named entity with a first tag, comprising:

5. The method according to claim 4, wherein the step of extracting the character string matching the information extraction rule from the effective extraction area using the identification tag includes:

6. The method according to claim 1, wherein the region determination rule or the information extraction rule further comprises a business element concept/general concept, and the business element concept/general concept has a sequential relation and/or a logical operation relation with the statistical operator or the regular expression.

7. An information extraction apparatus, characterized by comprising:

a first acquisition unit, configured to acquire a text of information to be extracted and an extraction expression, wherein the extraction expression comprises a region determination rule and an information extraction rule, the region determination rule comprises a statistical operator, and the statistical operator represents a statistical model used for identifying named entities and/or dependency components in the text; the region determination rule further comprises a regular expression, wherein the statistical operator and the regular expression have a sequential relation and/or a logical operation relation;