CN104636466B

CN104636466B - Entity attribute extraction method and system for open webpage

Info

Publication number: CN104636466B
Application number: CN201510071993.6A
Authority: CN
Inventors: 程学旗; 贾岩涛; 赵泽亚; 王元卓; 靳小龙; 熊锦华; 李曼玲; 林海伦; 许洪波
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2015-02-11
Filing date: 2015-02-11
Publication date: 2020-07-31
Anticipated expiration: 2035-02-11
Also published as: CN104636466A

Abstract

The invention provides an entity attribute extraction method and system for open webpages. The method comprises the following steps: extracting the texts of open webpages and obtaining a candidate text set for a target entity from those texts; and selecting either a rule-based mode or a statistics-based mode to extract the value of the target entity attribute from the candidate text set, according to the frequency with which the target entity attribute appears in a training text set. The method and system improve the accuracy and recall of entity attribute extraction from open webpages, do not depend on webpage structure, and adapt to changes in the types of open webpages.

Description

Entity attribute extraction method and system for open webpage

Technical Field

The invention relates to the technical field of data mining, in particular to an entity attribute extraction method and system for an open webpage.

Background

An open webpage is an unstructured Internet webpage whose data source is not fixed and which contains various kinds of network data, such as blogs, forums, news, chat records and e-mails. The nature, amount and position of the information it contains are not fixed, so its content is unpredictable. With the development of network technology, and especially the rapid development of Internet and Intranet technology, the number of open webpages has grown rapidly, and their flexible structure makes text understanding difficult:

1. the text structure is not fixed, and no specific context grammar exists;

2. the scope of the keywords is not fixed, and the related subject fields are various;

3. the text length is not fixed, and the difference of the context information quantity is large;

4. the data source is not fixed, and the language phenomenon is complex.

An entity is a thing that exists objectively and can be distinguished from other things; it may be a concrete object or an abstract event. An entity attribute is a property of an entity. Entity attribute extraction is a key technology for text understanding: by gathering the attributes found in different information sources around an entity, it reflects the entity from different angles and thereby completes the knowledge about the entity. It plays an important role in research such as information extraction, event tracking and name disambiguation.

Given these characteristics of open webpages, traditional entity attribute extraction methods have the following limitations:

first, the text structure of an open webpage is not fixed, and entities and their descriptions follow no fixed pattern; most of them appear in free text and are difficult to extract and analyze;

second, in traditional rule-oriented attribute extraction methods, the rules are rigidly defined, depend excessively on the context grammar, and are matched inefficiently;

third, the data sources of open webpages are not fixed and the language phenomena are complex, so common rules can hardly cover them, and traditional rule-based attribute extraction does not support nested matching of rules;

fourth, in traditional statistics-based entity attribute extraction methods, the preparation of training data depends too heavily on manual work and is inefficient, and accuracy and recall are low;

fifth, traditional attribute extraction is mostly confined to a particular field or subject; such a system lacks generally applicable association features and cannot be directly ported or extended to other fields or subjects.

Disclosure of Invention

To solve the above problem, according to an embodiment of the present invention, an open web page-oriented entity attribute extraction method is provided, including:

step 1), extracting texts of open webpages, and obtaining a candidate text set of a target entity from the texts;

step 2), selecting a rule-based mode or a statistics-based mode to extract the value of the target entity attribute from the candidate text set according to the frequency of the target entity attribute in the training text set.

In the above method, step 1) comprises:

step 11), extracting an unstructured text from the open webpage, and segmenting words of the unstructured text to obtain the correlation degree between the words and the unstructured text;

step 12), one or more initial query expansion words closest to the target entity in the context of the target entity are obtained, and one or more unstructured texts with the highest relevance with the target entity and the one or more initial query expansion words are used as a first text set;

step 13), selecting one or more secondary query expansion words with highest word frequency from the first text set, and taking one or more unstructured texts with highest relevance with a target entity and the one or more secondary query expansion words as a second text set;

step 14), taking the union of the first text set and the second text set as a candidate text set of a target entity.

In the above method, the relevance of a plurality of words to an unstructured text is the sum of the relevance of each of those words to that unstructured text.

In the above method, step 2) includes: calculating the frequency of the target entity attribute in the training text set; if the frequency exceeds a preset threshold, extracting the value of the target entity attribute according to a constructed statistical model; otherwise, extracting the value of the target entity attribute according to a constructed stacked finite state automaton; wherein the training text set is used for training the statistical model.

In the above method, the stacked finite state automaton is constructed according to the following steps:

step a), carrying out entity recognition in the candidate text set and generating a concept file; wherein the concept file comprises: basic concepts, each indicating an entity type together with the entities of that type identified from the candidate text set; regular expressions indicating variables to be extracted; and flag words indicating a relationship between an entity and an attribute;

step b), generating a rule file comprising the concept file and association rules; wherein an association rule is a single rule or a rule nesting a plurality of sub-rules, and indicates the relationships among the basic concepts, regular expressions and flag words in the concept file;

step c), constructing the stacked finite state automaton according to the association rules in the rule file; wherein the initial states of the stacked finite state automaton are the basic concepts, regular expressions and flag words, and the other states are the association rules and the sub-rules in the association rules.

In the above method, extracting the value of the target entity attribute according to the constructed stacked finite state automaton comprises the following steps:

matching the candidate text set against the stacked finite state automaton starting from the initial states, and establishing an inverted index over the content in the candidate text set matched by each state;

after matching is finished, obtaining the value of the target entity attribute from the established inverted index.

In the above method, the statistical model is constructed according to the following steps:

step A), obtaining training entities and corresponding training attributes from an online encyclopedia;

step B), obtaining a training text set of the training entity from a training open webpage;

step C), extracting features from the training text set, and back-labeling the features with the training attributes to obtain training data for each attribute;

step D), generating a statistical model corresponding to each attribute according to the training data.
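
As an illustration of the back-labeling in step C), the following minimal Python sketch tags the tokens of a training sentence that coincide with an attribute value already known from the online encyclopedia. The function name `back_label`, the tag names "VALUE" and "O", and the exact-match criterion are assumptions made for illustration; the source does not specify them.

```python
def back_label(tokens, value_tokens, label="VALUE"):
    """Tag tokens that lie inside an occurrence of the known attribute value.

    tokens:       segmented words of one training sentence
    value_tokens: segmented words of the attribute value from the encyclopedia
    Returns one tag per token: `label` inside a match, "O" elsewhere.
    """
    tags = ["O"] * len(tokens)
    width = len(value_tokens)
    if width == 0:
        return tags
    # Slide a window over the sentence and mark every exact occurrence.
    for start in range(len(tokens) - width + 1):
        if tokens[start:start + width] == value_tokens:
            tags[start:start + width] = [label] * width
    return tags
```

Sentences in which the value is found yield positive examples; sentences without a match are tagged "O" throughout and can serve as negative examples.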

In the above method, step B) comprises:

step B1), extracting an unstructured text from the training open webpage, and segmenting the unstructured text to obtain the correlation degree between words and the unstructured text;

step B2), obtaining n initial query expansion words nearest to the training entity according to the context information of the training entity in the training open webpage, and taking K unstructured texts with the highest correlation degree with the training entity and the initial query expansion words as a third text set; wherein n and K are positive integers;

step B3), m secondary query expansion words with the highest word frequency are selected from the third text set, and L unstructured texts with the highest correlation degree with the training entities and the secondary query expansion words are used as a fourth text set, wherein m and L are positive integers;

step B4), taking the union of the third text set and the fourth text set as a training text set.

In the above method, step C) further comprises: removing noise from the training data, and controlling the proportion of positive examples to negative examples in the training data.

In the above method, the features include words, dependency relationships between words, word frequencies, and the parts of speech of words.

In the above method, extracting the value of the target entity attribute according to the constructed statistical model includes:

extracting features from the candidate text set in the same manner as when constructing the statistical model;

and inputting the extracted features into a statistical model corresponding to the target entity attribute to obtain the value of the target entity attribute.

The method further comprises the following steps:

step 3), correcting the extracted value of the target entity attribute according to the type, the part of speech or the value range of the target entity attribute.

According to an embodiment of the present invention, there is also provided an open web page-oriented entity attribute extraction system, including:

the webpage preprocessing module is used for extracting the text of the open webpage;

the query expansion module is used for obtaining a candidate text set of the target entity from the extracted text;

and the attribute extraction module is used for selecting a rule-based mode or a statistical-based mode to extract the value of the target entity attribute from the candidate text set according to the frequency of the target entity attribute appearing in the training text set.

The invention has the following beneficial effects:

1. an entity attribute extraction method based on a stacked finite state automaton is provided, which enables extraction with complex nested rules;

2. during extraction based on the stacked finite state automaton, an inverted index is established over the content matched by each state of the automaton, which greatly improves rule matching efficiency;

3. a grammar-independent language for defining concepts and rules is formulated, so that entity attribute extraction is decoupled from the contextual language environment, declarative information extraction is realized, and system compatibility is improved;

4. a set of sentence-level text features is provided for CRF (conditional random field) model training, which can improve the machine learning effect in attribute extraction;

5. a method is provided for automatically generating CRF training data from the existing attribute information in the attribute boxes (Infobox) of an online encyclopedia, and the portions requiring manual verification are identified from the back-labeling results, which improves the efficiency and accuracy of training data preparation;

6. an iterative query expansion method is provided, which has been verified to improve the accuracy and recall of entity attribute extraction from open webpages;

7. entity attribute extraction from open webpages is realized by adaptively adopting a rule-based or statistics-based extraction method according to the occurrence frequency of the attribute.

Drawings

Embodiments of the invention are further described below with reference to the accompanying drawings, in which:

FIG. 1 is a flowchart of an open web page-oriented entity attribute extraction method according to an embodiment of the present invention;

FIG. 2 is a flow diagram of an iterative query expansion method according to one embodiment of the present invention;

FIG. 3 is a flow diagram of a method of adaptive entity attribute extraction according to one embodiment of the present invention;

FIG. 4 is a flow diagram of a method for constructing a stacked finite state automaton and for attribute extraction based on association rules of the stacked finite state automaton, according to an embodiment of the invention;

FIG. 5 is a schematic diagram of a stacked finite state automaton according to one embodiment of the invention;

FIG. 6 is a schematic diagram of an initial inverted index, according to one embodiment of the invention;

FIG. 7 is a diagram of a stacked finite state automaton matching a set of candidate texts, according to one embodiment of the invention;

FIG. 8 is a schematic diagram of an inverted index upon completion of a match, in accordance with one embodiment of the present invention;

FIG. 9 is a flow diagram of a method for constructing a conditional random field model and extracting attributes based on the conditional random field model according to one embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

According to one embodiment of the invention, an entity attribute extraction method for open webpages is provided.

In summary, the method comprises: extracting texts of the open web pages, and obtaining a candidate text set of a target entity from the texts; and selecting a rule-based mode or a statistical-based mode to extract the value of the target entity attribute from the candidate text set according to the frequency of the target entity attribute in the training text set.

Before the method is described, the entity attribute, the rule and the statistical model are first explained. An entity attribute comprises a target entity, an attribute name and an attribute value. A rule comprises a rule type, a target name with parameters, and a rule body. The text sources of the features used by the statistical model comprise the text before the attribute name, the text between the attribute name and the attribute value, and the text after the attribute value.

The steps of the entity attribute extraction method for open webpages will now be described in detail with reference to FIG. 1. It should be noted that the steps described in the specification are not all mandatory; one or more steps may be omitted or replaced according to the actual situation, and the order of the steps may be adjusted.

Step S101: open web page preprocessing

According to one embodiment of the invention, the preprocessing process of the open webpage comprises the following steps:

1. Acquire the set of open webpages to be processed, and extract the webpage contents to obtain the unstructured texts to be processed.

2. Segment the unstructured texts into words, calculate the relevance between each word and each unstructured text to obtain, for each word, the set of unstructured texts with the highest relevance (or matching degree), and establish an inverted index from this information.

In one embodiment, the degree of correlation between words and unstructured text is calculated based on word frequency and other characteristics. For example, the TF-IDF method can be used to obtain the relevance of a word to all unstructured texts, and then the k (k is a positive integer) unstructured texts with the highest relevance are used as the set of unstructured texts with the highest relevance of the word.
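
The preprocessing above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the texts are assumed to be segmented already, relevance is plain TF-IDF with a smoothed IDF term, and the names `build_inverted_index` and `k` are chosen here for illustration.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(texts, k=2):
    """Map each word to its k most relevant texts, scored by TF-IDF.

    texts: list of token lists (word segmentation is assumed done upstream).
    Returns word -> list of (score, text_id), best first.
    """
    n_docs = len(texts)
    term_freqs = [Counter(tokens) for tokens in texts]
    # Document frequency: in how many texts each word occurs.
    doc_freq = Counter(word for tf in term_freqs for word in tf)

    scores = defaultdict(list)
    for text_id, tf in enumerate(term_freqs):
        total = sum(tf.values())
        for word, count in tf.items():
            # Assumed smoothing (+1) keeps the weight positive for common words.
            idf = math.log(n_docs / doc_freq[word]) + 1.0
            scores[word].append((count / total * idf, text_id))

    # Keep only the k highest-scoring texts per word (ties broken by text id).
    return {
        word: sorted(postings, key=lambda p: (-p[0], p[1]))[:k]
        for word, postings in scores.items()
    }
```

Each entry of the returned dictionary plays the role of one term of the inverted index together with its k highest-relevance texts.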

Step S102: obtaining candidate text set through iterative query expansion

Using the inverted index established in step S101, a candidate text set is generated by two rounds of query expansion that combine the context information and the word frequency information of the target entity. FIG. 2 depicts one embodiment of the iterative query expansion method, comprising the following steps:

Step S201: according to the context information of the target entity E, acquire the n (n is a positive integer) entities (words) closest to E in the context, which are called query expansion words.

In one embodiment, n = 1 is selected, i.e., the words E1 and E2 immediately before and after the target entity E are used as query expansion words.

Step S202: initial query expansion.

The target entity and the query expansion words are queried in the inverted index established in step S101 to obtain a text set U₁ of the unstructured texts with the highest relevance to the target entity and the query expansion words.

In one embodiment, the sum of the relevance of the target entity and of each query expansion word to a given unstructured text is taken as the relevance of the target entity and query expansion words to that text, and the texts are ranked by this sum to obtain the text set U₁ (e.g., including 50 texts). In another embodiment, the highest-relevance unstructured text sets of the target entity and of the query expansion words are found separately, and their intersection is taken as the text set U₁. Experiments show that the initial query expansion improves the accuracy of entity attribute extraction.

Step S203: select from U₁ the m (m is a positive integer) words with the highest word frequency.

In one embodiment, m = 2, i.e., the two entities with the highest word frequency in U₁, E3 and E4, are selected for the second query expansion.

Step S204: secondary query expansion.

The m words with the highest word frequency and the target entity E are queried in the inverted index again to obtain a text set U₂ with the highest relevance, for example using the method of step S202. Experiments show that this step effectively improves the recall and accuracy of entity attribute extraction.

Step S205: take the union of the results of the two query expansions as the candidate text set for entity attribute extraction of the target entity E (referred to as the candidate text set U for short).
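
Steps S201 to S205 can be sketched in Python as follows, using the summed-relevance embodiment of step S202. The data layout (`relevance` as a mapping word -> {text id: score}), the function names, and the choice to skip the target entity and the initial expansion words when counting word frequencies in U₁ are assumptions made for illustration.

```python
from collections import Counter

def top_texts(relevance, query_words, limit):
    """Rank texts by the summed relevance of all query words; keep `limit`."""
    totals = Counter()
    for word in query_words:
        for text_id, score in relevance.get(word, {}).items():
            totals[text_id] += score
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:limit]

def candidate_text_set(texts, relevance, entity, context_words, m=2, limit=50):
    """Return the ids of the candidate texts U = U1 ∪ U2 for `entity`."""
    # S202: initial expansion with the entity and its nearest context words.
    u1 = top_texts(relevance, [entity, *context_words], limit)
    # S203: the m most frequent words in U1.
    # Assumption: the entity and the initial expansion words are skipped.
    skip = {entity, *context_words}
    freq = Counter(w for t in u1 for w in texts[t] if w not in skip)
    secondary = [w for w, _ in freq.most_common(m)]
    # S204: secondary expansion, ranked the same way as in S202.
    u2 = top_texts(relevance, [entity, *secondary], limit)
    # S205: union of both result sets.
    return sorted(set(u1) | set(u2))
```

In this sketch the second round can bring in texts that mention the entity together with a frequent co-occurring word even when the original context words are absent, which is how the union widens coverage.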

Step S103: adaptive entity attribute extraction

In summary, the adaptive entity attribute extraction process adaptively selects an entity attribute extraction method according to the frequency with which the target entity attribute (or target attribute) appears in the training text set. The training text set is the text set used for training a statistical model (the model is used by the statistics-based entity attribute extraction method and is described below), and the frequency of the target attribute in the training text set can be obtained from the frequency of the flag words, described below, in the training text set. If the frequency of occurrence is higher than a predetermined threshold, the statistics-based entity attribute extraction method is employed; otherwise, the rule-based entity attribute extraction method is employed. The reason is that, for entity attributes with a lower occurrence frequency, the rule-based method offers better accuracy and execution efficiency, whereas for attributes with a higher occurrence frequency, the statistics-based method extracts more comprehensively.

The rule-based entity attribute extraction method realizes rule nesting by constructing a stacked finite state automaton, and establishes an inverted index over the text content matched by each state (also called a node) of the stacked finite state automaton, so that complex text patterns are matched quickly and entity attribute values are obtained. The statistics-based entity attribute extraction method performs supervised machine learning according to the conditional random field principle, selecting text features and training a statistical model (such as a conditional random field model) to extract entity attributes. As shown in FIG. 3, the adaptive entity attribute extraction process may include the following sub-steps:

Step S301: construct a stacked finite state automaton.

Entity recognition is carried out on the candidate text set U, a grammar-independent declarative language specification is formulated, a concept set and an association rule set are defined, and a stacked finite state automaton is constructed according to the nesting dependency relationships of the rules.

Step S302: train a statistical model.

Text features are selected, training data is generated, and a statistical model, such as the CRF model M_CRF, is obtained by training.

Step S303: calculate the frequency of the target entity attribute in the training text set, and judge whether the frequency exceeds a preset threshold.

Step S304: if the judgment result in step S303 is negative, perform attribute extraction using association rules based on the stacked finite state automaton (i.e., the rule-based entity attribute extraction method).

Step S305: if the judgment result in step S303 is affirmative, perform attribute extraction by machine learning based on the conditional random field (i.e., the statistics-based entity attribute extraction method). Sentence-level features are extracted from the candidate text set U to generate feature vectors, which are input into the statistical model generated in step S302 to extract the target attribute value.

It should be understood that the order of the above sub-steps may be changed; for example, the sub-step of training the statistical model may be performed at any time before, or simultaneously with, the construction of the stacked finite state automaton.
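
The selection made in steps S303 to S305 can be sketched in Python as follows. The sketch assumes that the frequency of an attribute is the raw number of flag-word occurrences in the training text set; the function name `choose_extractor` and the threshold values are hypothetical and not given in the source.

```python
def choose_extractor(training_texts, flag_words, threshold):
    """Return "statistical" or "rule" for one target attribute.

    training_texts: list of token lists (the training text set)
    flag_words:     flag words associated with the target attribute
    threshold:      preset frequency threshold (value is an assumption)
    """
    flags = set(flag_words)
    # Assumed frequency measure: total flag-word occurrences in all texts.
    frequency = sum(
        1 for tokens in training_texts for token in tokens if token in flags
    )
    # S305 when the threshold is exceeded, otherwise S304.
    return "statistical" if frequency > threshold else "rule"
```

A normalized frequency (occurrences per text or per token) would work equally well, provided the threshold is chosen on the same scale.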

The adaptive entity attribute extraction process has been described in general above. The following describes in detail, in turn, the process of constructing a stacked finite state automaton and performing attribute extraction based on it, and the process of training a statistical model (in particular the conditional random field model M_CRF) and performing attribute extraction based on it.

FIG. 4 depicts a method for constructing a stacked finite state automaton and extracting attributes based on the association rules of the stacked finite state automaton. The method comprises the following sub-steps:

Step S401: entity recognition.

Named entity recognition is carried out on the candidate text set U to obtain an entity set, and the types of the entities, such as person, place, organization and the like, are determined.

Step S402, generating a concept file.

The concept file is the collection of all concepts, including CONCEPT basic concepts, REGEX regular expressions, and CONCEPT flag words. They are defined declaratively in a language that is independent of the contextual grammar:

(1) A CONCEPT basic concept is a type variable to be extracted, such as person, place or organization (each type is defined as one CONCEPT object). The definition format is CONCEPT:[concept name]:[instance values], where the concept name is the type variable and the instance values are the set of entities of that type, so that a definition can be regarded as a pair <type variable, value range of the type variable>. The basic concepts are obtained from the entity set generated in step S401. For example, all entities of the "organization" type form a concept named ORG, whose value range consists of the instances of the organization type found in the text, such as Graduate Institute of Science and Technology University, Beijing University, Chinese Institute of Technology, etc.; the ORG concept is defined as shown in Table 1.

TABLE 1

(2) A REGEX regular expression is a regular expression for a variable to be extracted, in the format REGEX:[concept name]:[regular expression content]. An example is given in Table 2.

TABLE 2

REGEX:DATE:([\d]{4}year){0,1}([\d]{1,2}month){0,1}([\d]{1,2}day){0,1}

Here DATE is the name of the regular expression, and ([\d]{4}year){0,1}([\d]{1,2}month){0,1}([\d]{1,2}day){0,1} is the regular expression that DATE refers to.
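
The DATE pattern can be tried out with Python's `re` module, as in the following minimal sketch. It assumes that the unit words "year", "month" and "day" in the translated pattern stand for the Chinese date-unit characters 年, 月 and 日; the function name `find_dates` is introduced here for illustration. Because every group is optional, the pattern also matches the empty string, so zero-length matches are discarded.

```python
import re

# Assumption: "year", "month", "day" in the translated pattern correspond to
# the Chinese date-unit characters 年, 月, 日.
DATE = re.compile(r"([\d]{4}年){0,1}([\d]{1,2}月){0,1}([\d]{1,2}日){0,1}")

def find_dates(text):
    """Return non-empty DATE matches as (start, end, matched_text) tuples."""
    # Every group is optional, so the pattern can match the empty string at
    # any position; those zero-length matches are dropped here.
    return [
        (m.start(), m.end(), m.group(0))
        for m in DATE.finditer(text)
        if m.group(0)
    ]
```

Each returned tuple records where the instance was found, which is the information an inverted index over the DATE state needs.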

(3) A CONCEPT flag word is a word related to the attribute to be extracted, i.e., a word marking the relationship between an entity and an attribute. Flag words are used in formulating association rules and have the format CONCEPT:[concept name]:[flag word values]. For example, to extract the "date of birth" attribute of a person, the required flag words can be as shown in Table 3.

TABLE 3

The above CONCEPT and REGEX definitions are merged into one overall concept set, i.e., the concept file is generated, as shown in Table 4.

TABLE 4

Step S403: generate a rule file.

An association rule, MCONCEPT_RULE, characterizes the relationships between concepts through Boolean logic constraints and context constraints on the concepts. Its format begins with MCONCEPT_RULE:[rule name]([variables to be output]): and is followed by a rule body built from the following constraints:

(1) AND: a string is matched if all clauses appear in it;

(2) OR: a string is matched if any one clause appears in it;

(3) SENT: a string is matched if all clauses appear in the same sentence;

(4) ORD: a string is matched if all clauses appear in the order defined by the rule;

(5) DIST_n: a string is matched if all clauses appear in it and the distance between adjacent clause instances does not exceed n words.
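
The five constraints can be sketched in Python as a single check over the positions at which each clause matched. The representation of a clause instance as a (sentence id, word position) pair, the function name `check_constraint`, and the reading of DIST_n as "some choice of one instance per clause whose neighbouring positions are at most n words apart" are assumptions made for illustration.

```python
from itertools import product

def check_constraint(op, clause_hits, n=None):
    """Evaluate one constraint over the matches of its clauses.

    clause_hits: one list per clause, in rule order, of
                 (sentence_id, word_position) pairs where the clause matched.
    op:          "AND", "OR", "SENT", "ORD" or "DIST" (with distance n).
    """
    if op == "OR":
        return any(clause_hits)          # at least one clause has an instance
    if not all(clause_hits):             # all other operators need every clause
        return False
    if op == "AND":
        return True
    # Try every way of picking one instance per clause.
    for combo in product(*clause_hits):
        sentences = [s for s, _ in combo]
        positions = [p for _, p in combo]
        if op == "SENT" and len(set(sentences)) == 1:
            return True
        if op == "ORD" and all(a < b for a, b in zip(positions, positions[1:])):
            return True
        if op == "DIST":
            ordered = sorted(positions)
            if all(b - a <= n for a, b in zip(ordered, ordered[1:])):
                return True
    return False
```

Enumerating all combinations is exponential in the number of clauses; it is acceptable for a sketch with a handful of clauses, whereas the method described here relies on the inverted index to keep matching efficient.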

The CONCEPT and REGEX definitions generated in step S402 are all concepts to be extracted, and a text matched by a concept is an instance of that concept. Table 5 shows a rule that extracts a person's "date of birth".

TABLE 5

MCONCEPT_RULE:NAME_BIRTHDAY(person,birthday):(DIST_20,

"_person{NAME}","BIRTH_OR","_birthday{DATE}")

Here NAME_BIRTHDAY is the rule name, and (person, birthday) indicates that two variables, person and birthday, are output, where person is an instance of the NAME concept and birthday is an instance of the DATE variable. The meaning of NAME_BIRTHDAY is: if instances of the NAME, BIRTH_OR and DATE concepts appear together and are no more than 20 words apart, the clause matching NAME (i.e., the instance of NAME) is output as person and the clause matching DATE is output as birthday.
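
To make the structure of such a rule concrete, the following Python sketch parses a flat rule line like the one in Table 5 into its name, output variables, constraint and clauses. The syntax is inferred only from that single example, nested sub-rules are not handled, and the names `Rule` and `parse_rule` are introduced here for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    outputs: list      # variables to be output, e.g. ["person", "birthday"]
    constraint: str    # e.g. "DIST_20"
    clauses: list      # (output variable or None, concept name) pairs

# Head: MCONCEPT_RULE:<name>(<outputs>):(<constraint>,<clauses>)
HEAD = re.compile(r"MCONCEPT_RULE:(\w+)\(([^)]*)\):\((\w+),(.*)\)$")
# Clause: "_var{CONCEPT}" binds an output variable; "CONCEPT" is a bare clause.
CLAUSE = re.compile(r'"(?:_(\w+)\{(\w+)\}|(\w+))"')

def parse_rule(line):
    """Parse one flat (non-nested) MCONCEPT_RULE definition."""
    compact = "".join(line.split())    # drop spaces and line breaks
    m = HEAD.match(compact)
    if m is None:
        raise ValueError("not a flat MCONCEPT_RULE definition")
    name, outputs, constraint, body = m.groups()
    clauses = [
        (var or None, concept or bare)
        for var, concept, bare in CLAUSE.findall(body)
    ]
    return Rule(name, outputs.split(","), constraint, clauses)
```

Parsing the Table 5 rule this way gives the constraint DIST_20 and three clauses, two of which (NAME and DATE) are bound to the output variables person and birthday.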

As another example, Table 6 shows the NAME_COLLEGE rule, which extracts the "graduation time and college" of a person.

TABLE 6

The rule in Table 6 means the following. First, the sub-rule (ORD, "DEGREE_GET_OR", "DEGREE") must be matched successfully, i.e., there is a "DEGREE" clause that appears after "DEGREE_GET_OR". Second, once the sub-rule has been matched, if the sentence containing the matched clauses also contains instances of the NAME, DATE and ORG concepts, the clause matching NAME (i.e., the instance of NAME) is output as person, the clause matching DATE is output as graduatetime, and the clause matching ORGANIZATION is output as college.

The rule file consists of all MCONCEPT_RULE definitions and all concept sets; a line beginning with "#" is a comment. Table 7 shows the rule file for extracting the attributes "date of birth", "graduate school" and "contact address" of a person.

TABLE 7

Step S404: construct the stacked finite state automaton.

According to the nesting dependency relationships among the rules, the rules are converted into a group of mutually dependent finite state automata. Each concept is an initial state, and the stacked finite state automaton is generated level by level from the constraint relationships among the concepts and the nesting dependencies among the rules. The stacked finite state automaton is tree-shaped: the bottom layer consists of the initial states, and a state that can be transitioned to is regarded as a parent state. An initial state is a concept, a parent state is a rule or a sub-rule, and the transition function is the constraint condition of that rule or sub-rule. Transitioning upward step by step from the initial states to rule or sub-rule states, through the constraint conditions and the nesting relationships of the rules, forms the stacked finite state automaton shown in FIG. 5.

Step S405: match the candidate text set U against the initial states of the stacked finite state automaton, and establish an initial inverted index.

In this step, the candidate text set U is matched against the initial states, i.e., against all concepts, and an inverted index is established over the texts matched by the initial states. Each state serves as a term, the texts matched by the state form the postings list of that term, and each term has a pointer to its postings list, as shown in FIG. 6.
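
The initial inverted index of step S405 can be sketched in Python as a dictionary from state name to postings list. The posting layout (text id, start offset, end offset, matched string), the function name `initial_index`, and the use of literal string search for CONCEPT instances are assumptions made for illustration.

```python
import re
from collections import defaultdict

def initial_index(texts, concepts, regexes):
    """Build state -> postings list for all initial states.

    texts:    list of candidate texts (strings)
    concepts: concept name -> list of literal instance strings (CONCEPT)
    regexes:  concept name -> compiled pattern (REGEX)
    Each posting is (text_id, start, end, matched_string).
    """
    index = defaultdict(list)
    for text_id, text in enumerate(texts):
        for name, values in concepts.items():
            for value in values:
                # Literal search; re.escape protects special characters.
                for m in re.finditer(re.escape(value), text):
                    index[name].append((text_id, m.start(), m.end(), value))
        for name, pattern in regexes.items():
            for m in pattern.finditer(text):
                if m.group(0):  # ignore zero-length matches
                    index[name].append(
                        (text_id, m.start(), m.end(), m.group(0))
                    )
    return dict(index)
```

A state with no matches simply has no entry, so a later lookup for that state fails immediately without rescanning the texts.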

Step S406: perform state transitions according to the stacked finite state automaton to obtain entity attributes.

Taking the initial states as starting points, whether each state in the stacked finite state automaton can transition is judged from the bottom up. The state transition function is the constraint condition of the rule or sub-rule represented by the parent state, and whether a transition can be performed is determined by checking whether the other states required by the rule or sub-rule exist in the inverted index. If a transition can be performed, an inverted index is established over the text matched by the new state and added to the inverted index generated in step S405, and the upward judgment continues. If no transition can be performed, the current state is the most complex rule that has been matched successfully; upward matching terminates, the instances of the concepts contained in the rule are output according to the inverted index, and the attribute value represented by the rule is obtained. Throughout the matching process, an inverted index of the text content matched by each state is maintained dynamically.
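
The bottom-up transitions of step S406 can be sketched in Python as a loop that keeps firing rules whose required states are already in the inverted index. To stay short, the sketch reduces every constraint to co-occurrence in the same text and stores only text ids as postings; these simplifications, the function name `run_transitions`, and the sub-rule name `GOT_DEGREE` used in the usage note are assumptions made for illustration.

```python
def run_transitions(index, rules):
    """Fire rules bottom-up until no further state transition is possible.

    index: state -> set of text ids in which the state matched
    rules: parent state -> list of child states it requires (concepts or
           sub-rules); the constraint is simplified to "all children matched
           in the same text"
    Returns the index extended with every rule state that was reached.
    """
    index = {state: set(ids) for state, ids in index.items()}
    changed = True
    while changed:
        changed = False
        for parent, children in rules.items():
            if parent in index:
                continue                       # already transitioned
            # A transition needs every required state to be in the index.
            if children and all(child in index for child in children):
                texts = set.intersection(*(index[c] for c in children))
                if texts:
                    index[parent] = texts      # index the new state's matches
                    changed = True
    return index
```

Because each newly reached state is written back into the index, a nested rule such as one requiring a hypothetical sub-rule state `GOT_DEGREE` fires on a later pass once that sub-rule has fired, regardless of the order in which the rules are listed.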

For example, the rule file obtained in step S403 is input to obtain a stacked finite state automaton, which is matched against the candidate text set U to obtain attribute values, as shown in Table 8.

TABLE 8

In this example, the three rules NAME_COLLEGE1, NAME_COLLEGE2 and NAME_BIRTHDAY above are input. To match the text segment, a stacked finite state automaton is first constructed according to the dependency relationships of the rules, as shown in FIG. 5.

According to the constructed stacked finite state automaton, the initial states are matched against the candidate text, and an inverted index is established over the content matched by NAME, DATE, BIRTH_OR, GRADUATE, ORG, DEGREE_GET_OR and DEGREE, as shown in FIG. 6.

State transitions are then performed according to the cascade finite state automaton (i.e., step S406). From the matched initial state NAME, the states that can be transitioned to are NAME_BIRTHDAY, NAME_COLLEGE1 and NAME_COLLEGE2, and for each of them it is checked in turn whether the remaining states required for the transition are satisfied. For example, NAME_BIRTHDAY requires DATE and BIRTH_OR to occur within an interval of 20. If this condition is met, NAME_BIRTHDAY is a terminating state and matching stops: the instance "Li Guojie" of the NAME variable is output as the person, and the instance "May 1943" of the DATE variable is output as the birthday. An inverted index is also established for NAME_BIRTHDAY so that it can be looked up when it appears as part of a nested rule. The NAME_COLLEGE1 rule further requires the states of its sub-rule; if these are not found in the inverted index, the transition cannot be made and matching of that rule terminates. Matching then continues from NAME_COLLEGE2 in the same way: the remaining states it requires are checked in turn against the inverted index, and if the conditions are met an inverted index is established for the matched state and the transition continues upward.

The inverted indexes of the text matched by each state are dynamically maintained during matching, and the finally generated inverted index is shown in Fig. 8.

In these sub-steps, a unified cascade finite state automaton is generated for all rules and traversed from bottom to top, which avoids repeatedly matching the same sub-rule; during matching, the inverted index is built from bottom to top, which increases the matching speed.

With reference to FIG. 9, a method of constructing a statistical model (specifically, a conditional random field model) and extracting attributes based on the statistical model is described.

In summary, the method tests the effect of different text features on the conditional random field model, selects the best text features (i.e., words, dependency relationships, parts of speech and word frequency), and sets the parameters of the template file. The entity-attribute relations present in the attribute boxes of online encyclopedia pages are extracted and back-labeled to automatically generate training data; a conditional random field model M_CRF is trained for each attribute, and attributes are then extracted from the obtained candidate text set U.

In detail, the method comprises the following sub-steps:

Step S501, obtaining training entities and training attributes.

The contents of the attribute boxes (Infoboxes) of an online encyclopedia are obtained and a set of known <entity-attribute> pairs is generated, thereby obtaining the training entities and training attributes.

An Infobox is a tabular area on an entry page of the online encyclopedia that describes the attributes of the entry in structured form.

Step S502, performing iterative query expansion on the training entity in the open web page set for training according to the method described in step S102, so as to obtain a text set (or training text set) for training.

Steps S503 to S507 are steps of extracting training data features. Wherein:

Step S503, segmenting the texts in the training text set into sentences, model training being performed with the sentence as the unit;

Step S504, performing word segmentation on each sentence to obtain the words it contains;

Step S505, tagging the part of speech of each word, such as noun, pronoun, verb, adjective and the like;

Step S506, performing dependency analysis on each word to obtain the governing relationships among the words; for example, this can be done using a dependency tree;

Step S507, calculating the word frequency of each word, i.e., the number of times the word appears in the text.

The words, dependency relationships, parts of speech and word frequencies of each sentence are extracted as the features for machine learning.
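The assembly of these four features into per-token rows can be sketched as below. The sketch assumes that sentence splitting, word segmentation, part-of-speech tagging and dependency analysis (steps S503 to S506) have already been done by external tools, so its input is a hypothetical list of `(word, pos, dependency_relation)` tuples; only the word frequency of step S507 is computed here.

```python
from collections import Counter

def build_feature_rows(sentences):
    """Attach a corpus-wide word frequency to each (word, pos, dep) token."""
    freq = Counter(word for sent in sentences for word, _, _ in sent)
    return [
        [(word, pos, dep, freq[word]) for word, pos, dep in sent]
        for sent in sentences
    ]

# Hypothetical pre-analyzed sentences (tags are illustrative only).
sentences = [
    [("Zhang", "NNP", "nsubj"), ("works", "VBZ", "root"),
     ("at", "IN", "case"), ("ICT", "NNP", "obl")],
    [("ICT", "NNP", "nsubj"), ("hosts", "VBZ", "root"),
     ("seminars", "NNS", "obj")],
]
rows = build_feature_rows(sentences)
```

Each row then carries the four feature columns named in the text, one token per row and one list of rows per sentence.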

Step S508, generating training data.

Each known <entity-attribute> pair generated in step S501 is back-labeled onto its feature data, thereby generating training data for each attribute. Here, "back-labeling" means marking the feature data of each sentence as a positive example or a negative example of the attribute; the back-labeled feature data serves as the training data.

Generating the training data includes generating positive examples and generating negative examples. The specific process is as follows: if the work unit of the entity Li Guojie is known to be the "Institute of Computing Technology of the Chinese Academy of Sciences", the feature data of every sentence in the candidate text set of Li Guojie that contains this name is back-labeled as a positive example; named entity recognition is performed on the remaining sentences, and the feature data of those sentences that contain an organization is back-labeled as a negative example.
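The back-labeling procedure can be sketched as follows. This is a minimal sketch operating on raw sentences instead of feature data; the function `mentions_org` is a hypothetical stand-in for a real named entity recognizer, and the organization names are invented for illustration.

```python
def back_label(sentences, known_value, has_same_type_entity):
    """Label sentences 1 (positive) or 0 (negative); other sentences are skipped."""
    labeled = []
    for sent in sentences:
        if known_value in sent:
            labeled.append((sent, 1))   # contains the known attribute value
        elif has_same_type_entity(sent):
            labeled.append((sent, 0))   # contains a different entity of the same type
    return labeled

# Hypothetical stand-in for named entity recognition of organizations.
ORGS = {"Example University", "Example Institute"}

def mentions_org(sentence):
    return any(org in sentence for org in ORGS)

sentences = [
    "She works at Example Institute.",
    "She visited Example University.",
    "She likes tea.",
]
labeled = back_label(sentences, "Example Institute", mentions_org)
```

Sentences that contain neither the known value nor any organization contribute no training example in this sketch.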

Step S509, manual correction.

Manually correcting the training data generated in step S508 includes, but is not limited to:

(1) Removing impurities. For example, if the work unit of an entity is known to be the Institute of Computing Technology of the Chinese Academy of Sciences, every sentence containing that name is labeled as a positive example in the automatically generated training data. A sentence such as "works at the Institute of Computing Technology of the Chinese Academy of Sciences" is correctly labeled as a positive example, whereas a sentence that merely states where the institute is located is wrongly labeled as a positive example; such wrong labels need to be corrected and removed manually;

(2) Controlling the ratio of positive to negative examples. If there are too many positive examples, more impurities are introduced into the extraction results; if there are too many negative examples, the recall of the extraction results is too low. The ratio of positive to negative examples therefore needs to be controlled (e.g., 1:3).
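Controlling the ratio in item (2) could be done, for instance, by downsampling negative examples, as in the sketch below. The downsampling strategy, the function name `balance` and the fixed random seed are assumptions; the text only states that the ratio is controlled.

```python
import random

def balance(labeled, neg_per_pos=3, seed=0):
    """Keep all positives and at most `neg_per_pos` negatives per positive."""
    positives = [item for item in labeled if item[1] == 1]
    negatives = [item for item in labeled if item[1] == 0]
    keep = min(len(negatives), neg_per_pos * len(positives))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return positives + rng.sample(negatives, keep)

labeled = [(f"p{i}", 1) for i in range(2)] + [(f"n{i}", 0) for i in range(10)]
balanced = balance(labeled)
```

With 2 positives and 10 negatives, the sketch keeps 6 negatives, giving the 1:3 ratio mentioned above.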

Step S510, formulating a template file, in which the rows, columns and window size used for the conditional probabilities are determined according to context information.
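The text does not name the CRF toolkit or the template syntax. Purely as an illustration, the sketch below assumes CRF++-style unigram templates, in which `%x[row,col]` refers to the feature in column `col` of the token at relative row offset `row`; the helper name and the chosen window are hypothetical.

```python
def make_unigram_templates(num_columns, window):
    """One template line per (feature column, relative row offset) pair."""
    lines = []
    for col in range(num_columns):
        for offset in range(-window, window + 1):
            # Template IDs are numbered consecutively: U00, U01, ...
            lines.append(f"U{len(lines):02d}:%x[{offset},{col}]")
    return lines

# Four feature columns (word, part of speech, dependency, word frequency), window of 1.
templates = make_unigram_templates(num_columns=4, window=1)
```

A larger window adds more context per token at the cost of more template lines and, typically, more model parameters.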

Step S511, inputting the training data obtained after step S509, and generating a CRF model M_CRF for each attribute through supervised machine learning.

Step S512, extracting attribute values according to the generated CRF model.

Features are extracted from the candidate text set U obtained in step S102 in the same way as in sub-steps S503 to S507 above.

The M_CRF corresponding to the target attribute is selected, and the target attribute of the target entity is extracted from the extracted features using that M_CRF to obtain the attribute value.

Step S104: attribute correction

The extracted attribute values are corrected according to constraints on the attribute such as part of speech, type and range, and attribute values that do not meet the requirements are removed. For example, for attributes of a person such as child, spouse and parent, the value should itself be of type person; as another example, an attribute describing age is typically a number between 1 and 120.

In one embodiment, the correction rules for attributes include, but are not limited to:

1) Attribute type checking: whether an extracted attribute value is correct is judged from the type constraint of the attribute, and attribute values whose type does not match are rejected;

2) Value range checking: according to the part of speech (such as noun, numeral, date, etc.) and the data range of an attribute, attribute values outside that part of speech or range are removed.
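The two correction rules can be sketched as a simple filter. The constraint table, the type labels and the candidate triples `(attribute, value, value_type)` are hypothetical; in practice the value type would come from an upstream recognizer.

```python
def correct_attributes(candidates, constraints):
    """Drop candidates that violate the type or range constraint of their attribute."""
    kept = []
    for attr, value, value_type in candidates:
        rule = constraints.get(attr)
        if rule is not None:
            if value_type != rule["type"]:
                continue  # rule 1: type mismatch
            if "range" in rule:
                low, high = rule["range"]
                if not (low <= float(value) <= high):
                    continue  # rule 2: value outside the allowed range
        kept.append((attr, value))
    return kept

constraints = {
    "spouse": {"type": "PERSON"},
    "age": {"type": "NUMBER", "range": (1, 120)},
}
candidates = [
    ("spouse", "Li Si", "PERSON"),
    ("spouse", "Beijing", "LOCATION"),
    ("age", "45", "NUMBER"),
    ("age", "300", "NUMBER"),
]
corrected = correct_attributes(candidates, constraints)
```

Attributes without an entry in the constraint table pass through unchanged in this sketch.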

According to an embodiment of the present invention, an entity attribute extraction system facing an open web page is further provided, which includes a web page preprocessing module, a query expansion module, an attribute extraction module, and an attribute correction module.

The webpage preprocessing module is used for extracting texts of open webpages and establishing an inverted index; the query expansion module is used for acquiring a candidate text set of the target entity from the extracted text; the attribute extraction module is used for selecting a rule-based mode or a statistical-based mode to extract the value of the target entity attribute from the candidate text set according to the frequency of the target entity attribute appearing in the training text set; and the attribute correction module is used for correcting the extracted attribute value.
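The mode selection performed by the attribute extraction module can be sketched as follows. How the frequency of an attribute in the training text set is counted is not specified in this passage, so the sketch assumes, hypothetically, that each attribute has a cue phrase whose occurrences are counted; the threshold value is likewise illustrative.

```python
from collections import Counter

def choose_modes(training_texts, attribute_cues, threshold):
    """Pick the statistical mode for frequent attributes and the rule mode otherwise."""
    counts = Counter()
    for text in training_texts:
        for attr, cue in attribute_cues.items():
            counts[attr] += text.count(cue)
    return {
        attr: "statistical" if counts[attr] > threshold else "rule"
        for attr in attribute_cues
    }

training_texts = ["works at A", "works at B", "born in C"]
attribute_cues = {"employer": "works at", "birthplace": "born in"}
modes = choose_modes(training_texts, attribute_cues, threshold=1)
```

Only attributes whose frequency strictly exceeds the threshold are routed to the statistical model, mirroring the "exceeds a preset threshold" condition.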

In order to verify the effectiveness of the open-web-page-oriented entity attribute extraction method and system provided by the invention, the inventor uses Slot Filling evaluation data of TAC KBP in 2014 to carry out experiments, and experimental parameters are as follows:

The experimental data set comprises 100 entities (50 of the "person" type and 50 of the "organization" type) and a total of 41 attributes to be extracted (25 target attributes of the "person" type and 16 of the "organization" type). Of these, 30 attributes occur frequently and have sufficient training data, so the CRF method is used to train models for attribute extraction; the remaining 11 attributes occur infrequently and are extracted by the CFT rule-formulation method. The experimental data set contains 1001 attribute values in total.

During the experiments, the optimal parameter configuration was found. For query expansion, the expansion window is set to 1 and the top 50 texts are selected in each expansion. CRF training uses 4 text features: the words contained in the sentence, the part of speech of each word, the word frequency of each word, and the dependency relationships among the words. When generating the training data for the CRF model, the ratio of positive to negative examples is 1:3. Verification showed that this configuration achieves the highest recall and accuracy.

Through experiments, the following results are obtained:

A total of 412 results were extracted, of which 243 were hits, giving an accuracy of 58.98%. Among existing extraction technologies, the natural language processing group of Stanford University achieved the best performance, with an accuracy of 58.54%, using a machine learning method on the words relating entity pairs; the comprehensive multi-strategy search method of the RPI BLENDER team of Rensselaer Polytechnic Institute also performed well, with an accuracy of 47.80%.
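The reported accuracy follows directly from the two counts, as the short check below shows. It assumes that accuracy here means hits divided by the total number of extracted results.

```python
extracted = 412  # total extraction results
hits = 243       # extraction results that hit

accuracy = hits / extracted
print(f"{accuracy:.2%}")  # prints 58.98%
```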

It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat it as a whole, and the technical solutions of the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

The above description is only an exemplary embodiment of the present invention, and is not intended to limit the scope of the present invention. Any equivalent alterations, modifications and combinations can be made by those skilled in the art without departing from the spirit and principles of the invention.

Claims

1. An entity attribute extraction method facing to open web pages comprises the following steps:

step 2), selecting a rule-based mode or a statistical-based mode to extract the value of the target entity attribute from the candidate text set according to the frequency of the target entity attribute appearing in the training text set, wherein the method comprises the following steps:

calculating the frequency of the target entity attribute in the training text set, if the frequency exceeds a preset threshold value, extracting the value of the target entity attribute according to the constructed statistical model, otherwise, extracting the value of the target entity attribute in an inverted index mode from the initial state of the constructed cascade finite state automaton for the text matched with the candidate text set; the training text set is used for training the statistical model, and the statistical model is a conditional random field model.

2. The method of claim 1, wherein step 1) comprises:

3. The method of claim 2, wherein the relevance of a plurality of words to unstructured text is the sum of the relevance of each of the plurality of words to the unstructured text.

4. The method of claim 1, wherein the cascade finite state automaton is constructed according to the following steps:

5. The method of claim 4, wherein extracting values of the target entity attributes according to the constructed cascade finite state automaton comprises:

6. The method of claim 1, wherein the statistical model is constructed according to the following steps:

7. The method of claim 6, wherein step B) comprises:

8. The method of claim 6, wherein step C) further comprises:

removing impurities in the training data, and controlling the proportion of positive examples to negative examples in the training data.

9. The method of claim 6, wherein the characteristics include words, dependencies between words, word frequencies and parts of speech of words.

10. The method of any of claims 6-9, wherein extracting values for the attributes of the target entity according to the constructed statistical model comprises:

11. The method according to any one of claims 1-3, further comprising:

12. An open web page-oriented entity attribute extraction system comprises:

the attribute extraction module is used for selecting a rule-based mode or a statistical-based mode to extract the value of the target entity attribute from the candidate text set according to the frequency of the target entity attribute in the training text set, wherein the rule-based mode or the statistical-based mode is used for calculating the frequency of the target entity attribute in the training text set, if the frequency exceeds a preset threshold value, the value of the target entity attribute is extracted according to a constructed statistical model, otherwise, the value of the target entity attribute is extracted in an inverted index mode from the initial state of the constructed cascade finite state automaton aiming at the text matched with the candidate text set; the training text set is used for training the statistical model, and the statistical model is a conditional random field model.