CN104636466B - Entity attribute extraction method and system for open webpage - Google Patents

Info

Publication number
CN104636466B
Authority
CN
China
Prior art keywords
training
text set
target entity
attribute
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510071993.6A
Other languages
Chinese (zh)
Other versions
CN104636466A (en)
Inventor
程学旗
贾岩涛
赵泽亚
王元卓
靳小龙
熊锦华
李曼玲
林海伦
许洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201510071993.6A
Publication of CN104636466A
Application granted
Publication of CN104636466B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an entity attribute extraction method and system for open web pages. The method comprises: extracting the texts of open web pages and obtaining a candidate text set for a target entity from those texts; and, according to the frequency with which the target entity attribute appears in a training text set, selecting either a rule-based or a statistics-based approach to extract the value of the target entity attribute from the candidate text set. The method and system can improve the accuracy and recall of entity attribute extraction from open web pages, do not depend on web page structure, and can adapt to changes in the types of open web pages.

Description

Entity attribute extraction method and system for open webpage
Technical Field
The invention relates to the technical field of data mining, in particular to an entity attribute extraction method and system for an open webpage.
Background
An open web page is an unstructured Internet web page whose data source is not fixed and which contains various kinds of network data, such as blogs, forums, news, chat records, and e-mails. The nature, amount, and position of the information it contains are not fixed, so its content is unpredictable. With the development of network technology, especially the rapid development of Internet and intranet technology, the number of open web pages has grown rapidly, and their flexible structure makes text understanding difficult:
1. the text structure is not fixed, and no specific context grammar exists;
2. the scope of the keywords is not fixed, and the related subject fields are various;
3. the text length is not fixed, and the difference of the context information quantity is large;
4. the data source is not fixed, and the language phenomenon is complex.
An entity is a thing that exists objectively and can be distinguished from other things; it may be a concrete object or an abstract event. An entity attribute is a property of an entity. Entity attribute extraction is a key technology for text understanding: it gathers the attributes of an entity from different information sources so that they reflect the entity from different angles and complete the knowledge about it. It plays an important role in research such as information extraction, event tracking, and name disambiguation.
Given the characteristics of open web pages, traditional entity attribute extraction methods have the following limitations:
first, the text structure of open web pages is not fixed; entities and their descriptions follow no fixed pattern and mostly appear in free text, which is hard to extract and analyze;
second, in traditional rule-based attribute extraction methods, the rule definitions are rigid, depend too heavily on the context grammar, and match inefficiently;
third, the data sources of open web pages are not fixed and the language phenomena are complex, so common rules can hardly cover them, and traditional rule-based attribute extraction does not support nested matching of rules;
fourth, in traditional statistics-based entity attribute extraction methods, the preparation of training data relies too heavily on manual work, so efficiency is low, and accuracy and recall are low;
fifth, traditional attribute extraction is mostly confined to a particular field or subject; such a system cannot be transplanted directly to other fields or subjects, lacks generally applicable association features, and is hard to port and extend.
Disclosure of Invention
To solve the above problems, according to an embodiment of the present invention, an entity attribute extraction method for open web pages is provided, including:
step 1), extracting the texts of open web pages, and obtaining a candidate text set of a target entity from the texts;
step 2), according to the frequency with which the target entity attribute appears in a training text set, selecting a rule-based approach or a statistics-based approach to extract the value of the target entity attribute from the candidate text set.
In the above method, step 1) comprises:
step 11), extracting unstructured texts from the open web pages, and segmenting the unstructured texts into words to obtain the relevance between each word and each unstructured text;
step 12), obtaining the one or more initial query expansion words closest to the target entity in its context, and taking the one or more unstructured texts with the highest relevance to the target entity and the initial query expansion words as a first text set;
step 13), selecting the one or more secondary query expansion words with the highest word frequency from the first text set, and taking the one or more unstructured texts with the highest relevance to the target entity and the secondary query expansion words as a second text set;
step 14), taking the union of the first text set and the second text set as the candidate text set of the target entity.
In the above method, the relevance of a plurality of words to an unstructured text is the sum of the relevance of each of those words to that text.
In the above method, step 2) includes: calculating the frequency with which the target entity attribute appears in the training text set; if the frequency exceeds a predetermined threshold, extracting the value of the target entity attribute with the constructed statistical model; otherwise, extracting the value of the target entity attribute with the constructed layered finite state automaton; wherein the training text set is the text set used to train the statistical model.
In the above method, the layered finite state automaton is constructed according to the following steps:
step a), performing entity recognition on the candidate text set and generating a concept file; wherein the concept file comprises: basic concepts, each indicating an entity type and the entities of that type recognized in the candidate text set; regular expressions, each indicating a variable to be extracted; and flag words, each indicating a relationship between an entity and an attribute;
step b), generating a rule file comprising the concept file and association rules; wherein an association rule is a single rule or a rule nesting a plurality of sub-rules, and indicates the relationships among the basic concepts, regular expressions, and flag words in the concept file;
step c), constructing the layered finite state automaton according to the association rules in the rule file; wherein the initial states of the layered finite state automaton are the basic concepts, regular expressions, and flag words, and the other states are the association rules and the sub-rules within them.
In the above method, extracting the value of the target entity attribute with the constructed layered finite state automaton comprises:
matching the candidate text set against the layered finite state automaton starting from the initial states, and building an inverted index over the content in the candidate text set matched by each state;
after matching is finished, obtaining the value of the target entity attribute from the inverted index.
In the above method, the statistical model is constructed according to the following steps:
step A), obtaining training entities and their corresponding training attributes from an online encyclopedia;
step B), obtaining a training text set for the training entities from training open web pages;
step C), extracting features from the training text set, and back-labeling the features with the training attributes to obtain training data for each attribute;
step D), generating a statistical model for each attribute from the training data.
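The labeling in step C), which matches attribute values already known from the encyclopedia back onto the training text, can be illustrated with a minimal sketch. The B-/I-/O tagging scheme, the function name `back_label`, and the restriction to sentences that mention the entity are assumptions made for illustration; the patent does not specify them.

```python
def back_label(sentence_tokens, entity, value_tokens, attribute):
    """Tag tokens with B-/I-<attribute> where the known value occurs, 'O' elsewhere.

    Hypothetical helper: the BIO scheme and the entity filter are assumptions.
    """
    labels = ["O"] * len(sentence_tokens)
    # Assumption: only sentences that mention the training entity are labeled.
    if entity not in sentence_tokens or not value_tokens:
        return labels
    n = len(value_tokens)
    for i in range(len(sentence_tokens) - n + 1):
        if sentence_tokens[i:i + n] == value_tokens:
            labels[i] = "B-" + attribute
            for j in range(i + 1, i + n):
                labels[j] = "I-" + attribute
    return labels


tokens = ["Alice", "was", "born", "on", "12", "May", "1980"]
labels = back_label(tokens, "Alice", ["12", "May", "1980"], "BIRTH")
print(list(zip(tokens, labels)))
```

Sentences labeled this way would supply positive examples, and sentences without the value would supply negative ones.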
In the above method, step B) comprises:
step B1), extracting an unstructured text from the training open webpage, and segmenting the unstructured text to obtain the correlation degree between words and the unstructured text;
step B2), obtaining n initial query expansion words nearest to the training entity according to the context information of the training entity in the training open webpage, and taking K unstructured texts with the highest correlation degree with the training entity and the initial query expansion words as a third text set; wherein n and K are positive integers;
step B3), m secondary query expansion words with the highest word frequency are selected from the third text set, and L unstructured texts with the highest correlation degree with the training entities and the secondary query expansion words are used as a fourth text set, wherein m and L are positive integers;
step B4), taking the union of the third text set and the fourth text set as a training text set.
In the above method, step C) further comprises: removing noise from the training data, and controlling the ratio of positive examples to negative examples in the training data.
In the above method, the features include words, dependency relations between words, word frequencies, and parts of speech.
In the above method, extracting the value of the target entity attribute according to the constructed statistical model includes:
extracting features from the candidate text set in the same way as features were extracted when constructing the statistical model;
inputting the extracted features into the statistical model corresponding to the target entity attribute to obtain the value of the target entity attribute.
The method further comprises:
step 3), correcting the extracted value of the target entity attribute according to the type, part of speech, or value range of the target entity attribute.
According to an embodiment of the present invention, there is also provided an open web page-oriented entity attribute extraction system, including:
the webpage preprocessing module is used for extracting the text of the open webpage;
the query expansion module is used for obtaining a candidate text set of the target entity from the extracted text;
and the attribute extraction module is used for selecting a rule-based mode or a statistical-based mode to extract the value of the target entity attribute from the candidate text set according to the frequency of the target entity attribute appearing in the training text set.
The invention has the following beneficial effects:
1. an entity attribute extraction method based on a layered finite state automaton is provided, which enables extraction with complex nested rules;
2. during extraction based on the layered finite state automaton, an inverted index is built over the content matched by each state of the automaton, which greatly improves rule matching efficiency;
3. a grammar-independent language for defining concepts and rules is formulated, so that entity attribute extraction is decoupled from the contextual language environment, declarative information extraction is achieved, and system compatibility is improved;
4. a set of sentence-level text features is provided for CRF model training, which can improve the machine learning effect in attribute extraction;
5. a method is provided for automatically generating CRF training data from the existing attribute information in the attribute boxes (Infoboxes) of an online encyclopedia, and the parts that need manual verification are identified from the back-labeling results, which improves the efficiency and accuracy of training data preparation;
6. an iterative query expansion method is provided, which has been verified to improve the accuracy and recall of entity attribute extraction from open web pages;
7. entity attribute extraction from open web pages is performed adaptively, using a rule-based or statistics-based extraction method according to how frequently the attribute occurs.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of an open web page-oriented entity attribute extraction method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of an iterative query expansion method according to one embodiment of the present invention;
FIG. 3 is a flow diagram of a method of adaptive entity attribute extraction according to one embodiment of the present invention;
FIG. 4 is a flow diagram of a method for constructing a layered finite state automaton and extracting attributes with association rules based on the layered finite state automaton, according to one embodiment of the invention;
FIG. 5 is a schematic diagram of a layered finite state automaton according to one embodiment of the invention;
FIG. 6 is a schematic diagram of an initial inverted index, according to one embodiment of the invention;
FIG. 7 is a diagram of a layered finite state automaton matching a set of candidate texts, according to one embodiment of the invention;
FIG. 8 is a schematic diagram of an inverted index upon completion of a match, in accordance with one embodiment of the present invention;
FIG. 9 is a flow diagram of a method for constructing a conditional random field model and extracting attributes based on the conditional random field model, according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
According to one embodiment of the invention, an entity attribute extraction method facing an open webpage is provided.
In summary, the method comprises: extracting texts of the open web pages, and obtaining a candidate text set of a target entity from the texts; and selecting a rule-based mode or a statistical-based mode to extract the value of the target entity attribute from the candidate text set according to the frequency of the target entity attribute in the training text set.
Before the entity attribute extraction method for open web pages is described, the entity attribute, the rule, and the statistical model are explained first. An entity attribute comprises a target entity, an attribute name, and an attribute value. A rule comprises a rule type, a target name with parameters, and a rule body. The text from which the statistical model draws its features comprises the text before the attribute name, the text between the attribute name and the attribute value, and the text after the attribute value.
The steps of the entity attribute extraction method for open web pages will now be described in detail with reference to FIG. 1. It should be noted that not every step of the method described in this specification is required; one or more steps may be omitted or replaced according to the actual situation, and the order of the steps may be adjusted.
Step S101: open web page preprocessing
According to one embodiment of the invention, the preprocessing process of the open webpage comprises the following steps:
1. Acquire the set of open web pages to be processed and extract the web page content to obtain the unstructured texts to be processed.
2. Segment the unstructured texts into words, calculate the relevance between each word and each unstructured text to obtain, for each word, the set of unstructured texts with the highest relevance (or matching degree), and build an inverted index from this information.
In one embodiment, the relevance between a word and an unstructured text is calculated from word frequency and other features. For example, the TF-IDF method can be used to obtain the relevance of a word to every unstructured text, and the k (k is a positive integer) unstructured texts with the highest relevance are then taken as the word's highest-relevance text set.
Step S102: obtaining candidate text set through iterative query expansion
Using the inverted index built in step S101, a candidate text set is generated by performing two rounds of query expansion that combine the context information of the target entity with word frequency information. FIG. 2 depicts one embodiment of the iterative query expansion method, comprising:
Step S201: according to the context information of the target entity E, obtain the n (n is a positive integer) entities (words) closest to E in its context, which are called query expansion words.
In one embodiment, n = 1 is selected, i.e., the words E1 and E2 immediately before and after the target entity E are used as query expansion words.
Step S202: initial query expansion.
The target entity and the query expansion words are queried in the inverted index built in step S101 to obtain the text set U1 of the unstructured texts with the highest relevance to the target entity and the query expansion words.
In one embodiment, the sum of the relevance of the target entity and of each query expansion word to an unstructured text is taken as their joint relevance to that text, and the texts are ranked by this value to obtain the text set U1 (e.g., containing 50 texts). In another embodiment, the highest-relevance text sets of the target entity and of the query expansion words are found separately, and their intersection is taken as the text set U1. Experiments show that the initial query expansion can improve the accuracy of entity attribute extraction.
Step S203: select from U1 the m (m is a positive integer) words with the highest word frequency.
In one embodiment, m = 2, i.e., the two entities with the highest word frequency in U1, E3 and E4, are selected for the second query expansion.
Step S204: secondary query expansion.
The m words with the highest word frequency and the target entity E are queried in the inverted index again to obtain the text set U2 with the highest relevance. For example, U2 is obtained with the method of step S202. Experiments show that this step can effectively improve the recall and accuracy of entity attribute extraction.
Step S205: take the union of the results of the two query expansions as the candidate text set for extracting the entity attributes of the target entity E (referred to as the candidate text set U for short).
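Steps S201 to S205 can be sketched as below, using the summed-relevance ranking of the first embodiment of step S202. Raw word counts stand in for relevance scores, and the names `top_texts` and `candidate_text_set`, together with the choice to exclude already used query words from the secondary expansion, are assumptions for illustration.

```python
from collections import Counter


def top_texts(relevance, words, k):
    """Rank texts by the summed relevance of all query words and keep the top k."""
    totals = Counter()
    for word in words:
        for doc_id, score in relevance.get(word, {}).items():
            totals[doc_id] += score
    return {doc_id for doc_id, _ in totals.most_common(k)}


def candidate_text_set(entity, context_words, relevance, texts, k_first, k_second, m):
    # S202: initial expansion with the entity and its nearest context words.
    u1 = top_texts(relevance, [entity, *context_words], k_first)
    # S203: the m most frequent words in U1 (assumption: skip words already queried).
    used = {entity, *context_words}
    freq = Counter(w for d in sorted(u1) for w in texts[d].split() if w not in used)
    secondary = [w for w, _ in freq.most_common(m)]
    # S204: secondary expansion; S205: union of both result sets.
    u2 = top_texts(relevance, [entity, *secondary], k_second)
    return u1 | u2


texts = {
    0: "alice born paris 1980",
    1: "alice alice paris school",
    2: "bob rome",
    3: "paris school school teacher",
}
# Assumption: raw word counts stand in for the TF-IDF relevance of step S101.
relevance = {}
for doc_id, text in texts.items():
    for word, count in Counter(text.split()).items():
        relevance.setdefault(word, {})[doc_id] = float(count)

candidates = candidate_text_set("alice", ["born"], relevance, texts,
                                k_first=2, k_second=3, m=1)
print(sorted(candidates))
```

In this toy data the second round adds text 3, which never mentions the target entity but shares the frequent word found in the first round.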
Step S103: adaptive entity attribute extraction
In summary, the adaptive entity attribute extraction process adaptively selects among entity attribute extraction methods according to the frequency with which the target entity attribute (or target attribute) appears in the training text set. The training text set is the text set used to train the statistical model (the model used by the statistics-based entity attribute extraction method described below), and the frequency of the target attribute in the training text set can be obtained from the frequency of the flag words (described below) in the training text set. If the frequency is higher than a predetermined threshold, the statistics-based entity attribute extraction method is used; otherwise, the rule-based entity attribute extraction method is used. The reason is that, for entity attributes that occur less frequently, the rule-based method gives better accuracy and execution efficiency, whereas for attributes that occur more frequently, the statistics-based method gives more comprehensive coverage.
The rule-based entity attribute extraction method supports rule nesting by constructing a layered finite state automaton, and builds an inverted index over the text content matched by each state (or node) of the automaton, so that complex text patterns can be matched quickly to obtain entity attribute values. The statistics-based entity attribute extraction method performs supervised machine learning following the conditional random field principle: text features are selected and a statistical model (such as a conditional random field model) is trained to extract entity attributes. As shown in FIG. 3, the adaptive entity attribute extraction process may include the following sub-steps:
Step S301: construct a layered finite state automaton.
Perform entity recognition on the candidate text set U, formulate a grammar-independent declarative language specification, define a concept set and an association rule set, and construct the layered finite state automaton according to the nesting dependencies of the rules.
Step S302: train a statistical model.
Select text features, generate training data, and train a statistical model, such as the CRF model M_CRF.
Step S303: calculate the frequency with which the target entity attribute appears in the training text set, and determine whether the frequency exceeds a predetermined threshold.
Step S304: if the determination in step S303 is negative, extract the attribute with association rules based on the layered finite state automaton (i.e., the rule-based entity attribute extraction method).
Step S305: if the determination in step S303 is affirmative, extract the attribute by machine learning based on the conditional random field (i.e., the statistics-based entity attribute extraction method). Sentence-level features are extracted from the candidate text set U to generate feature vectors, which are input to the statistical model trained in step S302 to extract the target attribute value.
It should be understood that the order of the above sub-steps may be changed; for example, the statistical model may be trained at any time before, or at the same time as, the construction of the layered finite state automaton.
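The selection between the two extraction methods can be sketched as follows. The patent does not give a formula for the attribute frequency, so this sketch assumes it is the fraction of training texts that contain at least one flag word of the attribute; the function names and the threshold value 0.3 are likewise hypothetical.

```python
def attribute_frequency(flag_words, training_texts):
    """Fraction of training texts containing at least one flag word (assumed definition)."""
    if not training_texts:
        return 0.0
    hits = sum(any(word in text for word in flag_words) for text in training_texts)
    return hits / len(training_texts)


def choose_extractor(flag_words, training_texts, threshold=0.3):
    """Return 'statistical' above the threshold, otherwise 'rule'."""
    freq = attribute_frequency(flag_words, training_texts)
    return "statistical" if freq > threshold else "rule"


training_texts = [
    "born in 1980",
    "was born in Paris",
    "graduated from MIT",
    "lives in Rome",
]
print(choose_extractor(["born"], training_texts))       # frequent attribute
print(choose_extractor(["graduated"], training_texts))  # rare attribute
```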
The adaptive entity attribute extraction process has been described in general above. The following describes in detail, first, constructing the layered finite state automaton and extracting attributes based on it, and second, training the statistical model (in particular the conditional random field model M_CRF) and extracting attributes based on it.
FIG. 4 depicts a method for constructing a layered finite state automaton and extracting attributes with association rules based on it. The method comprises the following sub-steps:
Step S401: entity recognition.
Named entity recognition is performed on the candidate text set U to obtain an entity set and to determine the type of each entity, such as person, place, or organization.
Step S402: generate a concept file.
The concept file is the collection of all concepts, comprising CONCEPT basic concepts, REGEX regular expressions, and CONCEPT flag words. They are defined declaratively in a language that is independent of the contextual grammar:
(1) A CONCEPT basic concept is a type variable to be extracted, such as person, place, or organization (each type is defined as one concept object). The definition format is CONCEPT:[concept name]:[example values], where the concept name is the type variable and the example values are the set of entities of that type; a definition can be regarded as a <type variable, value range of the type variable> pair. The basic concepts are obtained from the entity set generated in step S401. For example, all entities of the "organization" type form a concept named ORG, whose value range is the instances of the organization type found in the text, such as graduate institute of science and technology university, beijing university, chinese institute of technology, etc.; the ORG concept is defined as shown in Table 1.
TABLE 1
[Table 1 image: BDA0000670711360000091]
(2) A REGEX regular expression is a regular expression for a variable to be extracted. Its format is REGEX:[concept name]:[regular expression content]. An example is given in Table 2.
TABLE 2
REGEX:DATE:([\d]{4}year){0,1}([\d]{1,2}month){0,1}([\d]{1,2}day){0,1}
Here DATE is the name of the regular expression, and ([\d]{4}year){0,1}([\d]{1,2}month){0,1}([\d]{1,2}day){0,1} is the regular expression that DATE refers to.
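To show how the DATE expression behaves, here is a small Python sketch. It assumes that "year", "month", and "day" in the translated pattern stand for the Chinese date suffixes 年, 月, and 日, which the translation does not confirm; the helper name `find_dates` and the sample sentence are invented for illustration.

```python
import re

# Assumption: the translated tokens "year", "month", "day" stand for 年, 月, 日.
DATE = re.compile(r"([\d]{4}年){0,1}([\d]{1,2}月){0,1}([\d]{1,2}日){0,1}")


def find_dates(text):
    """Return every non-empty DATE match.

    All three groups are optional, so the pattern also matches the empty
    string at every position; those empty matches are dropped here.
    """
    return [m.group(0) for m in DATE.finditer(text) if m.group(0)]


sample = "他出生于1980年5月12日，毕业于2003年7月。"
print(find_dates(sample))
```

Because each part is optional, the same concept covers full dates as well as year-and-month or year-only mentions.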
(3) A CONCEPT flag word is a word related to the attribute to be extracted, i.e., a word that marks the relationship between an entity and an attribute. Flag words are used to formulate association rules, and their format is CONCEPT:[concept name]:[flag word values]. For example, to extract the attribute "date of birth" of a person, the required flag words can be as shown in Table 3.
TABLE 3
[Table 3 image: BDA0000670711360000092]
The CONCEPT and REGEX definitions above are merged into one overall concept set, i.e., the concept file, as shown in Table 4.
TABLE 4
[Table 4 images: BDA0000670711360000093, BDA0000670711360000101]
Step S403: generate a rule file.
An association rule, MCONCEPT_RULE, characterizes the relationships between concepts through Boolean logic constraints and context constraints on the concepts. Its format is MCONCEPT_RULE:[rule name]([variables to be output]): followed by the rule body, which may use the following constraints:
(1) AND: a string is matched if all clauses appear in it;
(2) OR: a string is matched if any clause appears in it;
(3) SENT: a string is matched if all clauses appear in the same sentence;
(4) ORD: a string is matched if all clauses appear in the order defined by the rule;
(5) DIST_n: a string is matched if all clauses appear in it and the distance between adjacent clause instances does not exceed n words.
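The five constraints can be sketched as a single matching function. This is an illustration under stated assumptions: the text is already split into sentences and tokens, each clause is given as a set of instance tokens, and distance is measured as the difference between token positions. The name `match` and this reading of DIST_n are not taken from the patent.

```python
from itertools import product


def positions(clause, tokens):
    """Indexes of the tokens that are instances of the clause."""
    return [i for i, tok in enumerate(tokens) if tok in clause]


def match(op, clauses, sentences, n=0):
    """Evaluate one constraint over a string given as a list of tokenized sentences."""
    tokens = [tok for sent in sentences for tok in sent]
    if op == "AND":
        return all(positions(c, tokens) for c in clauses)
    if op == "OR":
        return any(positions(c, tokens) for c in clauses)
    if op == "SENT":
        return any(all(positions(c, s) for c in clauses) for s in sentences)
    if op == "ORD":
        last = -1
        for c in clauses:  # greedily take the earliest instance after the previous one
            later = [p for p in positions(c, tokens) if p > last]
            if not later:
                return False
            last = later[0]
        return True
    if op == "DIST":
        # Assumption: some choice of one instance per clause has neighbor gaps <= n.
        for combo in product(*(positions(c, tokens) for c in clauses)):
            ordered = sorted(combo)
            if all(b - a <= n for a, b in zip(ordered, ordered[1:])):
                return True
        return False
    raise ValueError(f"unknown constraint: {op}")


sentences = [["Alice", "was", "born", "in", "1980"], ["She", "studied", "in", "Paris"]]
NAME, BIRTH, DATE, CITY = {"Alice"}, {"born"}, {"1980"}, {"Paris"}
print(match("SENT", [NAME, BIRTH, DATE], sentences))
print(match("DIST", [NAME, BIRTH, DATE], sentences, n=2))
```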
The CONCEPT and REGEX definitions generated in step S402 are the concepts to be extracted, and the text matching a concept is an instance of that concept. Table 5 shows a rule that extracts a person's "date of birth".
TABLE 5
MCONCEPT_RULE:NAME_BIRTHDAY(person,birthday):(DIST_20,
"_person{NAME}","BIRTH_OR","_birthday{DATE}")
Here NAME_BIRTHDAY is the rule name, and (person, birthday) indicates that the two variables person and birthday are output, where person is an instance of the NAME concept and birthday is an instance of the DATE variable. The meaning of NAME_BIRTHDAY is: if instances of the NAME, BIRTH_OR, and DATE concepts appear together and are no more than 20 words apart, the clause matching NAME (i.e., the NAME instance) is output as person and the clause matching DATE is output as birthday.
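A flat rule such as the one in Table 5 can be split into its parts with a short parser. The sketch below is an assumption about how the rule line might be read, not the patent's parser: it handles only rules without nested sub-rules, and the names `parse_rule`, `RULE`, and `BOUND` are hypothetical.

```python
import re

# Header: MCONCEPT_RULE:<rule name>(<outputs>):(<operator>,<quoted clauses>)
RULE = re.compile(r'^MCONCEPT_RULE:(\w+)\(([^)]*)\):\((\w+),(.*)\)$')
# A clause of the form _var{CONCEPT} binds the matched instance to an output variable.
BOUND = re.compile(r'^_(\w+)\{(\w+)\}$')


def parse_rule(line):
    """Parse one flat MCONCEPT_RULE definition (nested sub-rules are not handled)."""
    header = RULE.match("".join(line.split()))  # drop line breaks and spaces
    if header is None:
        raise ValueError("not a flat MCONCEPT_RULE definition")
    name, outputs, operator, body = header.groups()
    clauses = []
    for clause in re.findall(r'"([^"]*)"', body):
        bound = BOUND.match(clause)
        clauses.append((bound.group(2), bound.group(1)) if bound else (clause, None))
    return {"name": name, "outputs": outputs.split(","),
            "operator": operator, "clauses": clauses}


TABLE_5 = '''MCONCEPT_RULE:NAME_BIRTHDAY(person,birthday):(DIST_20,
"_person{NAME}","BIRTH_OR","_birthday{DATE}")'''
rule = parse_rule(TABLE_5)
print(rule)
```

Each clause comes back as a (concept, output variable) pair, with None for clauses such as BIRTH_OR that constrain the match but are not output.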
As another example, Table 6 shows the NAME_COLLEGE rule, which extracts a person's graduation time and college.
TABLE 6
[Table 6 image: BDA0000670711360000111]
Table 6 means the following: first, the sub-rule (ORD, "DEGREE_GET_OR", "DEGREE") must match successfully, i.e., there must be a DEGREE clause that appears after a DEGREE_GET_OR clause; second, once the sub-rule has matched, if the sentence containing those clauses also contains instances of the NAME, DATE, and ORG concepts, the clause matching NAME (i.e., the NAME instance) is output as person, the clause matching DATE is output as graduatetime, and the clause matching ORGANIZATION is output as college.
The rule file consists of all MCONCEPT_RULE definitions together with the full concept set; a line beginning with "#" is a comment. Table 7 shows the rule file for extracting a person's attributes "date of birth", "graduate school", and "contact address".
TABLE 7
[Table 7 images: BDA0000670711360000112, BDA0000670711360000121, BDA0000670711360000131]
Step S404: construct the layered finite state automaton.
According to the nesting dependencies among the rules, the rules are converted into a group of mutually dependent finite state automata: each concept is an initial state, and the layered finite state automaton is generated level by level from the constraint relationships between concepts and the nesting dependencies between rules. The layered finite state automaton is tree-shaped. Its bottom layer consists of the initial states, and a state that can be transitioned to from another state is regarded as that state's parent. An initial state is a concept, a parent state is a rule or sub-rule, and the transition function is the constraint condition of that rule or sub-rule. Transitions proceed upward from the initial states to the rule or sub-rule states through the constraint conditions and nesting relationships of the rules, forming the layered finite state automaton shown in FIG. 5.
Step S405: matching the candidate text set U against the initial states of the cascaded finite state automaton and building an initial inverted index.
In this step, the candidate text set U is matched against the initial states, i.e., against all concepts, and an inverted index is built for the text that matches them. Each state serves as a term, the text matched by the state forms the postings list of that term, and each term holds a pointer to its postings list, as shown in FIG. 6.
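A minimal sketch of building such an initial inverted index is given below. It assumes that every concept can be expressed as a regular expression; the patterns, the function name `build_initial_index` and the postings format `(sentence_id, start, end, text)` are illustrative assumptions rather than the contents of the actual concept file.

```python
import re
from collections import defaultdict

# Hypothetical concept definitions: each concept is a regular expression.
CONCEPTS = {
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),
    "DEGREE": re.compile(r"\b(?:bachelor|master|doctor)\b", re.IGNORECASE),
    "DEGREE_GET_OR": re.compile(r"\b(?:received|obtained|graduated)\b", re.IGNORECASE),
}


def build_initial_index(sentences, concepts):
    """Map each initial state (term) to its postings list of matches."""
    index = defaultdict(list)
    for sid, sentence in enumerate(sentences):
        for name, pattern in concepts.items():
            for m in pattern.finditer(sentence):
                index[name].append((sid, m.start(), m.end(), m.group()))
    return index


sentences = [
    "He received a master degree in 1981.",
    "The institute was founded in 1956.",
]
index = build_initial_index(sentences, CONCEPTS)
for term, postings in index.items():
    print(term, postings)
```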
Step S406: performing state transitions on the cascaded finite state automaton to obtain the entity attributes.
Taking the initial states as the starting point, each state of the cascaded finite state automaton is checked, from bottom to top, for whether a state transition is possible. The state transition function is the constraint condition of the rule or sub-rule represented by the parent state; whether the transition is possible is determined by checking whether the other states required by that rule or sub-rule exist in the inverted index. If the transition is possible, an inverted index is built for the text matched by the new state and added to the inverted index generated in step S405, and the check continues upward. If no transition is possible, the current state is the most complex rule that matched successfully: upward matching terminates, the instances of the concepts contained in the rule are output according to the inverted index, and the attribute value represented by the rule is obtained. Throughout matching, an inverted index of the text matched by each state is maintained dynamically.
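The bottom-up traversal can be sketched as follows. The sketch deliberately simplifies the constraint conditions: a rule is taken to match when all of its required states occur in the same sentence, whereas ordering (ORD) and distance constraints are omitted. The function name `run_transitions`, the index format (state name mapped to a set of sentence ids) and the sub-rule name `DEGREE_SUB` are assumptions.

```python
def run_transitions(index, rules):
    """Propagate matches bottom-up until no further transition is possible.

    index: dict mapping a state name to the set of sentence ids it matched
           (the initial inverted index over concepts).
    rules: dict mapping a rule or sub-rule name to its required state names.
    Returns the index extended with every rule state that was reached.
    """
    index = {state: set(ids) for state, ids in index.items()}
    changed = True
    while changed:
        changed = False
        for rule, required in rules.items():
            if rule in index:
                continue  # already matched; the same (sub-)rule is never re-matched
            if not all(r in index for r in required):
                continue  # a required state is missing from the inverted index
            shared = set.intersection(*(index[r] for r in required))
            if shared:
                index[rule] = shared  # transition succeeds; extend the index
                changed = True
    return index


initial = {
    "NAME": {0, 1}, "DATE": {0, 1}, "BIRTH_OR": {0},
    "ORG": {1}, "DEGREE_GET_OR": {1}, "DEGREE": {1},
}
rules = {
    "NAME_COLLEGE1": ["NAME", "DATE", "ORG", "DEGREE_SUB"],
    "DEGREE_SUB": ["DEGREE_GET_OR", "DEGREE"],  # hypothetical sub-rule name
    "NAME_BIRTHDAY": ["NAME", "DATE", "BIRTH_OR"],
}
final = run_transitions(initial, rules)
print(final["NAME_BIRTHDAY"], final["NAME_COLLEGE1"])
```

Because a matched sub-rule is stored in the index like any concept, a nested rule such as NAME_COLLEGE1 only needs an index lookup for it on a later pass.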
For example, taking the rule file obtained in step S403 as input yields a cascaded finite state automaton, which is matched against the candidate text set U to obtain attribute values, as shown in Table 8.
TABLE 8
Figure BDA0000670711360000141
Figure BDA0000670711360000151
In this example, the three rules NAME_COLLEGE1, NAME_COLLEGE2 and NAME_BIRTHDAY described above are input. When a text segment is matched, a cascaded finite state automaton is first constructed according to the dependencies among the rules, as shown in FIG. 5.
Using the constructed cascaded finite state automaton, the initial states are matched against the candidate text, and an inverted index is built for the content matched by NAME, DATE, BIRTH_OR, GRADUATE, ORG, DEGREE_GET_OR and DEGREE, as shown in FIG. 6.
State transitions (step S406) are then performed on the cascaded finite state automaton. From the matched initial state NAME, the reachable states are the NAME_BIRTHDAY, NAME_COLLEGE1 and NAME_COLLEGE2 rules, and each is checked in turn for whether the other states it requires are satisfied. For example, NAME_BIRTHDAY requires DATE and BIRTH_OR to occur within a distance of 20. If this is satisfied and no further transition is possible, the state is a terminating state and matching stops: the instance "LIGENOJIE" of the NAME variable is output as person, and the instance "May 1943" of the DATE variable is output as the date of birth. An inverted index entry is also created for NAME_BIRTHDAY so that it can be looked up when it forms part of a nested rule. The NAME_COLLEGE1 rule additionally requires its sub-rule; if the sub-rule is not found in the inverted index, the transition fails and matching of that rule terminates. State transition then continues with NAME_COLLEGE2, whose required states are checked against the inverted index in the same way.
The inverted index of the text matched by each state is maintained dynamically during matching; the inverted index finally generated is shown in FIG. 8.
In the above sub-steps, a single cascaded finite state automaton is generated for all rules and traversed from bottom to top, which avoids matching the same sub-rule repeatedly. The inverted index is likewise built from bottom to top during matching, which increases the matching speed.
With reference to FIG. 9, a method of constructing a statistical model (specifically, a conditional random field model) and extracting attributes based on the statistical model is described.
In summary, the method tests the effect of different text features on the conditional random field model, selects the best text features (words, dependency relations, parts of speech and word frequency), and sets the parameters of the template file. The entity-attribute relations present in the attribute boxes of online encyclopedia pages are extracted and back-labeled to generate training data automatically; a conditional random field model M_CRF is then trained for each attribute and used to extract attributes from the candidate text set U.
In detail, the method comprises the following sub-steps:
Step S501: obtaining training entities and training attributes.
The contents of the attribute boxes (Infoboxes) of an online encyclopedia are obtained and a set of known <entity-attribute> pairs is generated, which yields the training entities and training attributes.
The Infobox is a tabular area which structurally describes the attribute of the vocabulary entry in the online encyclopedia vocabulary entry page.
Step S502, performing iterative query expansion on the training entity in the open web page set for training according to the method described in step S102, so as to obtain a text set (or training text set) for training.
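The two-round query expansion can be sketched as follows, based on the steps recited in claims 2, 3 and 7. Several points are assumptions made for this sketch only: the per-word relevance is taken to be the raw term count, texts are pre-tokenized lists, ties are broken by text id, and the entity and initial expansion words are excluded when picking secondary expansion words. The names `relevance`, `top_texts` and `expand` are hypothetical.

```python
from collections import Counter


def relevance(words, text_tokens):
    """Relevance of several words to a text: the sum of the per-word relevance.

    Assumption: the per-word relevance is the raw term count in the text.
    """
    counts = Counter(text_tokens)
    return sum(counts[w] for w in words)


def top_texts(query_words, texts, k):
    """Ids of the k texts most relevant to the query words (ties broken by id)."""
    ranked = sorted(texts, key=lambda tid: (-relevance(query_words, texts[tid]), tid))
    return set(ranked[:k])


def expand(entity, initial_words, texts, k, m, top_l):
    """Return the union of the first-round and second-round text sets."""
    first = top_texts([entity] + initial_words, texts, k)
    # Secondary expansion words: the m most frequent words in the first text set.
    freq = Counter(
        tok for tid in first for tok in texts[tid]
        if tok != entity and tok not in initial_words
    )
    secondary = [w for w, _ in freq.most_common(m)]
    second = top_texts([entity] + secondary, texts, top_l)
    return first | second


texts = {
    0: ["li", "professor", "institute", "computing"],
    1: ["li", "institute", "born", "1943"],
    2: ["weather", "report", "rain"],
    3: ["institute", "computing", "computing", "research"],
}
print(expand("li", ["professor"], texts, k=2, m=1, top_l=3))
```

In this toy run the second round adds text 3, which never mentions the entity's initial expansion word but shares the frequent word found in the first round.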
Steps S503 to S507 extract the features of the training data, wherein:
Step S503, splitting the texts in the training text set into sentences, the sentence being the unit of model training;
Step S504, performing word segmentation on each sentence to obtain the words it contains;
Step S505, tagging the part of speech of each word, such as noun, pronoun, verb or adjective;
Step S506, performing dependency analysis on each word to determine the governing relations among the words, for example by means of a dependency tree;
Step S507, calculating the word frequency of each word, i.e., the number of times each word appears in the text.
The words, dependency relations, parts of speech and word frequencies of each sentence are extracted as the features used for machine learning.
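Steps S503 to S507 can be sketched as a small pipeline. The sketch is an illustration under assumptions: sentence splitting and word segmentation are done with simple regular expressions (a real system for Chinese text would use a dedicated word segmenter), and part-of-speech tagging and dependency parsing are left as optional pluggable callables that default to the placeholder tag "UNK". All function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """S503: split text into sentences at common English and Chinese end marks."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p for p in parts if p]


def tokenize(sentence):
    """S504: naive word segmentation based on word characters."""
    return re.findall(r"\w+", sentence.lower())


def extract_features(text, pos_tagger=None, dep_parser=None):
    """Return, per sentence, rows of (word, part of speech, dependency, frequency)."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    freq = Counter(tok for sent in sentences for tok in sent)  # S507
    rows = []
    for sent in sentences:
        pos = pos_tagger(sent) if pos_tagger else ["UNK"] * len(sent)  # S505
        dep = dep_parser(sent) if dep_parser else ["UNK"] * len(sent)  # S506
        rows.append([(w, p, d, freq[w]) for w, p, d in zip(sent, pos, dep)])
    return rows


rows = extract_features("Li works at the institute. The institute is in Beijing.")
for sentence_rows in rows:
    print(sentence_rows)
```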
Step S508: generating training data.
For each known <entity-attribute> pair generated in step S501, the feature data is back-labeled, thereby generating training data for each attribute. Here "back-labeling" means marking the features of each sentence as a positive example or a negative example of the attribute; the back-labeled feature data serves as the training data.
Generating the training data comprises generating positive examples and generating negative examples. The specific process is as follows: if the work unit of the entity LiCogjie is known to be the "Chinese institute of computing", the feature data of all sentences in the candidate text set of the entity LiCogjie that contain "Chinese institute of computing" is back-labeled as positive examples; named entity recognition is performed on the remaining sentences, and the feature data of those sentences that contain an organization is back-labeled as negative examples.
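The back-labeling procedure can be sketched as follows. The sketch works on raw sentences rather than feature rows for brevity, and the named entity recognizer is replaced by a caller-supplied predicate; `back_label`, `toy_org_detector` and the sample sentences are hypothetical. Sentences that contain neither the known value nor an entity of the same type are assumed to be left unlabeled.

```python
def back_label(sentences, known_value, contains_same_type_entity):
    """Label sentences as positive (1) or negative (0) examples of an attribute.

    known_value: the attribute value known from the <entity-attribute> pair.
    contains_same_type_entity: stand-in for named entity recognition; returns
        True if the sentence mentions an entity of the attribute's type.
    """
    labeled = []
    for sentence in sentences:
        if known_value in sentence:
            labeled.append((sentence, 1))  # contains the known value: positive
        elif contains_same_type_entity(sentence):
            labeled.append((sentence, 0))  # another entity of that type: negative
    return labeled


def toy_org_detector(sentence):
    """Toy stand-in for an organization recognizer."""
    return "University" in sentence or "Institute" in sentence


sentences = [
    "Li works at the Institute of Computing.",
    "He visited Example University in 1990.",
    "He was born in 1943.",
]
training = back_label(sentences, "Institute of Computing", toy_org_detector)
print(training)
```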
Step S509: manual correction.
Manually correcting the training data generated in step S508 includes, but is not limited to:
(1) Removing noise. For example, if the work unit of an entity is known to be the "department of Chinese academy institute", every sentence containing that string is labeled as a positive example in the automatically generated training data. A sentence such as "work at department of Chinese academy institute" is correctly labeled as a positive example, whereas a sentence such as "department of Chinese academy institute is located at the department of Chinese academy" is wrongly labeled as a positive example; such wrong labels must be corrected and removed manually.
(2) Controlling the ratio of positive to negative examples. If there are too many positive examples, more noise is introduced into the extraction results; if there are too many negative examples, the recall of the extraction results is too low. The ratio of positive to negative examples therefore needs to be controlled (e.g., 1:3).
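One simple way to enforce such a ratio is to downsample the negative examples, sketched below. The text only states that the ratio is controlled; random downsampling, the function name `balance` and the fixed seed are assumptions of this sketch.

```python
import random


def balance(examples, ratio=3, seed=0):
    """Keep at most `ratio` negative examples per positive example.

    examples: list of (features, label) pairs with label 1 (positive) or 0 (negative).
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    limit = ratio * len(positives)
    if len(negatives) > limit:
        # Assumption: surplus negatives are dropped by random sampling.
        negatives = random.Random(seed).sample(negatives, limit)
    return positives + negatives


data = [(f"pos{i}", 1) for i in range(2)] + [(f"neg{i}", 0) for i in range(10)]
balanced = balance(data, ratio=3)
print(len(balanced))
```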
Step S510: formulating a template file, in which the rows, columns and window size used for the conditional probabilities are determined according to the context information.
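As an illustration of what such a template file might contain, the sketch below generates feature templates in the syntax used by the CRF++ toolkit, where `%x[row,col]` refers to the token at relative position `row` and feature column `col`. The description does not name a toolkit, so the choice of CRF++ syntax, the function name `make_template` and the column order (word, part of speech, dependency, word frequency) are assumptions.

```python
def make_template(num_columns, window):
    """Generate CRF++-style unigram templates for every column in a token window.

    num_columns: number of feature columns per token (e.g. 4 for word,
        part of speech, dependency relation and word frequency).
    window: number of tokens considered on each side of the current token.
    """
    lines = ["# Unigram"]
    n = 0
    for col in range(num_columns):
        for row in range(-window, window + 1):
            lines.append(f"U{n:02d}:%x[{row},{col}]")
            n += 1
    # A single "B" line requests bigram features over adjacent output labels.
    lines += ["", "# Bigram", "B"]
    return "\n".join(lines)


template = make_template(num_columns=4, window=1)
print(template)
```

A larger window gives the model more context per token at the cost of more features, which is the trade-off the template parameters control.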
Step S511: inputting the training data obtained after step S509 and generating a CRF model M_CRF for each attribute through supervised machine learning.
Step S512: extracting attribute values according to the generated CRF models.
Features are extracted from the candidate text set U obtained in step S102 in the same way as in sub-steps S503 to S507 above.
The M_CRF corresponding to the target attribute is selected, and the target attribute of the target entity is extracted from the extracted features with that M_CRF to obtain the attribute value.
Step S104: attribute correction
The extracted attribute values are corrected according to constraints on the attribute such as part of speech, type and range, and values that do not meet the requirements are removed. For example, for attributes of a person such as child, spouse and parent, the value should itself be of the person type; as another example, an attribute describing age is typically a number between 1 and 120.
In one embodiment, the correction rules for attributes include, but are not limited to:
1) Attribute type checking: whether an extracted attribute value is correct is judged from the type constraint of the attribute, and attribute values whose type does not match are rejected;
2) Value range checking: according to the part of speech (such as noun, numeral or date) and the data range of an attribute, attribute values outside that range are removed.
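The two checks can be sketched as a small filter. The constraint table, the `is_number` type check and the function name `correct` are hypothetical; only the age range of 1 to 120 comes from the example in the text.

```python
def is_number(value):
    """Type check for numeric attributes: the value consists of decimal digits."""
    return value.isdecimal()


# Hypothetical constraint table: attribute name -> type check and value range.
CONSTRAINTS = {
    "age": {"type": is_number, "range": (1, 120)},
}


def correct(attribute, values, constraints=CONSTRAINTS):
    """Keep only the values that pass the type and range checks of the attribute."""
    rule = constraints.get(attribute, {})
    kept = []
    for value in values:
        type_check = rule.get("type")
        if type_check and not type_check(value):
            continue  # 1) attribute type checking
        if "range" in rule:
            low, high = rule["range"]
            if not low <= int(value) <= high:
                continue  # 2) value range checking
        kept.append(value)
    return kept


print(correct("age", ["72", "abc", "300", "0"]))
```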
According to an embodiment of the present invention, an entity attribute extraction system for open web pages is further provided, which includes a web page preprocessing module, a query expansion module, an attribute extraction module, and an attribute correction module.
The web page preprocessing module extracts the text of open web pages and builds an inverted index; the query expansion module obtains a candidate text set of the target entity from the extracted text; the attribute extraction module selects a rule-based mode or a statistics-based mode, according to the frequency with which the target entity attribute appears in the training text set, to extract the value of the target entity attribute from the candidate text set; and the attribute correction module corrects the extracted attribute value.
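The mode selection performed by the attribute extraction module can be sketched as follows. How the frequency is counted is not specified in this passage; counting the training sentences that mention a keyword of the attribute is an assumption, as are the function names and the sample data.

```python
def attribute_frequency(attribute_keywords, training_sentences):
    """Count the training sentences that mention any keyword of the attribute."""
    return sum(
        1 for sentence in training_sentences
        if any(keyword in sentence for keyword in attribute_keywords)
    )


def choose_mode(attribute_keywords, training_sentences, threshold):
    """Return "statistical" (CRF model) for frequent attributes, otherwise "rule"."""
    frequency = attribute_frequency(attribute_keywords, training_sentences)
    return "statistical" if frequency > threshold else "rule"


training_sentences = [
    "Li was born in 1943.",
    "She was born in Beijing.",
    "He works at the institute.",
]
print(choose_mode(["born"], training_sentences, threshold=1))
print(choose_mode(["spouse"], training_sentences, threshold=1))
```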
To verify the effectiveness of the entity attribute extraction method and system for open web pages provided by the invention, the inventors carried out experiments on the Slot Filling evaluation data of TAC KBP 2014, with the following experimental parameters:
The experimental data set comprises 100 entities (50 of the "person" type and 50 of the "organization" type) and a total of 41 attributes to be extracted (25 target attributes of the "person" type and 16 of the "organization" type). Of these, 30 attributes occur frequently and have sufficient training data, so the CRF method is used to train models for attribute extraction; the remaining 11 attributes occur infrequently, so the CFT rule-based method is used to extract them. The experimental data set contains 1001 attribute values in total.
The optimal parameter configuration was found during the experiments. For query expansion, the expansion window is set to 1 and the top 50 texts are selected in each expansion round. CRF training uses 4 text features: the words contained in the sentence, the part of speech of each word, the word frequency of each word, and the dependency relations among the words. When the training data of the CRF model is generated, the ratio of positive to negative examples is 1:3; verification showed that this configuration achieves the highest recall and accuracy.
Through experiments, the following results are obtained:
A total of 412 results were extracted, of which 243 were hits, giving an accuracy of 58.98%. Among existing extraction techniques, the natural language processing group of Stanford University performed best, using machine learning over the words relating the two entities of a pair, with an accuracy of 58.54%; the comprehensive multi-strategy search method of the RPI BLENDER team of Rensselaer Polytechnic Institute also performed well, with an accuracy of 47.80%.
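The reported accuracy follows directly from the two counts given above, assuming it is computed as the number of hits divided by the number of extracted results:

```latex
\[
\text{accuracy} = \frac{\text{hits}}{\text{extracted results}}
= \frac{243}{412} \approx 0.5898 = 58.98\%
\]
```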
It should be understood that although this specification is described in terms of various embodiments, not every embodiment contains only a single technical solution. This manner of description is adopted for clarity only, and the embodiments described herein may be suitably combined to form other embodiments, as those skilled in the art will appreciate.
The above description is only an exemplary embodiment of the present invention, and is not intended to limit the scope of the present invention. Any equivalent alterations, modifications and combinations can be made by those skilled in the art without departing from the spirit and principles of the invention.

Claims (12)

1. An entity attribute extraction method for open web pages, comprising the following steps:
step 1), extracting texts of open webpages, and obtaining a candidate text set of a target entity from the texts;
step 2), selecting a rule-based mode or a statistics-based mode, according to the frequency with which the target entity attribute appears in the training text set, to extract the value of the target entity attribute from the candidate text set, comprising:
calculating the frequency of the target entity attribute in the training text set; if the frequency exceeds a preset threshold, extracting the value of the target entity attribute according to the constructed statistical model; otherwise, starting from the initial states of the constructed cascaded finite state automaton, extracting the value of the target entity attribute by building an inverted index for the text matched in the candidate text set; wherein the training text set is used for training the statistical model, and the statistical model is a conditional random field model.
2. The method of claim 1, wherein step 1) comprises:
step 11), extracting an unstructured text from the open webpage, and segmenting words of the unstructured text to obtain the correlation degree between the words and the unstructured text;
step 12), one or more initial query expansion words closest to the target entity in the context of the target entity are obtained, and one or more unstructured texts with the highest relevance with the target entity and the one or more initial query expansion words are used as a first text set;
step 13), selecting one or more secondary query expansion words with highest word frequency from the first text set, and taking one or more unstructured texts with highest relevance with a target entity and the one or more secondary query expansion words as a second text set;
step 14), taking the union of the first text set and the second text set as a candidate text set of a target entity.
3. The method of claim 2, wherein the relevance of a plurality of words to unstructured text is the sum of the relevance of each of the plurality of words to the unstructured text.
4. The method of claim 1, wherein the cascaded finite state automaton is constructed according to the following steps:
step a), performing entity recognition on the candidate text set and generating a concept file; wherein the concept file comprises: basic concepts, each indicating an entity type and the entities of that type identified from the candidate text set; regular expressions indicating the variables to be extracted; and landmarks indicating the relationship between an entity and an attribute;
step b), generating a rule file comprising the concept file and association rules; wherein an association rule is a single rule or a rule nesting a plurality of sub-rules, and indicates the relationships among the basic concepts, regular expressions and landmarks in the concept file;
step c), constructing the cascaded finite state automaton according to the association rules in the rule file; wherein the initial states of the cascaded finite state automaton are the basic concepts, regular expressions or landmarks, and the other states include the association rules and the sub-rules in the association rules.
5. The method of claim 4, wherein extracting the value of the target entity attribute according to the constructed cascaded finite state automaton comprises:
matching the candidate text set against the cascaded finite state automaton starting from the initial states, and building an inverted index for the content that each state matches in the candidate text set;
after matching is finished, obtaining the value of the target entity attribute from the built inverted index.
6. The method of claim 1, wherein the statistical model is constructed according to the following steps:
step A), obtaining training entities and corresponding training attributes from an online encyclopedia;
step B), obtaining a training text set of the training entity from a training open webpage;
step C), extracting features from the training text set, and back-labeling the features with the training attributes to obtain training data for each attribute;
step D), generating a statistical model corresponding to each attribute according to the training data.
7. The method of claim 6, wherein step B) comprises:
step B1), extracting an unstructured text from the training open webpage, and segmenting the unstructured text to obtain the correlation degree between words and the unstructured text;
step B2), obtaining n initial query expansion words nearest to the training entity according to the context information of the training entity in the training open webpage, and taking K unstructured texts with the highest correlation degree with the training entity and the initial query expansion words as a third text set; wherein n and K are positive integers;
step B3), m secondary query expansion words with the highest word frequency are selected from the third text set, and L unstructured texts with the highest correlation degree with the training entities and the secondary query expansion words are used as a fourth text set, wherein m and L are positive integers;
step B4), taking the union of the third text set and the fourth text set as a training text set.
8. The method of claim 6, wherein step C) further comprises:
removing impurities in the training data, and controlling the proportion of positive examples to negative examples in the training data.
9. The method of claim 6, wherein the characteristics include words, dependencies between words, word frequencies and parts of speech of words.
10. The method of any of claims 6-9, wherein extracting values for the attributes of the target entity according to the constructed statistical model comprises:
extracting features of the candidate text set in a manner that features are extracted when constructing the statistical model;
and inputting the extracted features into a statistical model corresponding to the target entity attribute to obtain the value of the target entity attribute.
11. The method according to any one of claims 1-3, further comprising:
step 3), correcting the extracted value of the target entity attribute according to the type, the part of speech or the value range of the target entity attribute.
12. An open web page-oriented entity attribute extraction system comprises:
the webpage preprocessing module is used for extracting the text of the open webpage;
the query expansion module is used for obtaining a candidate text set of the target entity from the extracted text;
the attribute extraction module is used for selecting a rule-based mode or a statistics-based mode, according to the frequency with which the target entity attribute appears in the training text set, to extract the value of the target entity attribute from the candidate text set; wherein the frequency of the target entity attribute in the training text set is calculated; if the frequency exceeds a preset threshold, the value of the target entity attribute is extracted according to the constructed statistical model; otherwise, starting from the initial states of the constructed cascaded finite state automaton, the value of the target entity attribute is extracted by building an inverted index for the text matched in the candidate text set; the training text set is used for training the statistical model, and the statistical model is a conditional random field model.
CN201510071993.6A 2015-02-11 2015-02-11 Entity attribute extraction method and system for open webpage Active CN104636466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510071993.6A CN104636466B (en) 2015-02-11 2015-02-11 Entity attribute extraction method and system for open webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510071993.6A CN104636466B (en) 2015-02-11 2015-02-11 Entity attribute extraction method and system for open webpage

Publications (2)

Publication Number Publication Date
CN104636466A CN104636466A (en) 2015-05-20
CN104636466B true CN104636466B (en) 2020-07-31

Family

ID=53215212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510071993.6A Active CN104636466B (en) 2015-02-11 2015-02-11 Entity attribute extraction method and system for open webpage

Country Status (1)

Country Link
CN (1) CN104636466B (en)

Also Published As

Publication number Publication date
CN104636466A (en) 2015-05-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Cheng Xueqi, Jia Yantao, Zhao Zeya, Wang Yuanzhuo, Jin Xiaolong, Xiong Jinhua, Li Manling, Lin Hailun, Xu Hongbo

Inventor before: Cheng Xueqi, Jia Yantao, Zhao Zeya, Wang Yuanzhuo, Xiong Jinhua, Li Manling, Lin Hailun, Xu Hongbo

GR01 Patent grant
GR01 Patent grant