CN102110087A - Method and device for resolving entities in character data - Google Patents


Publication number
CN102110087A
Authority
CN
China
Prior art keywords
entity
sets
chain
entity sets
altogether
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009102434748A
Other languages
Chinese (zh)
Inventor
宗良
万小军
杨建武
吴於茜
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd, Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd filed Critical BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Priority to CN2009102434748A priority Critical patent/CN102110087A/en
Publication of CN102110087A publication Critical patent/CN102110087A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for resolving entities in character data. The method comprises the following steps: obtaining a reference corpus and a corpus to be processed from the character data; obtaining a first entity set from the reference corpus and establishing coreference relations between the entities in the first entity set to obtain a second entity set; obtaining a third entity set from the corpus to be processed and constructing a training set and a test set from the third entity set and the first entity set; applying a classification method to the training set and the test set; and identifying the coreference relations between the third entity set and the second entity set according to the computation result. The invention overcomes the problem that the coreference relations between words presented to a user are wrong because the text is non-standard, colloquial, and so on, and thereby achieves accurate pointing that is easy for the user to recognize.

Description

Method and device for resolving entities in character data
Technical field
The present invention relates to the processing of relations in computer data, and in particular to a method and device for resolving entities in character data.
Background art
Coreference resolution is the process of merging the different descriptions of the same real-world entity; it mainly comprises personal-pronoun resolution and noun-phrase resolution. In the coreference resolution task, the referring expression currently under examination is called the anaphor, and the expression it points to is called the antecedent. For example, in the sentence "The leadership of [China Mining Group Co., Ltd.] innovated boldly and effectively halted the economic decline; the output value of [the company] grows by an average of 33% per year", when examining the entity that "the company" refers to, the expression "the company" is the anaphor, and "China Mining Group Co., Ltd." in the same sentence is the antecedent of that anaphor. Coreference resolution is the process of determining the antecedent that an anaphor points to.
A simple example illustrates coreference resolution in Chinese news comments. Suppose the following passage appears in the news body: "... the Guangdong Hongyuan team beat the Bayi Shuanglu Battery team 86-84 and leads its opponent 2:0 in total score ...", and the following comments are posted on this news item: "Bayi represents the past, Hongyuan represents the future", "Back at the home court, the Bayi team is sure to win", "The Guangdong team is only enjoying a moment of glory". In the comments, "Bayi" and "the Bayi team" both point to "the Bayi Shuanglu Battery team" in the news body, and "Hongyuan" and "the Guangdong team" both point to "the Guangdong Hongyuan team" in the news body. The purpose of entity resolution in Chinese news comments is to judge, for each entity in a comment, whether it points to some entity in the news body and, if so, to select the most representative entity from the news body as the antecedent of that entity.
Current coreference resolution algorithms are mainly based on binary classification: a series of features between the anaphor and the antecedent is defined, a machine learning method judges whether a coreference relation holds between the anaphor and the antecedent, and a linking strategy then merges all entity mentions that share a coreference relation into one entity. Existing coreference resolution systems all handle relatively well-formed corpora, such as news bodies and broadcast transcripts. A news comment is a genre in which people express opinions on recent events. As the role of people on the Internet gradually changes from consumers of information to providers of information, the large amount of information contained in news comments has become a focus of research. Compared with traditional text corpora, Chinese news comment corpora have the following characteristics:
1. Poor text standardization. Because news comments are written by Internet users, they may contain a large amount of non-standard content, commonly wrongly written characters, superfluous spaces, meaningless symbols, informal aliases, and so on. For example, "Huiyuan" may be mistakenly written with a wrong character of the same pronunciation, and "boycott Coca-Cola" may be written with many meaningless spaces between its characters.
2. Varied comment styles. Because comment authors have different backgrounds, different comments use different wording, sentence patterns, tone, and so on.
3. Brevity. News comments are used to express one's own view and generally do not need detailed exposition; a single news comment usually consists of only a few sentences.
4. Topic relevance. News comments express opinions on a recent news event, so most news comments closely revolve around the people or events mentioned in the news body.
Because such text is non-standard and colloquial, the coreference relations between words that are presented to the user can be wrong. For example, the label "place name" is displayed on the text "the Beijing Library" in the character data instead of "organization name: the National Library in Beijing", so that the user misunderstands the text when reading, searching, or translating.
Summary of the invention
The present invention aims to provide a method and device for resolving entities in character data that solve the problem that, because the text is non-standard and colloquial, the coreference relations between words presented to the user are wrong.
According to one aspect of the present invention, a method for resolving entities in character data is provided, comprising:
obtaining a reference corpus and a corpus to be processed from the character data;
obtaining a first entity set from the reference corpus, and establishing coreference relations between the entities in the first entity set to obtain a second entity set;
obtaining a third entity set from the corpus to be processed, and constructing a training set and a test set from the third entity set and the first entity set;
applying a classification method to the training set and the test set;
identifying the coreference relations between the third entity set and the second entity set according to the computation result.
Preferably, the entities in the second entity set that share a coreference relation are linked in a chain, forming a coreference chain.
Preferably, the process of constructing the training set comprises: for any entity A in the third entity set, if entity A is identified as having a coreference relation with an entity B in the first entity set, and entity B is present in a coreference chain C in the second entity set, entity A is paired with each entity in coreference chain C to form positive examples, and entity A is paired with the entities in the other coreference chains of the second entity set, that is, those other than coreference chain C, to form negative examples;
if entity B is not present in any coreference chain of the second entity set, entity A and entity B form a positive example, and entity A is paired with all entities in the coreference chains of the second entity set to form negative examples.
The process of constructing the test set comprises: pairing any entity in the third entity set with every entity in the first entity set, each pair forming one test example.
Preferably, the classification method is a decision tree, a Bayesian algorithm, a support vector machine, or a maximum entropy model.
Preferably, the process of computing on the training set and the test set comprises:
constructing feature functions and applying them to each positive and negative example in the training set to obtain the feature values of the training set, and applying them to each test example in the test set to obtain the feature values of the test set;
applying the classification method to the feature values of the training set to obtain a corresponding model, and applying that model to the feature values of the test set to obtain the computation result.
Preferably, the process of identifying according to the computation result comprises:
the computation result is the similarity value between entity D of the third entity set and entity E of the first entity set in the current test example;
if the similarity value is greater than a similarity threshold, it is determined whether entity E is in a coreference chain in the second entity set; if it is not, entity E is taken as the entity that entity D points to; if it is, one entity is selected from that coreference chain as the entity that entity D points to.
According to another aspect of the present invention, a device for resolving entities in character data is also provided, comprising:
a selection unit, configured to obtain a reference corpus and a corpus to be processed from the character data;
a first resolution unit, configured to identify a first entity set in the reference corpus and establish coreference relations between the entities in the first entity set to obtain a second entity set, or to identify a third entity set in the corpus to be processed;
a construction unit, configured to construct a training set and a test set from the third entity set and the first entity set;
a classification unit, configured to perform classification on the training set and the test set to obtain a computation result;
a second resolution unit, configured to identify the coreference relations between the third entity set and the second entity set according to the computation result.
Preferably, the second entity set obtained by the first resolution unit is an entity set in which entities are linked in chains to form coreference chains.
Preferably, the construction unit comprises:
a training set construction module, configured, for any entity A in the third entity set, if entity A is identified as having a coreference relation with an entity B in the first entity set and entity B is present in a coreference chain C in the second entity set, to pair entity A with each entity in coreference chain C to form positive examples and to pair entity A with the entities in the other coreference chains of the second entity set, that is, those other than coreference chain C, to form negative examples;
and, if entity B is not present in any coreference chain of the second entity set, to form a positive example from entity A and entity B and to pair entity A with all entities in the coreference chains of the second entity set to form negative examples;
a test set construction module, configured to pair any entity in the third entity set with every entity in the first entity set, each pair forming one test example.
Preferably, the classification unit comprises:
a feature value module, configured to construct feature functions and apply them to each positive and negative example in the training set to obtain the feature values of the training set, and to each test example in the test set to obtain the feature values of the test set;
a classification module, configured to apply the classification method to the feature values of the training set to obtain a corresponding model, and to apply that model to the feature values of the test set to obtain the computation result.
Preferably, the second resolution unit comprises:
a discrimination module, configured, when the computation result is the similarity values between entity D of the third entity set in the current test examples and the entities in the first entity set, to determine the entity E in the first entity set that corresponds to the maximum similarity value, to determine whether the maximum similarity value is greater than a similarity threshold and, if it is, to determine whether entity E is in a coreference chain in the second entity set;
a pointing module, configured, when the discrimination module determines that entity E is in a coreference chain of the second entity set, to select one entity from that coreference chain as the entity that entity D points to, and, if entity E is not in any coreference chain of the second entity set, to take entity E as the entity that entity D points to.
Because the method and device of the present invention construct the training set and the test set from the reference corpus and the corpus to be processed and use them as the input of a classification method, they provide more accurate entity pointing for the entities in the corpus to be processed. This overcomes the problem that the coreference relations between words presented to the user are wrong because the text is non-standard and colloquial, and achieves accurate pointing that is easy for the user to recognize.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 shows the flowchart of embodiment one of the present invention;
Fig. 2 shows a schematic diagram of establishing coreference chains according to the present invention;
Fig. 3 shows the flowchart of establishing coreference relations for the entities in the reference corpus;
Fig. 4 shows a schematic diagram of the method embodiment of the present invention;
Fig. 5 shows the structural diagram of the device embodiment of the present invention.
Embodiments
To better resolve the entities in character data, the present invention constructs sets from the entities in the reference corpus and in the corpus to be processed, respectively, and performs computation and identification on them to obtain better entity resolution. Embodiments of the present invention are described in detail below with reference to the drawings; they comprise method embodiments and a device embodiment.
Referring to Fig. 1, which is the flowchart of method embodiment one, the flow of the method for resolving entities in character data comprises:
S11: obtain a reference corpus and a corpus to be processed from the character data;
S12: obtain a first entity set from the reference corpus, and identify the coreference relations in the first entity set to obtain a second entity set;
S13: obtain a third entity set from the corpus to be processed, and construct a training set and a test set from the third entity set and the first entity set;
S14: apply a classification method to the training set and the test set;
S15: identify the coreference relations between the third entity set and the second entity set according to the computation result.
For ease of distinction, the first entity set referred to in the present invention consists of the entities identified in the reference corpus, the second entity set consists of the entities of the first entity set with coreference relations established between them, and the third entity set consists of the entities identified in the corpus to be processed.
Embodiment one sets out the flow of the method of the present invention. Embodiment two below illustrates the entity resolution process of the present invention using news character data as an example. In news character data, the coreference relations of the entities in the news comments usually need to be resolved, but because the text of the comments is non-standard and colloquial, errors or omissions easily occur when the coreference relations between words are presented to the user. A training set and a test set therefore need to be constructed first. The news data can first be separated: the news body and the news comments are obtained according to markers such as tags, the news body is used as the reference corpus, and the news comments are used as the corpus to be processed.
The first entity set and the third entity set are first obtained from the news body and the news comments, respectively, and entity resolution is performed on the first entity set of the news body to obtain the second entity set, which carries coreference relations.
The process of performing entity resolution on the first entity set of the news body comprises:
1) Obtain the relevant information about the entities in the news body, mainly the entity content, the entity type, the entity length, the position of occurrence, and similar information.
2) Compute the similarity between entities. Similarity can be measured in various ways, either supervised or unsupervised.
3) Judge entities whose similarity value is greater than a certain threshold to have a coreference relation. The features considered between entities mainly cover lexical features, grammatical features, distance features, and the like. The similarity between entities is computed as the simple sum of the individual feature functions. When the similarity between two entities is greater than a preset threshold, the two entities are considered to have a coreference relation.
The features used in resolving the entities of the news body are shown in Table 1 below:
Table 1
Feature name                Feature description
String similarity           String similarity between entity A and entity B, computed with edit distance
Type match                  Whether the types of entity A and entity B match
Prefix/suffix information   Whether entity A and entity B share the same prefix or suffix
Number agreement            Whether entity A and entity B agree in number (singular or plural)
Distance feature            Number of sentences between entity A and entity B
Alias feature               Whether one of entity A and entity B is an alias of the other
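The summed-feature similarity described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Mention` record, the three-plus-one features chosen from Table 1, the 0.1 distance penalty, and the threshold of 2.0 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for one entity mention in the news body."""
    text: str
    etype: str      # entity type, e.g. "ORG" or "PER"
    sentence: int   # index of the sentence containing the mention


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def string_similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical text."""
    longest = max(len(a.text), len(b.text))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a.text, b.text) / longest


def type_match(a, b):
    return 1.0 if a.etype == b.etype else 0.0


def shared_affix(a, b):
    """Simplified prefix/suffix feature: same first or same last character."""
    same = a.text[:1] == b.text[:1] or a.text[-1:] == b.text[-1:]
    return 1.0 if a.text and b.text and same else 0.0


def sentence_distance(a, b):
    """Assumed penalty of 0.1 per sentence separating the two mentions."""
    return -0.1 * abs(a.sentence - b.sentence)


FEATURES = (string_similarity, type_match, shared_affix, sentence_distance)


def similarity(a, b):
    """Simple sum of the feature functions, as the description states."""
    return sum(f(a, b) for f in FEATURES)


def coreferent(a, b, threshold=2.0):
    """Two mentions corefer when the summed similarity exceeds the threshold."""
    return similarity(a, b) > threshold
```

The number-agreement and alias features of Table 1 are omitted here because they need linguistic resources that the description does not specify.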
All entities in the news body that point to the same thing are merged into one coreference chain, as shown in Fig. 2. Because many entities have coreference relations, multiple coreference chains can be formed.
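One way to merge pairwise coreference decisions into chains is a union-find structure. The description does not say how the merge is performed, so the sketch below is only an assumed approach; `build_chains` and its arguments are hypothetical names.

```python
def build_chains(entities, coreferent_pairs):
    """Group entities into coreference chains from pairwise decisions.

    entities: list of entity strings from the news body (first entity set).
    coreferent_pairs: iterable of (a, b) pairs judged to be coreferent.
    Returns only chains with at least two members; entities that corefer
    with nothing stay outside every chain.
    """
    parent = {e: e for e in entities}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coreferent_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for e in entities:
        groups.setdefault(find(e), []).append(e)
    return [chain for chain in groups.values() if len(chain) > 1]
```

For example, with the entities "Mobile", "China Mobile" and "Unicom" and the single pair ("Mobile", "China Mobile"), the result is one chain containing the two mobile mentions, while "Unicom" remains unchained.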
Referring to Fig. 3, when coreference chains are used to determine what an entity points to, the link selection strategy of the embodiment of the present invention mainly takes the following steps:
S31: for each entity A in a comment, use the output of the support vector machine to select, among the entities in the news body, the entity B with the maximum similarity as the candidate entity;
S32: determine whether the similarity is greater than a threshold, the threshold being set to zero;
S33: if the similarity value is less than zero, entity A does not point to any entity in the news body;
S34: if the similarity value is greater than zero, use the result of entity resolution in the news body to check whether entity B appears in a coreference chain of the news body;
S35: if B does not appear in any coreference chain of the news body, entity B is selected directly as the entity that entity A points to;
S36: if B is present in a coreference chain of the news body, the longest entity C in that coreference chain is selected as the entity that entity A points to.
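Steps S31 to S36 can be sketched as a single function. This is an illustration under assumptions: `score` stands in for the support vector machine output, all names are hypothetical, and a score exactly equal to the threshold is treated as "no link" because the description does not cover that case.

```python
def resolve(comment_entity, body_entities, chains, score, threshold=0.0):
    """Return the body entity that comment_entity points to, or None.

    body_entities: entities found in the news body (first entity set).
    chains: list of coreference chains (lists of body entities).
    score: callable (comment_entity, body_entity) -> float, a stand-in
           for the classifier output.
    """
    if not body_entities:
        return None
    # S31: candidate B is the body entity with the maximum similarity.
    best = max(body_entities, key=lambda b: score(comment_entity, b))
    # S32/S33: below the threshold, the comment entity points to nothing.
    if score(comment_entity, best) <= threshold:
        return None
    # S34/S36: if B is in a chain, return the longest entity of that chain.
    for chain in chains:
        if best in chain:
            return max(chain, key=len)
    # S35: otherwise B itself is the target.
    return best
```

With a chain ("Mobile", "China Mobile") and a comment entity whose best match is "Mobile", the function returns "China Mobile", which matches the example given later in the description.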
The coreference chains of the second entity set, which is built from the first entity set, make it easier to establish more accurate coreference relations for the third entity set.
The third entity set is obtained from the news comments, and the training set and the test set are constructed from the third entity set and the first entity set as follows:
When building the training set, for any entity A in the third entity set, if A has a coreference relation with some entity B in the first entity set, then, according to the coreference chains output by resolving the first entity set, check whether entity B is present in a coreference chain of the second entity set;
if B is present in a coreference chain C of the second entity set, entity A is paired with each entity in coreference chain C to form positive examples, and entity A is paired with the entities in the other coreference chains, that is, those other than coreference chain C, to form negative examples.
If B is not present in any coreference chain of the second entity set, entity A and entity B form a positive example, and entity A is paired with all entities in the coreference chains of the second entity set to form negative examples.
When building the training set, for any entity A in the third entity set, if no entity B in the first entity set has a coreference relation with A, then A is paired with each entity in the first entity set to form negative examples.
When building the test set, for any entity A in the third entity set, A is paired with every entity that appears in the first entity set, each pair forming one test example.
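The construction rules above can be sketched as follows. The `gold_links` mapping (the annotated coreference between a comment entity and a body entity) and the `(a, e, label)` tuple format are assumptions made for illustration; the description does not specify how the annotation is stored.

```python
def build_training_set(third_set, first_set, chains, gold_links):
    """Build labelled pairs (comment_entity, body_entity, label).

    gold_links: hypothetical dict mapping a comment entity A to the body
                entity B it is annotated as coreferent with; A is absent
                from the dict when it corefers with nothing.
    label is 1 for a positive example and 0 for a negative example.
    """
    chained = [e for chain in chains for e in chain]
    examples = []
    for a in third_set:
        b = gold_links.get(a)
        if b is None:
            # No coreferent body entity: every pair is a negative example.
            examples += [(a, e, 0) for e in first_set]
            continue
        chain_c = next((c for c in chains if b in c), None)
        if chain_c is not None:
            # B is in chain C: positives with C, negatives with other chains.
            examples += [(a, e, 1) for e in chain_c]
            examples += [(a, e, 0) for e in chained if e not in chain_c]
        else:
            # B is in no chain: one positive, negatives with all chained entities.
            examples.append((a, b, 1))
            examples += [(a, e, 0) for e in chained]
    return examples


def build_test_set(third_set, first_set):
    """Every comment entity paired with every body entity."""
    return [(a, e) for a in third_set for e in first_set]
```

Note that under these rules a positive comment entity is never paired negatively with unchained body entities; only the "no coreference" case uses the whole first entity set.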
After the training set and the test set have been built, a classification method is applied to them.
The classification method can be a decision tree, a Bayesian classifier, a support vector machine, a maximum entropy model, or a similar method. In the present embodiment the support vector machine is selected as the classification model.
When the support vector machine is selected as the classification model, the feature values of each example must be constructed first. For news character data, some further features are added to those of Table 1, as shown in Table 2, where entity A and entity B are the two entities of one training set or test set example:
Table 2
[Table 2 is provided as an image in the original publication (Figure G2009102434748D00121) and is not reproduced in this text.]
The basic features mainly comprise lexical features, grammatical features, semantic features, and so on. The features added for the characteristics of Chinese news comments mainly cover the following aspects:
Entity frequency in the news body: the more often an entity appears in the news body, the more important that entity is, the more likely the corresponding news comments are to revolve around it, and the more likely the coreference phenomena in the comments are to involve it. In the experiments, the number of occurrences of an entity is divided into three cases, namely 1 occurrence, 2 to 3 occurrences, and more than 3 occurrences, and a different weight is assigned to each case.
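The three-way bucketing of entity frequency might look like the sketch below. The bucket names and the weights 0.2, 0.5 and 1.0 are purely illustrative assumptions; the description only says that the three cases receive different weights.

```python
from collections import Counter

# Hypothetical weights; the description does not give the actual values.
HYPOTHETICAL_WEIGHTS = {"once": 0.2, "two_to_three": 0.5, "more_than_three": 1.0}


def frequency_bucket(count):
    """Map an occurrence count to one of the three cases in the description."""
    if count <= 1:
        return "once"
    if count <= 3:
        return "two_to_three"
    return "more_than_three"


def frequency_feature(entity, body_mentions):
    """Weight of a body entity according to how often it is mentioned."""
    count = Counter(body_mentions)[entity]
    return HYPOTHETICAL_WEIGHTS[frequency_bucket(count)]
```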
Pinyin similarity: because news comments are written by users, they may contain non-standard text, the most common kind being wrongly written characters with the same or similar pinyin. For example, "Huiyuan" or "Xie Yalong" may be written with a wrong character that is pronounced the same as the correct one. The traditional string edit distance is not accurate enough for computing the similarity of two such entities. In the experiments, a pinyin table is used to convert all entities into their pinyin, and the similarity of two entities is judged by computing the similarity of their pinyin. For example, the correct and the miswritten form of "Huiyuan" have a string similarity of 50% but a pinyin similarity of 100%.
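The 50% versus 100% figures can be reproduced with a small sketch. It assumes that the example pair in the original Chinese text is 汇源 and its homophone 汇圆 (the machine translation does not preserve the characters), and it uses a three-entry toy pinyin table in place of a real pinyin dictionary.

```python
# Toy pinyin table covering only the characters of the example (assumption).
TOY_PINYIN = {"汇": "hui", "源": "yuan", "圆": "yuan"}


def to_pinyin(text):
    """Replace each known character by its pinyin; keep unknown ones as-is."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance normalised to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def pinyin_similarity(a, b):
    """Similarity of the pinyin transcriptions rather than the characters."""
    return similarity(to_pinyin(a), to_pinyin(b))
```

Here `similarity("汇源", "汇圆")` is 0.5 because one of two characters differs, while `pinyin_similarity("汇源", "汇圆")` is 1.0 because both strings transcribe to "huiyuan".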
English matching degree: another common phenomenon in comments is the use of simple, easily understood English in place of entities in the news body. For example, "ebay" is used in place of "Eachnet" in the news body, and "JAY" is used in place of "Zhou Jielun" in the news body. This kind of English is sometimes not very standard, and a traditional dictionary may not reveal which concept it is meant to express. In the experiments, an online web dictionary is used to look up this kind of English: when an entity is not Chinese, the word is looked up in this dictionary and the returned explanation is searched for the other entity. If the other entity appears in the explanation, the entity pair is more likely to have a coreference relation.
In the support vector machine computation, each example in the training set is first represented in the form of feature values. The feature values of each example, or their sum or the result of another mathematical operation on them, are used as the input for training the support vector machine, which yields a corresponding model. The model obtained is then used to predict the examples in the test set, and the support vector machine outputs the computation result.
The computation result identifies the coreference relation between the entity currently being computed in the test set and the entities in the first entity set. To obtain more accurate coreference relations, the coreference chains in the second entity set can also be used when determining the coreference relation. The specific identification process is as follows:
After the support vector machine has been trained on the training set, the resulting classification model predicts the entity pairs in the test set. After prediction, the coreference relations are established in the following manner:
1) For each entity A in the third entity set, use the output of the classifier to select the entity B in the first entity set with the maximum similarity value as the candidate entity.
2) Judge the similarity value: if it does not reach the similarity threshold for a coreference relation, entity A does not point to any entity in the first entity set.
3) If the similarity value is greater than the similarity threshold for a coreference relation, use the result of the process of performing entity resolution on the first entity set of the news body, described above, to check whether entity B appears in a coreference chain in the second entity set.
If B does not appear in any coreference chain in the second entity set, entity B is selected directly as the entity that entity A points to;
If B is present in a coreference chain of the second entity set, the most suitable entity C in that coreference chain is selected as the entity that entity A points to. For example, the entity "Mobile" appearing in a comment has maximum similarity with the identical entity "Mobile" in the first entity set; checking shows that "Mobile" appears in a coreference chain ("Mobile", "China Mobile") in the second entity set, so, to make it easier for the user to understand, "China Mobile" is selected as the entity that "Mobile" points to. Common strategies for selecting an entity from a coreference chain also include: selecting the longest entity in the coreference chain, selecting the entity in the coreference chain that occurs most often in the reference corpus (that is, the news body), or selecting the entity in the coreference chain that occurs first in the reference corpus.
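The three selection strategies named above can be sketched as interchangeable functions. The function names are hypothetical, and `body_mentions` is assumed to be the list of entity mentions of the news body in order of occurrence, containing every chain member at least once.

```python
from collections import Counter


def pick_longest(chain, body_mentions):
    """Select the longest entity in the chain."""
    return max(chain, key=len)


def pick_most_frequent(chain, body_mentions):
    """Select the chain entity that occurs most often in the news body."""
    counts = Counter(body_mentions)
    return max(chain, key=lambda e: counts[e])


def pick_first(chain, body_mentions):
    """Select the chain entity whose first occurrence in the body is earliest."""
    return min(chain, key=lambda e: body_mentions.index(e))
```

For the chain ("Mobile", "China Mobile"), `pick_longest` returns "China Mobile", which is the choice made in the example of the description; the other two strategies depend on how the news body mentions the two forms.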
The above identification process is also shown in Fig. 4. Finally, the coreference relations established between the third entity set and the first entity set are displayed to the user in the form of labels, colored marks, or the like.
The method embodiments of the present invention have been described in detail above; the device embodiment of the present invention is given below. Referring to Fig. 5, the device for resolving entities in character data according to the embodiment of the present invention comprises:
a selection unit 51, configured to obtain a reference corpus and a corpus to be processed from the character data;
a first resolution unit 52, configured to identify a first entity set in the reference corpus and establish coreference relations between the entities in the first entity set to obtain a second entity set, or to identify a third entity set in the corpus to be processed;
a construction unit 53, configured to construct a training set and a test set from the third entity set and the first entity set;
a classification unit 54, configured to perform classification on the training set and the test set to obtain a computation result;
a second resolution unit 55, configured to identify the coreference relations between the third entity set and the second entity set according to the computation result.
Preferably, described first clears up second entity sets that unit 52 obtains and is: connect with chain type between the entity and constitute the entity sets that refers to chain altogether.
Preferably, described tectonic element 53 comprises:
The training set constructing module, be used for any entity A to the 3rd entity sets, entity B has co-reference in the entity A and first entity sets if identify, and entity B is present among the common finger chain C in second entity sets, entity A and refer to that altogether each entity among the chain C all is configured to positive example then becomes counter-example with entity structure in other except that referring to chain C altogether in second entity sets refers to chain altogether; If entity B is not present in arbitrary of second entity sets and refers to that altogether in the chain, then entity A and entity B are configured to positive example so, become counter-example with all entity structures in referring to chain in second entity sets altogether;
The test set constructing module is used for all entity structures in any entity of the 3rd entity sets and first entity sets are become each test case.
Preferably, the classification unit 54 comprises:
A feature value module, configured to construct feature functions, apply them to each positive example and negative example in the training set to obtain the feature function values of the training set, and apply them to each test case in the test set to obtain the feature function values of the test set;
A classification operation module, configured to run the classification method on the feature function values of the training set to obtain a corresponding operational model, and to apply the operational model to the feature function values of the test set to obtain the operation result.
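A minimal Python sketch of these two modules follows. The three string features and the plain gradient-descent logistic regression (a binary maximum entropy model) are assumptions chosen so that the example needs only the standard library; they are not the feature functions or the classifier configuration of the embodiment. Entity strings are assumed to be non-empty.

```python
import math


def features(a, b):
    """Hypothetical feature functions for an entity pair (a, b)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return [
        1.0,                                              # bias term
        1.0 if a == b else 0.0,                           # exact string match
        1.0 if (a in b or b in a) else 0.0,               # one string contains the other
        len(set_a & set_b) / len(union) if union else 0.0,  # character overlap (Jaccard)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=500, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent."""
    xs = [features(a, b) for a, b, _ in examples]
    ys = [label for _, _, label in examples]
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def score(w, a, b):
    """Similarity value in (0, 1) that the model assigns to the pair (a, b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(a, b))))


# Illustrative training examples (assumed): (entity A, candidate, label).
training_set = [
    ("Mobile Co.", "Mobile", 1), ("Mobile Co.", "China Mobile", 1),
    ("Mobile Co.", "Unicom", 0), ("Mobile Co.", "China Unicom", 0),
    ("Telecom", "Telecom", 1), ("Telecom", "Mobile", 0),
    ("Telecom", "China Mobile", 0), ("Telecom", "Unicom", 0),
    ("Telecom", "China Unicom", 0),
]
model = train(training_set)
candidates = ["Mobile", "China Mobile", "Unicom", "China Unicom", "Telecom"]
similarities = {e: score(model, "Mobile", e) for e in candidates}
```

The `similarities` dictionary plays the role of the operation result for one entity of the third entity set: one similarity value per entity of the first entity set.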
Preferably, the second resolution unit 55 comprises:
A discrimination module, configured to: when the operation result is the set of similarity values between entity D of the third entity set in the current test case and the entities in the first entity set, determine the entity E in the first entity set that corresponds to the maximum similarity value; determine whether the maximum similarity value is greater than a similarity threshold; and, if it is, determine whether entity E is in a coreference chain in the second entity set;
A pointing module, configured to: when the discrimination module determines that entity E is in a coreference chain in the second entity set, select one entity from that coreference chain as the entity to which entity D points; and, if entity E is not in any coreference chain in the second entity set, take entity E as the entity to which entity D points.
For example, the entity "Mobile" occurring in a comment has its maximum similarity with the identical entity "Mobile" in the first entity set. On checking, "Mobile" is found in a coreference chain ("Mobile", "China Mobile") in the second entity set. To make the result easier for the user to understand, "China Mobile" is selected as the entity to which "Mobile" points.
Common strategies for selecting an entity from a coreference chain also include: selecting the longest entity in the chain; selecting the entity in the chain that occurs most often in the benchmark corpus (that is, the body text); or selecting the entity in the chain that occurs first in the benchmark corpus.
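The discrimination and pointing steps, together with the three selection strategies, can be sketched in Python as follows. The function names, the threshold value of 0.5, the sample similarity values, and the choice to return `None` when the maximum similarity does not exceed the threshold are all assumptions for illustration; the passage above does not fix a threshold or say what happens below it.

```python
from collections import Counter


def pick_from_chain(chain, corpus_mentions, strategy="longest"):
    """Select the entity of a coreference chain that the mention should point to.

    `corpus_mentions` (assumed input) lists entity mentions of the benchmark
    corpus in order of appearance.
    """
    if strategy == "longest":
        return max(chain, key=len)
    if strategy == "most_frequent":
        counts = Counter(corpus_mentions)
        return max(chain, key=lambda e: counts[e])
    if strategy == "first_seen":
        last = len(corpus_mentions)
        return min(
            chain,
            key=lambda e: corpus_mentions.index(e) if e in corpus_mentions else last,
        )
    raise ValueError(f"unknown strategy: {strategy}")


def resolve(similarities, chains, corpus_mentions, threshold=0.5, strategy="longest"):
    """Return the entity that entity D points to, or None if nothing qualifies.

    `similarities` maps each entity of the first entity set to its similarity
    value with entity D.
    """
    entity_e, best = max(similarities.items(), key=lambda kv: kv[1])
    if best <= threshold:
        return None  # assumed behaviour: leave D unresolved
    for chain in chains:
        if entity_e in chain:
            return pick_from_chain(chain, corpus_mentions, strategy)
    return entity_e  # E is in no chain, so D points to E itself


# Illustrative data (assumed).
chains = [["Mobile", "China Mobile"], ["Unicom", "China Unicom"]]
mentions = ["China Mobile", "Mobile", "Mobile", "Telecom"]
sims = {"Mobile": 0.93, "China Mobile": 0.71, "China Unicom": 0.12, "Telecom": 0.20}

target = resolve(sims, chains, mentions)
```

With the default "longest" strategy the sketch reproduces the example in the text: the mention "Mobile" points to "China Mobile". Passing `strategy="most_frequent"` or `strategy="first_seen"` switches to the other two strategies listed above.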
The device for resolving entities in character data according to this embodiment of the invention can use the method of the foregoing embodiments to resolve the entities in character data, so the processing procedure of the device is not described again here.
Because the method and device of the present invention use the benchmark corpus and the corpus to be processed to construct a training set and a test set, which serve as the input to the classification method, they give more accurate entity pointing for the entities in the corpus to be processed. This overcomes the problem that, because the text is non-standard and colloquial, erroneous coreference relations between words are presented to the user, and it thereby achieves accurate pointing that is easy for users to recognize.
The present invention has been applied to several Chinese comment-mining subtasks, such as sentiment polarity analysis of news comments and sentiment element extraction, and has achieved good results with excellent performance metrics, fulfilling the purpose of the invention. The present invention has good practicability and good prospects for wider application.
Although specific embodiments of the invention and the accompanying drawings are disclosed for illustrative purposes, to help readers understand the content of the invention and implement it accordingly, those skilled in the art will appreciate that various substitutions, changes, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what is disclosed in the preferred embodiments and the drawings; the scope of protection claimed is the scope defined by the claims.
Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. Thus the invention is not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the invention and are not intended to limit it; for those skilled in the art, the invention may have various changes and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (11)

1. A method for resolving entities in character data, characterized by comprising:
Obtaining a benchmark corpus and a corpus to be processed from the character data;
Obtaining a first entity set from the benchmark corpus, and establishing coreference relations among the entities in the first entity set to obtain a second entity set;
Obtaining a third entity set from the corpus to be processed, and constructing a training set and a test set from the third entity set and the first entity set;
Performing an operation on the training set and the test set using a classification method;
Identifying, according to the operation result, the coreference relations between the third entity set and the second entity set.
2. The method according to claim 1, characterized in that the entities having a coreference relation in the second entity set are linked in a chain to form a coreference chain.
3. The method according to claim 2, characterized in that:
The process of constructing the training set comprises:
For any entity A in the third entity set, if entity A is identified as having a coreference relation with an entity B in the first entity set, and entity B is in a coreference chain C in the second entity set, then entity A is paired with each entity in chain C to form positive examples, and entity A is paired with the entities in the other coreference chains of the second entity set (all chains except C) to form negative examples;
If entity B is not in any coreference chain of the second entity set, then entity A and entity B form a positive example, and entity A is paired with all entities in the coreference chains of the second entity set to form negative examples;
The process of constructing the test set comprises:
Pairing each entity in the third entity set with every entity in the first entity set to form the test cases.
4. The method according to claim 1, characterized in that the classification method is a decision tree, a Bayesian algorithm, a support vector machine, or a maximum entropy model.
5. The method according to claim 3 or 4, characterized in that the process of performing the operation on the training set and the test set comprises:
Constructing feature functions, applying them to each positive example and negative example in the training set to obtain the feature function values of the training set, and applying them to each test case in the test set to obtain the feature function values of the test set;
Running the classification method on the feature function values of the training set to obtain a corresponding operational model, and applying the operational model to the feature function values of the test set to obtain the operation result.
6. The method according to claim 5, characterized in that the process of identifying according to the operation result comprises:
The operation result is the similarity value between entity D of the third entity set in the current test case and an entity E in the first entity set;
If the similarity value is greater than a similarity threshold, determining whether entity E is in a coreference chain in the second entity set; if it is not, taking entity E as the entity to which entity D points; if it is, selecting one entity from the coreference chain as the entity to which entity D points.
7. A device for resolving entities in character data, characterized by comprising:
A selection unit, configured to obtain a benchmark corpus and a corpus to be processed from the character data;
A first resolution unit, configured to identify a first entity set in the benchmark corpus and to establish coreference relations among the entities in the first entity set, obtaining a second entity set; and to identify a third entity set in the corpus to be processed;
A construction unit, configured to construct a training set and a test set from the third entity set and the first entity set;
A classification unit, configured to perform a classification operation on the training set and the test set and produce an operation result;
A second resolution unit, configured to identify, according to the operation result, the coreference relations between the third entity set and the second entity set.
8. The device according to claim 7, characterized in that the second entity set obtained by the first resolution unit is an entity set in which coreferent entities are linked in a chain to form coreference chains.
9. The device according to claim 8, characterized in that the construction unit comprises:
A training set construction module, configured to process any entity A in the third entity set as follows: if entity A is identified as having a coreference relation with an entity B in the first entity set, and entity B is in a coreference chain C in the second entity set, then entity A is paired with each entity in chain C to form positive examples, and entity A is paired with the entities in the other coreference chains of the second entity set (all chains except C) to form negative examples; if entity B is not in any coreference chain of the second entity set, then entity A and entity B form a positive example, and entity A is paired with all entities in the coreference chains of the second entity set to form negative examples;
A test set construction module, configured to pair each entity in the third entity set with every entity in the first entity set to form the test cases.
10. The device according to claim 9, characterized in that the classification unit comprises:
A feature value module, configured to construct feature functions, apply them to each positive example and negative example in the training set to obtain the feature function values of the training set, and apply them to each test case in the test set to obtain the feature function values of the test set;
A classification operation module, configured to run the classification method on the feature function values of the training set to obtain a corresponding operational model, and to apply the operational model to the feature function values of the test set to obtain the operation result.
11. The device according to claim 10, characterized in that the second resolution unit comprises:
A discrimination module, configured to: when the operation result is the set of similarity values between entity D of the third entity set in the current test case and the entities in the first entity set, determine the entity E in the first entity set that corresponds to the maximum similarity value; determine whether the maximum similarity value is greater than a similarity threshold; and, if it is, determine whether entity E is in a coreference chain in the second entity set;
A pointing module, configured to: when the discrimination module determines that entity E is in a coreference chain in the second entity set, select one entity from that coreference chain as the entity to which entity D points; and, if entity E is not in any coreference chain in the second entity set, take entity E as the entity to which entity D points.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102434748A CN102110087A (en) 2009-12-24 2009-12-24 Method and device for resolving entities in character data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102434748A CN102110087A (en) 2009-12-24 2009-12-24 Method and device for resolving entities in character data

Publications (1)

Publication Number Publication Date
CN102110087A true CN102110087A (en) 2011-06-29

Family

ID=44174250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102434748A Pending CN102110087A (en) 2009-12-24 2009-12-24 Method and device for resolving entities in character data

Country Status (1)

Country Link
CN (1) CN102110087A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013027129A1 (en) * 2011-08-24 2013-02-28 International Business Machines Corporation Entity resolution based on relationships to common entity
US8965848B2 (en) 2011-08-24 2015-02-24 International Business Machines Corporation Entity resolution based on relationships to a common entity
CN103106211A (en) * 2011-11-11 2013-05-15 中国移动通信集团广东有限公司 Emotion recognition method and emotion recognition device for customer consultation texts
CN103106211B (en) * 2011-11-11 2017-05-03 中国移动通信集团广东有限公司 Emotion recognition method and emotion recognition device for customer consultation texts
WO2017197947A1 (en) * 2016-05-20 2017-11-23 腾讯科技(深圳)有限公司 Antecedent determination method and apparatus
US10810372B2 (en) 2016-05-20 2020-10-20 Tencent Technology (Shenzhen) Company Limited Antecedent determining method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110629