CN105677632A - Method and device for extracting templates for extracting entities - Google Patents

Method and device for extracting templates for extracting entities

Info

Publication number
CN105677632A
Authority
CN
China
Prior art keywords
template
reference table
entity
corpus
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410663066.9A
Other languages
Chinese (zh)
Inventor
方瑞玉
缪庆亮
张波
房璐
孟遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201410663066.9A
Publication of CN105677632A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for extracting templates used to extract entities. The method comprises the steps of: creating a reference table from a training corpus; extracting candidate templates from the training corpus based on the reference table; using a verification corpus to verify the validity of the candidate templates; adjusting the relevant evaluation values in the reference table according to the verification result; and determining the candidate templates obtained when a second preset condition is satisfied as the extracted templates. The reference table comprises combinations of ordinary characters and metacharacters that satisfy a first preset condition, together with the corresponding evaluation values.

Description

Method and apparatus for extracting templates used to extract entities
Technical field
The present invention relates generally to the field of information processing. Specifically, it relates to a method and apparatus that can automatically and effectively extract templates used to extract entities.
Background
In recent years, with the development of Internet technology and the spread of personal smart devices, the volume of online text content has grown geometrically. To process large amounts of text content effectively, information must be extracted automatically.
Accordingly, information extraction technology has advanced considerably. In general, information extraction techniques can be divided into feature-based techniques and rule-based techniques. Feature-based techniques depend on a large number of complex features; feature selection, training of the corresponding model, and feature computation at application time all require substantial work and computing resources. Rule-based techniques introduce templates and thereby avoid the work associated with a large number of complex features. Templates can be obtained through template learning.
Traditional template learning methods generate templates individually from independent instances and then evaluate them. The evaluation criterion is the ability of a template to extract more correct entities and fewer incorrect entities. However, traditional template learning methods face a trade-off between precision and recall in entity extraction, because the degree of generalization versus specificity of the generated templates is hard to control. An overly specific template raises precision but lowers recall, whereas an overly general template raises recall but lowers precision.
It is therefore desirable to have a method and apparatus for extracting templates used to extract entities that can extract templates automatically and effectively, such that the extracted templates in turn extract entities well.
Summary of the invention
A brief summary of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor is it intended to limit the scope of the invention. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description that follows.
In view of the above problems of the prior art, an object of the present invention is to propose a method and apparatus that can automatically and effectively extract templates used to extract entities.
To achieve this object, according to one aspect of the present invention, there is provided a method of extracting templates used to extract entities. The template extraction method includes: creating a reference table from a training corpus, the reference table including combinations of ordinary characters and metacharacters that satisfy a first predetermined condition, together with the corresponding evaluation values; extracting candidate templates from the training corpus according to the reference table; verifying the validity of the candidate templates using a verification corpus; adjusting the relevant evaluation values in the reference table according to the verification result; and determining, as the extracted templates, the candidate templates obtained when a second predetermined condition is satisfied.
According to another aspect of the present invention, there is provided an apparatus for extracting templates used to extract entities. The template extraction apparatus includes: a reference table creation device configured to create a reference table from a training corpus, the reference table including combinations of ordinary characters and metacharacters that satisfy a first predetermined condition, together with the corresponding evaluation values; a candidate template extraction device configured to extract candidate templates from the training corpus according to the reference table; a validity verification device configured to verify the validity of the candidate templates using a verification corpus; an evaluation value adjustment device configured to adjust the relevant evaluation values in the reference table according to the verification result; and a control device configured to determine, as the extracted templates, the candidate templates obtained when a second predetermined condition is satisfied.
In addition, according to a further aspect of the present invention, a storage medium is provided. The storage medium includes machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the above method according to the present invention.
Furthermore, according to yet another aspect of the present invention, a program product is provided. The program product includes machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the above method according to the present invention.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will be more readily understood from the following description of embodiments of the invention with reference to the accompanying drawings. The components in the drawings are intended only to illustrate the principles of the invention. In the drawings, the same or similar technical features or components are denoted by the same or similar reference numerals. In the drawings:
Fig. 1 is a flow chart of a method of extracting templates used to extract entities according to an embodiment of the present invention;
Fig. 2 is a flow chart of a method of extracting candidate templates according to an embodiment of the present invention;
Fig. 3 is a flow chart of an entity extraction method according to a first embodiment of the present invention;
Fig. 4 is a flow chart of an entity extraction method according to a second embodiment of the present invention;
Fig. 5 is a flow chart of an entity extraction method according to a third embodiment of the present invention;
Fig. 6 is a block diagram of an apparatus for extracting templates used to extract entities according to an embodiment of the present invention; and
Fig. 7 is a schematic block diagram of a computer that can be used to implement the method and apparatus according to embodiments of the present invention.
Detailed description of the invention
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that developing any such actual implementation requires many implementation-specific decisions in order to achieve the developer's specific goals, such as compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art who have the benefit of this disclosure.
It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the apparatus structures and/or processing steps closely related to the solution of the present invention, and other details of little relevance are omitted. It should further be noted that elements and features described in one drawing or embodiment of the present invention may be combined with elements and features shown in one or more other drawings or embodiments.
The basic idea of the present invention is to create a reference table that assesses how strongly contextual information influences the correct extraction of a target entity, and to use the reference table to help extract templates that are both more accurate and more general. In addition, the reference table is continually adjusted and optimized according to the evaluation of the templates produced, so that the best reference table and templates are finally obtained when a balance is reached.
The flow of a method of extracting templates used to extract entities according to an embodiment of the present invention is described below with reference to Fig. 1.
Fig. 1 is a flow chart of the method of extracting templates used to extract entities according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps: creating a reference table from a training corpus, the reference table including combinations of ordinary characters and metacharacters that satisfy a predetermined condition, together with the corresponding evaluation values (step S1); extracting candidate templates from the training corpus according to the reference table (step S2); verifying the validity of the candidate templates using a verification corpus (step S3); adjusting the relevant evaluation values in the reference table according to the verification result (step S4); repeating the above extraction, verification, and adjustment steps based on the adjusted reference table until a predetermined stop condition is satisfied (step S5); and determining, as the extracted templates, the candidate templates obtained when the predetermined stop condition is satisfied (step S6).
In step S1, a reference table is created from the training corpus. The reference table includes combinations of ordinary characters and metacharacters that satisfy a predetermined condition, together with the corresponding evaluation values.
As described above, the present invention introduces a reference table to help extract templates. The reference table is therefore created first.
The reference table is created on the basis of the training corpus.
Note that the training corpus and the verification corpus differ only in their roles. They are essentially the same: both are documents in which the entities that should be extracted have been annotated. In other words, both the training corpus and the verification corpus are annotated corpora. Because the templates extracted by the present invention are used to extract entities from documents, what is annotated is the entities in the documents that should be extracted. In the following description, the type of entity to be extracted is an address. The invention is of course not limited to this; addresses serve only as an example.
Because the training corpus and the verification corpus differ only in their roles, an existing annotated corpus can be divided into a training corpus and a verification corpus as needed.
The training corpus is used to create the reference table and to extract candidate templates, whereas the verification corpus is used to verify the candidate templates.
For example, the training corpus and the verification corpus may include the following four sentences:
(1) 8 street address [GEO] 508 Young St., Dallas, TX 75202 [/GEO] the phone number 2149778222
(2) physical address at [GEO] 305 S. Congress Ave. Austin, Texas 78704 [/GEO] telephone number digital
(3) 9 mailing address [GEO] P.O. Box 1909, Seattle WA 98111-1909 [/GEO] main phone number 3445621
(4) mailing address is [GEO] P.O. Box 13975 3106 East NC Highway 54 Research Triangle Park, NC 27709 [/GEO] telephone is 9195490097800
Here, [GEO] and [/GEO] mark the address sequence to be extracted (the entity that should be extracted).
The reference table is a table containing many entries. Each entry has three elements: an ordinary character, a metacharacter, and an evaluation value.
An ordinary character is a character that appears in the corpus or document, such as an English character. An English word is made up of several English characters. For example, "8 street address" in corpus (1) above consists of ordinary characters, in which the characters "street" form one English word.
A metacharacter is a symbol that serves as the generalized representation of ordinary characters. For example, generalizing corpus (1) above may yield the outer template: digital street X1_l_0_1 [X] X1_r_0_1 phone number digital. Here [X] denotes the entity to be extracted, digital denotes a number, and X1_l_0_1 and X1_r_0_1 are metacharacters. A metacharacter encodes the number of words in the generalized part and the position of that part relative to the entity within the entity's context. Specifically, for the entity to be extracted (508 Young St., Dallas, TX 75202) with context (8 street address ... the phone number 2149778222), X1_l_0_1 indicates that the generalized part (address) consists of 1 word and is the first word to the left of the entity (l denotes the left side, and 0_1 denotes the span between position 0 and position 1). Similarly, X1_r_0_1 indicates that the generalized part (the) consists of 1 word and is the first word to the right of the entity (r denotes the right side, and 0_1 denotes the span between position 0 and position 1). The reference table entries corresponding to this outer template can therefore be "X1_l_0_1", "address", "1" and "X1_r_0_1", "the", "3", where "1" and "3" are example evaluation values.
In the reference table, the ordinary character and the metacharacter of an entry form a specific combination, and the entry also states an evaluation value for that combination, which indicates how likely the ordinary character (string) is to be generalized in that entry.
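The metacharacter naming scheme described above can be illustrated with a small parser. This is only a sketch: the regular expression, the `Metachar` type, and the function name `parse_metachar` are assumptions introduced here for illustration, inferred from the examples X1_l_0_1 and X1_r_0_1, and are not part of the patent text.

```python
import re
from typing import NamedTuple


class Metachar(NamedTuple):
    count: int  # number of words replaced by the metacharacter
    side: str   # "l" = left of the entity, "r" = right of the entity
    start: int  # start of the span, counted outward from the entity
    end: int    # end of the span, counted outward from the entity


# Assumed pattern: X<count>_<side>_<start>_<end>, e.g. X1_l_0_1
_META_RE = re.compile(r"^[Xx](\d+)_([lr])_(\d+)_(\d+)$")


def parse_metachar(token: str) -> Metachar:
    """Split a metacharacter token into its word count, side, and span."""
    m = _META_RE.match(token)
    if m is None:
        raise ValueError(f"not a metacharacter: {token!r}")
    return Metachar(int(m[1]), m[2], int(m[3]), int(m[4]))


print(parse_metachar("X1_l_0_1"))
```

Under this reading, X1_l_0_1 stands for one word immediately to the left of the entity, and a token such as X2_r_1_3 would stand for two words on the right, spanning positions 1 to 3.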
In addition, the templates to be extracted can be divided into outer templates and inner templates.
An outer template is a template for the context of the extraction target (the entity). An example is the template given above: "digital street X1_l_0_1 [X] X1_r_0_1 phone number digital".
An inner template is a template for the extraction target itself. For corpus (1), the inner template is extracted from "508 Young St., Dallas, TX 75202". An example is "digital X1 St. Dallas TX digital".
Accordingly, an ordinary character corresponding to an outer template represents the generalized part of the context of the extraction target, and a metacharacter corresponding to an outer template represents the number of words and the relative position of that ordinary character. An ordinary character corresponding to an inner template represents the generalized part of the extraction target itself, and a metacharacter corresponding to an inner template represents the number of words and the relative position of that ordinary character.
Based on the above introduction and definitions, the reference table can be created from the training corpus according to certain rules. For example, combinations of ordinary characters and metacharacters that satisfy a predetermined condition are extracted, and the evaluation value of each combination is initialized according to the number of times the ordinary character occurs in the training corpus.
The predetermined condition includes, for example, the size of the context window and the number of words in the ordinary character string.
Specifically, step S1 can be implemented as follows: for each sentence in the training corpus, extract ordinary characters and metacharacters according to the annotation of the entity to be extracted and the predetermined condition; then compute the corresponding evaluation values according to the number of times each ordinary character occurs in the training corpus.
To control the size of the reference table, the generalized character string can be limited to at most n words; that is, the generalized string contains between 0 and n words. Preferably, n is 3 or 4. Also to control the size of the reference table, ordinary characters that occur in the training corpus fewer times than a predetermined threshold can be excluded from the reference table. In other words, words that occur too rarely are not added to the reference table. The predetermined threshold is preferably 2.
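A minimal sketch of step S1 for outer templates follows. It assumes whitespace-tokenized sentences annotated with [GEO]...[/GEO], counts occurrences only inside the context window, and uses the raw count as the initial evaluation value. The function names, the `window`, `max_words`, and `min_count` parameters, and the dictionary layout are illustrative assumptions, not the patent's implementation.

```python
import re
from collections import Counter

ANNOT_RE = re.compile(r"^(.*?)\[GEO\](.*?)\[/GEO\](.*)$")


def split_annotated(sentence):
    """Return (left context, entity, right context) as word lists."""
    m = ANNOT_RE.match(sentence)
    if m is None:
        raise ValueError("sentence has no [GEO]...[/GEO] annotation")
    left, entity, right = (part.split() for part in m.groups())
    return left, entity, right


def context_combinations(left, right, window=3, max_words=3):
    """Yield (metacharacter, ordinary string) for each span in the window."""
    # Positions are counted outward from the entity on both sides.
    sides = {"l": left[::-1][:window], "r": right[:window]}
    for side, words in sides.items():
        for start in range(len(words)):
            last = min(start + max_words, len(words))
            for end in range(start + 1, last + 1):
                span = words[start:end]
                if side == "l":
                    span = span[::-1]  # restore reading order
                yield f"X{end - start}_{side}_{start}_{end}", " ".join(span)


def build_reference_table(sentences, window=3, max_words=3, min_count=2):
    """Map (metacharacter, ordinary string) to an initial evaluation value."""
    combos, counts = set(), Counter()
    for sentence in sentences:
        left, _, right = split_annotated(sentence)
        for meta, text in context_combinations(left, right, window, max_words):
            combos.add((meta, text))
            counts[text] += 1
    # Strings rarer than min_count are left out to keep the table small.
    return {c: counts[c[1]] for c in combos if counts[c[1]] >= min_count}


corpus = [
    "8 street address [GEO] 508 Young St., Dallas, TX 75202 [/GEO] "
    "the phone number 2149778222",
    "9 mailing address [GEO] P.O. Box 1909, Seattle WA 98111-1909 [/GEO] "
    "main phone number 3445621",
]
table = build_reference_table(corpus)
```

With these two sentences, only the strings that occur at least twice ("address", "phone", "number", and "phone number") survive the threshold.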
In step S2, candidate templates are extracted from the training corpus according to the reference table.
As described above, the reference table is created from the training corpus, so its entries correspond to all of the templates that can be extracted from the training corpus.
The candidate templates can be extracted using a beam search algorithm or the bottom-up CYK (Cocke-Younger-Kasami) algorithm.
Fig. 2 is a flow chart of a method of extracting candidate templates according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps: for each sentence in the training corpus, randomly selecting one or more words or phrases at a time and generalizing them to obtain preliminary templates (step S21); removing duplicate preliminary templates and merging preliminary templates that have an inclusion relation, to obtain alternative templates (step S22); computing the score of each alternative template from the evaluation values of the combinations of ordinary characters and metacharacters in the reference table that match the alternative template (step S23); and selecting a predetermined number of the highest-scoring alternative templates as the candidate templates (step S24).
In step S21, for each sentence in the training corpus, one or more words or phrases are randomly selected at a time and generalized to obtain preliminary templates.
For example, from the four corpus examples above, preliminary templates such as the following can be obtained (outer templates are used as the example).
Outer templates generated from corpus (1):
(11): digital street X1_l_0_1 [X] X1_r_0_1 phone number digital
(12): digital X1_l_1_2 address [X] the x1_r_1_2 number digital
(13): digital X1_l_1_2 address [X] the x2_r_1_3 digital
Outer templates generated from corpus (2):
(21): X1_l_2_3 address X1_l_0_1 [X] telephone X1_r_1_2 digital
(22): X1_l_2_3 X1_l_1_2 at [X] telephone X1_r_1_2 digital
Outer templates generated from corpus (3):
(31): digital X1_l_1_2 address [X] X1_r_0_1 phone x_1_r_1_2 digital
(32): digital mailing address [X] X1_r_0_1 X1_r_1_2 number digital
(33): digital mailing address [X] main X1_r_1_2 number digital
Outer templates generated from corpus (4):
(41): X1_l_2_3 address X1_l_0_1 [X] telephone X1_r_1_2 digital
(42): Mailing X1_l_1_2 is [X] telephone is digital
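To make the generalization in step S21 concrete, the sketch below builds an outer template from the left and right context of corpus (1). It takes the spans to generalize as explicit arguments, whereas the description selects them randomly; the function names `normalize` and `generalize` and the span format are assumptions for illustration. Spans on the same side are assumed not to overlap.

```python
def normalize(token):
    """Replace a purely numeric token with the placeholder 'digital'."""
    return "digital" if token.isdigit() else token


def generalize(left, right, spans):
    """Build an outer template from context words.

    left, right: context words to the left and right of the entity.
    spans: (side, start, end) tuples counted outward from the entity.
    """
    sides = {
        "l": [normalize(w) for w in left[::-1]],
        "r": [normalize(w) for w in right],
    }
    # Replace outermost spans first so inner positions stay valid.
    for side, start, end in sorted(spans, key=lambda s: -s[1]):
        sides[side][start:end] = [f"X{end - start}_{side}_{start}_{end}"]
    return " ".join(sides["l"][::-1] + ["[X]"] + sides["r"])


left = ["8", "street", "address"]
right = ["the", "phone", "number", "2149778222"]

print(generalize(left, right, [("l", 0, 1), ("r", 0, 1)]))
print(generalize(left, right, [("l", 1, 2), ("r", 1, 3)]))
```

The first call reproduces template (11). The second yields the same structure as template (13), with the metacharacter written in upper case.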
In step S22, duplicate preliminary templates are removed and preliminary templates that have an inclusion relation are merged, to obtain alternative templates.
For example, if preliminary template (11), "digital street X1_l_0_1 [X] X1_r_0_1 phone number digital", is obtained from several different sentences, only one copy is kept.
As another example, preliminary template (13), "digital X1_l_1_2 address [X] the x2_r_1_3 digital", includes preliminary template (12), "digital X1_l_1_2 address [X] the x1_r_1_2 number digital", because preliminary template (13) is more general. Preliminary template (12) is therefore merged into preliminary template (13), and only preliminary template (13) is kept for the subsequent steps.
The purpose of step S22 is to cover all training instances with as few templates as possible.
Note that all training instances must remain covered: merging must not reduce the coverage of the preliminary templates.
In the above example, the templates finally retained should be template (31) and template (41). Template (31) covers corpora (1) and (3), and template (41) covers corpora (2) and (4).
In step S23, the score of each alternative template is computed from the evaluation values of the combinations of ordinary characters and metacharacters in the reference table that match the alternative template.
As described above, because the reference table is generated from the training corpus and the templates are extracted from the same training corpus, the corresponding entries can always be found in the reference table.
The sum or the mean of the evaluation values of the combinations of ordinary characters and metacharacters in the reference table that correspond to an alternative template can be used as the score of that alternative template.
For example, suppose the alternative template is (13): digital X1_l_1_2 address [X] the x2_r_1_3 digital. The entry "street", "X1_l_1_2" and the entry "phone number", "x2_r_1_3" are then looked up in the reference table, and the arithmetic mean of the evaluation values of these two entries is used as the score of alternative template (13).
In step S24, a predetermined number of the highest-scoring alternative templates are selected as the extracted candidate templates.
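Steps S23 and S24 can be sketched as follows. The reference table is assumed to be a dictionary keyed by (metacharacter, ordinary string), and each alternative template is assumed to carry the list of entries it was generalized from. These data structures, the function names, and all evaluation values below are hypothetical, chosen only to make the example runnable.

```python
from statistics import mean


def score_template(entries, reference_table, use_mean=True):
    """Score a template by the mean (or sum) of its entries' values."""
    values = [reference_table[entry] for entry in entries]
    return mean(values) if use_mean else sum(values)


def select_candidates(alternatives, reference_table, k):
    """Return the k highest-scoring alternative templates (step S24)."""
    scores = {
        template: score_template(entries, reference_table)
        for template, entries in alternatives.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical evaluation values, for illustration only.
reference_table = {
    ("X1_l_0_1", "address"): 1,
    ("X1_r_0_1", "the"): 3,
    ("X1_l_1_2", "street"): 2,
    ("X2_r_1_3", "phone number"): 6,
}
alternatives = {
    "digital street X1_l_0_1 [X] X1_r_0_1 phone number digital": [
        ("X1_l_0_1", "address"),
        ("X1_r_0_1", "the"),
    ],
    "digital X1_l_1_2 address [X] the X2_r_1_3 digital": [
        ("X1_l_1_2", "street"),
        ("X2_r_1_3", "phone number"),
    ],
}
best = select_candidates(alternatives, reference_table, k=1)
```

Whether the sum or the mean is the better score depends on how templates with different numbers of generalized parts should compare; the mean avoids favoring templates merely for having more metacharacters.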
In step S3, the validity of the candidate templates is verified using the verification corpus.
Specifically, the candidate template is first used to extract entities from the verification corpus. The extracted entities are then compared with the entities annotated in the verification corpus to check whether they are consistent.
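One possible way to carry out this check is to compile an outer template into a regular expression and compare the captured entity with the annotation. The sketch below makes several assumptions that are not stated in the patent: tokens are separated by whitespace, "digital" matches a run of digits, a metacharacter Xn matches any n words, and the template string used in the example is hypothetical.

```python
import re

META_RE = re.compile(r"^[Xx](\d+)_[lr]_\d+_\d+$")
ANNOT_RE = re.compile(r"^(.*?)\[GEO\](.*?)\[/GEO\](.*)$")


def template_to_regex(template):
    """Compile an outer template into a regex with one capture group."""
    parts = []
    for token in template.split():
        meta = META_RE.match(token)
        if token == "[X]":
            parts.append(r"(.+?)")            # the entity to extract
        elif token == "digital":
            parts.append(r"\d+")              # any number
        elif meta:
            n_words = int(meta[1])            # metacharacter: any n words
            parts.append(r"\s+".join([r"\S+"] * n_words))
        else:
            parts.append(re.escape(token))    # literal context word
    return re.compile(r"\s+".join(parts))


def extract(template, text):
    """Return the entity captured by the template, or None."""
    m = template_to_regex(template).search(text)
    return m.group(1) if m else None


def is_valid(template, annotated_sentence):
    """Check the extracted entity against the annotated entity."""
    m = ANNOT_RE.match(annotated_sentence)
    left, gold, right = (" ".join(g.split()) for g in m.groups())
    plain = " ".join(part for part in (left, gold, right) if part)
    return extract(template, plain) == gold


template = "digital X1_l_1_2 address [X] X1_r_0_1 phone number digital"
s1 = ("8 street address [GEO] 508 Young St., Dallas, TX 75202 [/GEO] "
      "the phone number 2149778222")
s2 = ("physical address at [GEO] 305 S. Congress Ave. Austin, Texas 78704 "
      "[/GEO] telephone number digital")
print(is_valid(template, s1), is_valid(template, s2))
```

In this sketch the template extracts the annotated address from sentence (1) but finds no match in sentence (2), so it would count as valid for the first and invalid for the second.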
In step S4, the relevant evaluation values in the reference table are adjusted according to the verification result.
The principle of the adjustment is as follows: when the verification result shows that a candidate template is valid, the evaluation values in the reference table that relate to this candidate template are increased; otherwise, they are decreased. The specific amounts, ratios, and formulas used for the adjustment can be designed flexibly by those skilled in the art.
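Because the description leaves the adjustment formula open, the following is only one possible choice: a fixed additive step with a lower bound. The `step` and `floor` parameters and the function name are assumptions made for this sketch.

```python
def adjust_evaluation_values(reference_table, entries, valid,
                             step=1.0, floor=0.0):
    """Raise the values of a template's entries if valid, else lower them.

    The additive step and the floor are illustrative assumptions; the
    description does not prescribe a specific formula.
    """
    delta = step if valid else -step
    for entry in entries:
        current = reference_table.get(entry, 0.0)
        reference_table[entry] = max(floor, current + delta)
    return reference_table


table = {("X1_l_0_1", "address"): 1.0, ("X1_r_0_1", "the"): 3.0}
adjust_evaluation_values(table, [("X1_l_0_1", "address")], valid=True)
adjust_evaluation_values(table, [("X1_r_0_1", "the")], valid=False)
```

A multiplicative update or a step proportional to the number of correct extractions would fit the stated principle equally well.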
In step S5, based on the adjusted reference table, the above extraction (step S2), verification (step S3), and adjustment (step S4) are repeated until a predetermined stop condition is satisfied.
The predetermined stop condition includes: the number of repetitions reaches a predetermined number, or the candidate templates extracted in the current round are identical to those extracted in the previous round.
After several rounds of iteration, the reference table (that is, its evaluation values) is progressively refined, and the candidate templates extracted from the optimized reference table can be expected to be the most reliable.
Therefore, in step S6, the candidate templates obtained when the predetermined stop condition is satisfied are determined as the extracted templates.
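The overall loop of steps S2 to S6 can be sketched as below. The three callables stand in for the extraction, verification, and adjustment steps; their names and signatures, and the toy values in the usage example, are assumptions chosen only to show the control flow and the two stop conditions.

```python
def learn_templates(extract_candidates, verify, adjust, table, max_rounds=10):
    """Iterate S2-S4 until the candidates repeat or max_rounds is reached."""
    previous = None
    candidates = []
    for _ in range(max_rounds):                 # stop: repetition limit
        candidates = extract_candidates(table)  # step S2
        if candidates == previous:              # stop: same as last round
            break
        for template in candidates:
            ok = verify(template)               # step S3
            adjust(table, template, ok)         # step S4
        previous = candidates
    return candidates                           # step S6


# Toy stand-ins: one value per template, best-valued template is extracted.
def toy_extract(table):
    return [max(table, key=table.get)]


def toy_verify(template):
    return template == "B"


def toy_adjust(table, template, ok):
    table[template] += 1 if ok else -1


toy_table = {"A": 1, "B": 1}
result = learn_templates(toy_extract, toy_verify, toy_adjust, toy_table)
```

In the toy run, template "A" is tried first, fails verification, and loses value, after which "B" is selected twice in a row and the loop stops.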
It should be understood that, although steps S1 to S6 above are described using outer templates as the example, they apply equally to the training of inner templates.
In addition, the reference table created and adjusted on the basis of the above training corpus and verification corpus can also be used to extract templates from other corpora, and other corpora can be used to update the reference table.
For example, based on the extracted templates, entities are extracted from a new annotated extended corpus. When an extracted entity is consistent with the entity annotated in the extended corpus, the words or characters in the extended corpus are used to update the ordinary characters in the reference table.
For example, suppose template (31), "digital X1_l_1_2 address [X] X1_r_0_1 phone x_1_r_1_2 digital", correctly extracts the address in the extended corpus, and X1_l_1_2 in template (31) corresponds to the generalized ordinary character "road". If the related entry in the reference table is "X1_l_1_2", "street" and there is no entry "X1_l_1_2", "road", then the entry "X1_l_1_2", "road", "1" can be added to the reference table.
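The update described in this example amounts to adding a missing (metacharacter, ordinary string) entry with an initial evaluation value. A minimal sketch follows, assuming the same dictionary layout as a hypothetical reference table keyed by (metacharacter, ordinary string); the function name, the initial value parameter, and the value 4 for the existing entry are illustrative.

```python
def add_entry_if_missing(reference_table, metachar, ordinary, initial=1):
    """Add a new entry after a correct extraction on the extended corpus.

    Returns True if an entry was added, False if it already existed.
    """
    key = (metachar, ordinary)
    if key in reference_table:
        return False
    reference_table[key] = initial
    return True


table = {("X1_l_1_2", "street"): 4}   # 4 is a made-up evaluation value
added = add_entry_if_missing(table, "X1_l_1_2", "road")
```

Existing entries are left untouched here; raising their evaluation values after a correct extraction would be a separate adjustment, as in step S4.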
Then, new templates can also be extracted from the extended corpus according to the updated reference table.
The extraction method here is similar to the iteration in step S5 above, except that the corpus is replaced by the extended corpus, so it is not described again.
In addition, the outer templates and inner templates extracted by the template extraction method of the embodiments of the present invention can be used to extract entities from an unannotated corpus.
An outer template or an inner template alone can be used to extract the entity at the corresponding position directly from an unannotated corpus.
Compared with using outer templates or inner templates alone, using them in combination yields better entity extraction results.
Fig. 3 is a flow chart of an entity extraction method according to the first embodiment of the present invention. As shown in Fig. 3, the method includes the following steps: extracting first entities from a new unannotated corpus based on the extracted outer templates (step S31); extracting second entities from the new unannotated corpus based on the extracted inner templates (step S32); and taking the intersection of the first entities and the second entities as the entity extraction result (step S33).
Fig. 4 is a flow chart of an entity extraction method according to the second embodiment of the present invention. As shown in Fig. 4, the method includes the following steps: extracting third entities from a new unannotated corpus based on the extracted outer templates (step S41); and filtering the extracted third entities based on the extracted inner templates, the filtered entities being the entity extraction result (step S42).
Fig. 5 is a flow chart of an entity extraction method according to the third embodiment of the present invention. As shown in Fig. 5, the method includes the following steps: extracting third entities from a new unannotated corpus based on the extracted inner templates (step S51); and filtering the extracted third entities based on the extracted outer templates, the filtered entities being the entity extraction result (step S52).
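The three ways of combining outer and inner templates can be sketched with plain Python collections. The entity lists and the `ends_with_zip` predicate below are made-up stand-ins for real template matching, used only to contrast intersection (first embodiment) with extract-then-filter (second and third embodiments).

```python
import re


def extract_by_intersection(outer_entities, inner_entities):
    """First embodiment: keep entities found by both template types."""
    return set(outer_entities) & set(inner_entities)


def extract_then_filter(entities, accepts):
    """Second and third embodiments: extract with one template type,
    then keep only the entities accepted by the other."""
    return [entity for entity in entities if accepts(entity)]


def ends_with_zip(entity):
    """Hypothetical inner-template check: ends with a 5-digit ZIP code."""
    return re.search(r"\b\d{5}$", entity) is not None


# Made-up extraction results for illustration.
from_outer = ["508 Young St., Dallas, TX 75202", "call us today"]
from_inner = ["508 Young St., Dallas, TX 75202", "Austin, Texas 78704"]

both = extract_by_intersection(from_outer, from_inner)
filtered = extract_then_filter(from_outer, ends_with_zip)
```

Intersection requires running both template types over the whole corpus, whereas the filtering variants apply the second template type only to entities that the first one has already extracted.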
Below, will be used for extracting the equipment of the template of entity with reference to Fig. 6 extraction described according to embodiments of the present invention.
Fig. 6 illustrates the block diagram extracting the equipment for the template that extracts entity according to embodiments of the present invention. As shown in Figure 6, template extraction equipment 600 according to the present invention includes: reference table creates device 61, it is used for: from corpus, creates reference table, and described reference table includes: meet general character and the combination of metacharacter, the corresponding evaluation of estimate of the first predetermined condition; Candidate template extraction element 62, it is used for: according to reference table, from corpus, extracts candidate template; Validation verification device 63, it is used for: utilize checking language material, the effectiveness of checking candidate template; Evaluation of estimate adjusting apparatus 64, it is used for: according to the result, adjust the relevant evaluation value in described reference table; And controlling device 65, it is used for: candidate template when described second predetermined condition being satisfied when second predetermined condition is defined as the template extracted.
In one embodiment, described template includes the outer template of the context for template extracting object and for the inside template of template extracting object itself, represent corresponding to the described general character of described outer template template extracting object context by an extensive part, number and the relative position of word that described general character is corresponding is represented corresponding to the described metacharacter of described outer template, corresponding to the described general character of described internal template represent template extracting object itself by an extensive part, number and the relative position of word that described general character is corresponding is represented corresponding to the described metacharacter of described internal template.
In one embodiment, described predetermined condition includes: the number of the word that the size of contextual window, general character are corresponding.
In one embodiment, described corpus and checking language material are all the documents being labelled with the entity that should extract, and described corpus is used for creating reference table and extracting candidate template, and described checking language material is used for verifying candidate template.
In one embodiment, described reference table creates device 61 and is used for: for each sentence in described corpus, the mark according to wherein entity for extracting, according to described predetermined condition, extracts described general character and metacharacter; According to the number of times that described general character occurs in described corpus, calculate corresponding evaluation of estimate.
In one embodiment, the candidate template extracting device 62 is configured to extract the candidate templates using a beam search algorithm or the bottom-up CYK (Cocke-Younger-Kasami) algorithm.
In one embodiment, the candidate template extracting device 62 is configured to: for each sentence in the training corpus, randomly select one or more words or phrases at a time to generalize, so as to obtain preliminary templates; remove duplicate preliminary templates and merge preliminary templates that have an inclusion relation, so as to obtain alternative templates; calculate the score of each alternative template according to the evaluation values, in the reference table, of the combinations of ordinary characters and metacharacters that match the alternative template; and select a predetermined number of the highest-scoring alternative templates as the candidate templates.
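The four steps above (random generalization, deduplication and merging, scoring, top-k selection) might look like the following sketch. Several things here are assumptions made for illustration: the `*` wildcard for a generalized position, a simplified reference table keyed by word only, "inclusion" interpreted as one template being at least as general as another at every position, and merging resolved by keeping the more general template.

```python
import random

WILDCARD = "*"  # hypothetical marker for a generalized position

def generalize(tokens, rng, max_slots=2):
    """Replace one to max_slots randomly chosen tokens with the wildcard."""
    n = rng.randint(1, min(max_slots, len(tokens)))
    slots = set(rng.sample(range(len(tokens)), n))
    return tuple(WILDCARD if i in slots else t for i, t in enumerate(tokens))

def covers(general, specific):
    """True if `general` matches everything that `specific` matches."""
    return len(general) == len(specific) and all(
        g == WILDCARD or g == s for g, s in zip(general, specific))

def dedupe_and_merge(templates):
    """Drop duplicates, then drop templates covered by a more general one."""
    unique = list(dict.fromkeys(templates))  # keeps first-seen order
    return [t for t in unique
            if not any(o != t and covers(o, t) for o in unique)]

def score(template, table):
    """Sum the evaluation values of the ordinary characters left in the template."""
    return sum(table.get(tok, 0.0) for tok in template if tok != WILDCARD)

def extract_candidates(sentences, table, k, rounds=20, seed=0):
    rng = random.Random(seed)
    preliminary = [generalize(s, rng) for s in sentences for _ in range(rounds)]
    alternatives = dedupe_and_merge(preliminary)
    return sorted(alternatives, key=lambda t: score(t, table), reverse=True)[:k]

table = {"Dr": 0.5, "said": 0.25}
candidates = extract_candidates([["Dr", "Smith", "said"]], table, k=2)
```

Keeping the more general template when two overlap is only one possible reading of "merge"; a stricter variant could keep the more specific one instead.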
In one embodiment, the validity verifying device 63 is configured to: extract entities from the validation corpus using the candidate templates; and compare the extracted entities with the entities annotated in the validation corpus for consistency.
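As a rough illustration of this verification step, the sketch below applies a template with a single wildcard slot to tokenized validation sentences and checks how many extracted tokens agree with the annotations. The single-slot template format, the function names, and the precision threshold `min_precision` are assumptions of the example; the text itself only requires a consistency comparison.

```python
WILDCARD = "*"  # hypothetical marker for the entity slot

def extract_entities(template, tokens):
    """Return the tokens that fall on the wildcard slot wherever the template matches."""
    slot = template.index(WILDCARD)  # assumes exactly one slot per template
    found = []
    for i in range(len(tokens) - len(template) + 1):
        window = tokens[i:i + len(template)]
        if all(t == WILDCARD or t == w for t, w in zip(template, window)):
            found.append(window[slot])
    return found

def is_valid(template, validation_corpus, min_precision=0.5):
    """validation_corpus: list of (tokens, set of annotated entity strings)."""
    extracted = correct = 0
    for tokens, annotated in validation_corpus:
        for entity in extract_entities(template, tokens):
            extracted += 1
            correct += entity in annotated
    # A template that extracts nothing is treated as not valid in this sketch.
    return extracted > 0 and correct / extracted >= min_precision

validation_corpus = [
    (["Dr", "Smith", "said", "hello"], {"Smith"}),
    (["Dr", "who", "said", "it"], set()),
]
```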
In one embodiment, the evaluation value adjusting device 64 is configured to: when the verification result shows that a candidate template is valid, increase the evaluation values in the reference table that are related to that candidate template; otherwise, decrease the evaluation values in the reference table that are related to that candidate template.
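The adjustment rule can be illustrated with a short sketch. The multiplicative update and the `step` size are assumptions; the text only states that related evaluation values are increased for a valid template and decreased otherwise. The reference table is again simplified to a mapping from word to evaluation value.

```python
WILDCARD = "*"  # hypothetical marker for a generalized position

def adjust_table(table, template, valid, step=0.1):
    """Return a copy of the table with the template's ordinary characters
    rewarded (valid template) or penalized (invalid template)."""
    factor = 1 + step if valid else 1 - step
    updated = dict(table)
    for tok in template:
        if tok != WILDCARD and tok in updated:
            updated[tok] *= factor
    return updated

table = {"Dr": 0.5, "said": 0.25, "the": 0.25}
rewarded = adjust_table(table, ("Dr", "*", "said"), valid=True)
penalized = adjust_table(table, ("Dr", "*", "said"), valid=False)
```

Entries that do not appear in the template, such as "the" here, are left unchanged, so only the evidence that produced the template is reinforced or weakened.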
In one embodiment, the predetermined stop condition includes: the number of repetitions reaching a predetermined number, or the candidate templates extracted this time being identical to those extracted last time.
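Putting the stop condition together with extraction, verification, and adjustment gives an iterative loop along the following lines. This is a minimal sketch: the three callables `extract`, `verify`, and `adjust` are hypothetical stand-ins for the steps described above, and `max_rounds` (which must be at least 1) plays the role of the predetermined number of repetitions.

```python
def iterate_until_stable(extract, verify, adjust, table, max_rounds=10):
    """Repeat extract -> verify -> adjust until the candidates stop changing
    or max_rounds is reached; return (candidates, table, rounds_used)."""
    previous = None
    for round_no in range(1, max_rounds + 1):
        candidates = extract(table)
        if candidates == previous:
            break  # same candidates as last time: stop condition met
        for template in candidates:
            table = adjust(table, template, verify(template))
        previous = candidates
    return candidates, table, round_no

# Toy stand-ins: templates are plain strings scored directly by the table.
toy_extract = lambda table: sorted(table, key=table.get, reverse=True)[:1]
toy_verify = lambda template: template == "a"
toy_adjust = lambda table, t, ok: {**table, t: table[t] + (1 if ok else -1)}

result = iterate_until_stable(toy_extract, toy_verify, toy_adjust, {"a": 1, "b": 2})
```

In the toy run, the invalid template "b" is penalized in the first round, the valid template "a" overtakes it in the second, and the third round reproduces the same candidate list, which ends the loop.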
In one embodiment, the template extraction apparatus 600 further includes: an entity extracting device configured to extract, based on the extracted templates, entities from a new annotated extension corpus; and an updating device configured to, when the extracted entities are consistent with the entities annotated in the extension corpus, update the ordinary characters in the reference table using the characters or words in the extension corpus.
In one embodiment, the template extraction apparatus 600 further includes: a template extracting unit configured to extract new templates from the extension corpus according to the updated reference table.
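One possible reading of the reference table update is sketched below. The text does not say how the ordinary characters are updated, so the choice here (adding previously unseen context words with a neutral `initial_value`) is purely an assumption, as are the function name and the `extract` callable that stands in for applying the extracted templates.

```python
def update_reference_table(table, extension_corpus, extract, initial_value=0.0):
    """extension_corpus: list of (tokens, set of annotated entity strings).
    extract: callable returning the entities found in a token list."""
    updated = dict(table)
    for tokens, annotated in extension_corpus:
        if set(extract(tokens)) != set(annotated):
            continue  # extraction disagrees with the annotation: skip sentence
        for tok in tokens:
            if tok not in annotated:
                updated.setdefault(tok, initial_value)  # add new ordinary character
    return updated

extension_corpus = [
    (["Prof", "Lee", "spoke"], {"Lee"}),
    (["Prof", "of", "note"], {"note"}),
]
second_token = lambda tokens: [tokens[1]]  # toy extractor for the example
updated = update_reference_table({"Dr": 0.5}, extension_corpus, second_token)
```

New templates could then be extracted from the extension corpus with the enlarged table, in the same way candidate templates were extracted from the training corpus.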
In one embodiment, the template extraction apparatus 600 further includes: a first entity extracting unit configured to extract first entities from a new unannotated corpus based on the extracted external template; a second entity extracting unit configured to extract second entities from the new unannotated corpus based on the extracted internal template; and an extraction result determining unit configured to take the intersection of the first entities and the second entities as the entity extraction result.
In one embodiment, the template extraction apparatus 600 further includes: a third entity extracting unit configured to extract third entities from a new unannotated corpus based on one of the extracted external template and internal template; and a filtering unit configured to filter the extracted third entities based on the other of the extracted external template and internal template, the filtered entities serving as the entity extraction result.
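The two ways of combining external and internal templates can be contrasted in a small sketch. Regular expressions are used here only as hypothetical stand-ins for the extracted templates: `OUTER` captures whatever follows the context "Dr. ", and `INNER` describes the shape of the entity itself (a capitalized word).

```python
import re

OUTER = re.compile(r"Dr\. (\w+)")       # stand-in for an external (context) template
INNER = re.compile(r"\b[A-Z][a-z]+\b")  # stand-in for an internal (entity-shape) template

def by_intersection(text):
    """Extract with both templates independently and intersect the results."""
    first = set(OUTER.findall(text))
    second = set(INNER.findall(text))
    return first & second

def by_filtering(text):
    """Extract with one template, then filter the results with the other."""
    third = OUTER.findall(text)
    return {entity for entity in third if INNER.fullmatch(entity)}

text = "Dr. Smith met Dr. who in Paris"
```

On this input the external stand-in alone would also return "who", and the internal stand-in alone would also return "Dr" and "Paris"; both combinations keep only "Smith". The filtering variant avoids running the second template over the whole corpus.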
Since the processing in each device and unit of the template extraction apparatus 600 according to the present invention is similar to the processing in the corresponding step of the template extraction method described above, a detailed description of these devices and units is omitted here for brevity.
In addition, it should be noted that each constituent device and unit of the above apparatus may be configured by software, firmware, hardware, or a combination thereof. The specific means or manner of configuration is well known to those skilled in the art and is not repeated here. When implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 700 shown in Fig. 7), and the computer, once the various programs are installed, can perform various functions.
Fig. 7 is a schematic block diagram of a computer that can be used to implement the method and apparatus according to embodiments of the present invention.
In Fig. 7, a central processing unit (CPU) 701 performs various processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores, as needed, the data required when the CPU 701 performs the various processing. The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.
The following components are connected to the input/output interface 705: an input section 706 (including a keyboard, a mouse, etc.), an output section 707 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, etc.), the storage section 708 (including a hard disk, etc.), and a communication section 709 (including a network interface card such as a LAN card, a modem, etc.). The communication section 709 performs communication processing via a network such as the Internet. A drive 710 may also be connected to the input/output interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it is installed into the storage section 708 as needed.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 711.
Those skilled in the art will understand that the storage medium is not limited to the removable medium 711 shown in Fig. 7, which stores the program and is distributed separately from the apparatus in order to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk included in the storage section 708, or the like, in which the program is stored and which is distributed to the user together with the apparatus containing it.
The present invention also proposes a program product storing machine-readable instruction codes. When the instruction codes are read and executed by a machine, the above method according to embodiments of the present invention can be performed.
Accordingly, a storage medium carrying the above program product storing machine-readable instruction codes is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disk, a memory card, a memory stick, and the like.
In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, combined with features in other embodiments, or substituted for features in other embodiments.
It should be emphasized that the term "include/comprise", when used herein, refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the methods of the present invention are not limited to being performed in the chronological order described in the specification; they may also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
Although the present invention has been disclosed above through the description of specific embodiments, it should be understood that all of the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the present invention.
Supplementary Notes
1. A method for extracting templates used for entity extraction, comprising:
Creating a reference table from a training corpus, the reference table including combinations of ordinary characters and metacharacters that satisfy a first predetermined condition, together with their corresponding evaluation values;
Extracting candidate templates from the training corpus according to the reference table;
Verifying the validity of the candidate templates using a validation corpus;
Adjusting the relevant evaluation values in the reference table according to the verification result; and
When a second predetermined condition, namely the predetermined stop condition, is satisfied, determining the candidate templates at the time the second predetermined condition is satisfied as the extracted templates.
2. The method according to supplementary note 1, wherein the templates include an external template for the context of the template extraction object and an internal template for the template extraction object itself; the ordinary characters of the external template represent the generalized part of the context of the template extraction object, and the metacharacters of the external template represent the number and relative position of the words corresponding to those ordinary characters; the ordinary characters of the internal template represent the generalized part of the template extraction object itself, and the metacharacters of the internal template represent the number and relative position of the words corresponding to those ordinary characters.
3. The method according to supplementary note 1, wherein the training corpus and the validation corpus are both documents in which the entities to be extracted have been annotated, the training corpus is used to create the reference table and to extract candidate templates, and the validation corpus is used to verify the candidate templates.
4. The method according to supplementary note 1, wherein creating the reference table from the training corpus comprises:
For each sentence in the training corpus, extracting the ordinary characters and metacharacters according to the annotations of the entities to be extracted and according to the predetermined condition;
Calculating the corresponding evaluation values according to the number of times the ordinary characters occur in the training corpus.
5. The method according to supplementary note 1, wherein extracting candidate templates from the training corpus according to the reference table comprises:
Extracting the candidate templates using a beam search algorithm or the bottom-up CYK (Cocke-Younger-Kasami) algorithm.
6. The method according to supplementary note 5, wherein extracting candidate templates from the training corpus according to the reference table comprises:
For each sentence in the training corpus, randomly selecting one or more words or phrases at a time to generalize, so as to obtain preliminary templates;
Removing duplicate preliminary templates and merging preliminary templates that have an inclusion relation, so as to obtain alternative templates;
Calculating the score of each alternative template according to the evaluation values, in the reference table, of the combinations of ordinary characters and metacharacters that match the alternative template;
Selecting a predetermined number of the highest-scoring alternative templates as the candidate templates.
7. The method according to supplementary note 1, wherein verifying the validity of the candidate templates using the validation corpus comprises:
Extracting entities from the validation corpus using the candidate templates;
Comparing the extracted entities with the entities annotated in the validation corpus for consistency.
8. The method according to supplementary note 1, wherein adjusting the relevant evaluation values in the reference table according to the verification result comprises:
When the verification result shows that a candidate template is valid, increasing the evaluation values in the reference table that are related to that candidate template;
Otherwise, decreasing the evaluation values in the reference table that are related to that candidate template.
9. The method according to supplementary note 1, wherein the predetermined stop condition includes: the number of repetitions reaching a predetermined number, or the candidate templates extracted this time being identical to those extracted last time.
10. The method according to supplementary note 1, further comprising, after the determining step:
Extracting, based on the extracted templates, entities from a new annotated extension corpus;
When the extracted entities are consistent with the entities annotated in the extension corpus, updating the ordinary characters in the reference table using the characters or words in the extension corpus.
11. The method according to supplementary note 10, further comprising:
Extracting new templates from the extension corpus according to the updated reference table.
12. The method according to supplementary note 1, further comprising, after the determining step:
Extracting first entities from a new unannotated corpus based on the extracted external template;
Extracting second entities from the new unannotated corpus based on the extracted internal template;
Taking the intersection of the first entities and the second entities as the entity extraction result.
13. The method according to supplementary note 1, further comprising, after the determining step:
Extracting third entities from a new unannotated corpus based on one of the extracted external template and internal template;
Filtering the extracted third entities based on the other of the extracted external template and internal template, the filtered entities serving as the entity extraction result.
14. An apparatus for extracting templates used for entity extraction, comprising:
A reference table creating device configured to create a reference table from a training corpus, the reference table including combinations of ordinary characters and metacharacters that satisfy a first predetermined condition, together with their corresponding evaluation values;
A candidate template extracting device configured to extract candidate templates from the training corpus according to the reference table;
A validity verifying device configured to verify the validity of the candidate templates using a validation corpus;
An evaluation value adjusting device configured to adjust the relevant evaluation values in the reference table according to the verification result; and
A control device configured to determine, when a second predetermined condition, namely the predetermined stop condition, is satisfied, the candidate templates at the time the second predetermined condition is satisfied as the extracted templates.
15. The apparatus according to supplementary note 14, wherein the templates include an external template for the context of the template extraction object and an internal template for the template extraction object itself; the ordinary characters of the external template represent the generalized part of the context of the template extraction object, and the metacharacters of the external template represent the number and relative position of the words corresponding to those ordinary characters; the ordinary characters of the internal template represent the generalized part of the template extraction object itself, and the metacharacters of the internal template represent the number and relative position of the words corresponding to those ordinary characters.
16. The apparatus according to supplementary note 14, wherein the training corpus and the validation corpus are both documents in which the entities to be extracted have been annotated, the training corpus is used to create the reference table and to extract candidate templates, and the validation corpus is used to verify the candidate templates.
17. The apparatus according to supplementary note 14, wherein the reference table creating device is configured to:
For each sentence in the training corpus, extract the ordinary characters and metacharacters according to the annotations of the entities to be extracted and according to the predetermined condition;
Calculate the corresponding evaluation values according to the number of times the ordinary characters occur in the training corpus.
18. The apparatus according to supplementary note 14, wherein the candidate template extracting device is configured to:
For each sentence in the training corpus, randomly select one or more words or phrases at a time to generalize, so as to obtain preliminary templates;
Remove duplicate preliminary templates and merge preliminary templates that have an inclusion relation, so as to obtain alternative templates;
Calculate the score of each alternative template according to the evaluation values, in the reference table, of the combinations of ordinary characters and metacharacters that match the alternative template;
Select a predetermined number of the highest-scoring alternative templates as the candidate templates.
19. The apparatus according to supplementary note 14, wherein the validity verifying device is configured to:
Extract entities from the validation corpus using the candidate templates;
Compare the extracted entities with the entities annotated in the validation corpus for consistency.
20. The apparatus according to supplementary note 14, wherein the evaluation value adjusting device is configured to:
When the verification result shows that a candidate template is valid, increase the evaluation values in the reference table that are related to that candidate template;
Otherwise, decrease the evaluation values in the reference table that are related to that candidate template.

Claims (10)

1. A method for extracting templates used for entity extraction, comprising:
Creating a reference table from a training corpus, the reference table including combinations of ordinary characters and metacharacters that satisfy a first predetermined condition, together with their corresponding evaluation values;
Extracting candidate templates from the training corpus according to the reference table;
Verifying the validity of the candidate templates using a validation corpus;
Adjusting the relevant evaluation values in the reference table according to the verification result; and
When a second predetermined condition is satisfied, determining the candidate templates at the time the second predetermined condition is satisfied as the extracted templates.
2. The method of claim 1, wherein the templates include an external template for the context of the template extraction object and an internal template for the template extraction object itself; the ordinary characters of the external template represent the generalized part of the context of the template extraction object, and the metacharacters of the external template represent the number and relative position of the words corresponding to those ordinary characters; the ordinary characters of the internal template represent the generalized part of the template extraction object itself, and the metacharacters of the internal template represent the number and relative position of the words corresponding to those ordinary characters.
3. The method of claim 1, wherein creating the reference table from the training corpus comprises:
For each sentence in the training corpus, extracting the ordinary characters and metacharacters according to the annotations of the entities to be extracted and according to the predetermined condition;
Calculating the corresponding evaluation values according to the number of times the ordinary characters occur in the training corpus.
4. The method of claim 1, wherein extracting candidate templates from the training corpus according to the reference table comprises:
For each sentence in the training corpus, randomly selecting one or more words or phrases at a time to generalize, so as to obtain preliminary templates;
Removing duplicate preliminary templates and merging preliminary templates that have an inclusion relation, so as to obtain alternative templates;
Calculating the score of each alternative template according to the evaluation values, in the reference table, of the combinations of ordinary characters and metacharacters that match the alternative template;
Selecting a predetermined number of the highest-scoring alternative templates as the candidate templates.
5. The method of claim 1, wherein verifying the validity of the candidate templates using the validation corpus comprises:
Extracting entities from the validation corpus using the candidate templates;
Comparing the extracted entities with the entities annotated in the validation corpus for consistency.
6. The method of claim 1, wherein adjusting the relevant evaluation values in the reference table according to the verification result comprises:
When the verification result shows that a candidate template is valid, increasing the evaluation values in the reference table that are related to that candidate template;
Otherwise, decreasing the evaluation values in the reference table that are related to that candidate template.
7. The method of claim 1, further comprising, after the determining step:
Extracting, based on the extracted templates, entities from a new annotated extension corpus;
When the extracted entities are consistent with the entities annotated in the extension corpus, updating the ordinary characters in the reference table using the characters or words in the extension corpus.
8. The method of claim 1, further comprising, after the determining step:
Extracting first entities from a new unannotated corpus based on the extracted external template;
Extracting second entities from the new unannotated corpus based on the extracted internal template;
Taking the intersection of the first entities and the second entities as the entity extraction result.
9. The method of claim 1, further comprising, after the determining step:
Extracting third entities from a new unannotated corpus based on one of the extracted external template and internal template;
Filtering the extracted third entities based on the other of the extracted external template and internal template, the filtered entities serving as the entity extraction result.
10. An apparatus for extracting templates used for entity extraction, comprising:
A reference table creating device configured to create a reference table from a training corpus, the reference table including combinations of ordinary characters and metacharacters that satisfy a first predetermined condition, together with their corresponding evaluation values;
A candidate template extracting device configured to extract candidate templates from the training corpus according to the reference table;
A validity verifying device configured to verify the validity of the candidate templates using a validation corpus;
An evaluation value adjusting device configured to adjust the relevant evaluation values in the reference table according to the verification result; and
A control device configured to determine, when a second predetermined condition is satisfied, the candidate templates at the time the second predetermined condition is satisfied as the extracted templates.
CN201410663066.9A 2014-11-19 2014-11-19 Method and device for taking temperature for extracting entities Pending CN105677632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410663066.9A CN105677632A (en) 2014-11-19 2014-11-19 Method and device for taking temperature for extracting entities

Publications (1)

Publication Number Publication Date
CN105677632A true CN105677632A (en) 2016-06-15

Family

ID=56945655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410663066.9A Pending CN105677632A (en) 2014-11-19 2014-11-19 Method and device for taking temperature for extracting entities

Country Status (1)

Country Link
CN (1) CN105677632A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598945A (en) * 2016-12-02 2017-04-26 北京小米移动软件有限公司 Template inspection method and device
CN106598945B (en) * 2016-12-02 2019-08-06 北京小米移动软件有限公司 The template method of inspection and device
WO2020087949A1 (en) * 2018-11-01 2020-05-07 北京市商汤科技开发有限公司 Database updating method and device, electronic device, and computer storage medium
CN111125391A (en) * 2018-11-01 2020-05-08 北京市商汤科技开发有限公司 Database updating method and device, electronic equipment and computer storage medium
CN111858900A (en) * 2020-09-21 2020-10-30 杭州摸象大数据科技有限公司 Method, device, equipment and storage medium for generating question semantic parsing rule template
CN111858900B (en) * 2020-09-21 2020-12-25 杭州摸象大数据科技有限公司 Method, device, equipment and storage medium for generating question semantic parsing rule template
CN113408271A (en) * 2021-06-16 2021-09-17 北京来也网络科技有限公司 Information extraction method, device, equipment and medium based on RPA and AI

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160615