CN108959286A - Information extraction method and information extraction equipment - Google Patents

Publication number
CN108959286A
Authority
CN
China
Prior art keywords
word
adjacent
sentence
attribute value
term vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710350902.1A
Other languages
Chinese (zh)
Inventor
张波
孟遥
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201710350902.1A
Publication of CN108959286A
Legal status: Pending

Abstract

The invention discloses an information extraction method and information extraction equipment. The information extraction method includes: performing noun phrase (NP) recognition and labeling on a sentence; converting each word in the sentence into a word vector; for a sentence having at least two NPs, inputting the word vectors and labels of the words near each word between two adjacent NPs into a relation extraction model, to determine whether the word between the two adjacent NPs is a relation word; and judging, according to the relation word determination result, whether a relation exists between the two adjacent NPs.

Description

Information extraction method and information extraction equipment
Technical field
The present invention relates generally to the technical field of information processing. Specifically, the invention relates to a method and equipment that can perform open information extraction without supervision, rapidly, and with good scalability.
Background art
With the development of information technology, the volume of data carrying information is growing geometrically. Extracting useful information from massive data is vital. Open information extraction aims to solve this problem. The corpus targeted by open information extraction is not fixed to any particular domain, and the extraction results generally take the form of triples {entity, relation, entity}.
There are three main traditional open information extraction techniques. The first depends on manual annotation and is therefore a supervised method. A classifier is trained on a manually annotated corpus, and the trained classifier can process a new corpus that contains the manually annotated content. The drawback of this method is that if the new corpus contains content that was not trained on (that is, not manually annotated), the method cannot extract the relevant information. For example, if the training corpus contains the manually annotated pattern "XX was born in YY", the trained classifier can only extract information of the type "XX was born in YY"; it cannot extract information expressed with a synonymous wording of "born in" or of the type "the birthday of XX is YY".
The second open information extraction method depends on manually formulated rules. Obviously, rules are applied rather mechanically and scale poorly. Moreover, the second method generally requires syntactic analysis, which slows processing.
The third open information extraction method draws on semantic information, for example from a syntactic parser. However, this greatly reduces processing speed.
Therefore, the present invention aims to propose a method and equipment that perform open information extraction without supervision, rapidly, and with good scalability.
Summary of the invention
A brief overview of the invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this overview is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present certain concepts in simplified form as a prelude to the more detailed description that follows.
The object of the present invention is to propose a method and equipment that can perform open information extraction without supervision, rapidly, and with good scalability.
To achieve the above object, according to one aspect of the invention, an information extraction method is provided. The method includes: performing noun phrase (NP) recognition and labeling on a sentence; converting each word in the sentence into a word vector; for a sentence having at least two NPs, inputting the word vectors and labels of the words near each word between two adjacent NPs into a relation extraction model, to determine whether the word between the two adjacent NPs is a relation word; and judging, according to the relation word determination result, whether a relation exists between the two adjacent NPs.
According to another aspect of the invention, information extraction equipment is provided. The equipment includes: a recognition and labeling device configured to perform noun phrase (NP) recognition and labeling on a sentence; a word vector conversion device configured to convert each word in the sentence into a word vector; a relation word determining device configured to, for a sentence having at least two NPs, input the word vectors and labels of the words near each word between two adjacent NPs into a relation extraction model, to determine whether the word between the two adjacent NPs is a relation word; and a relation judging device configured to judge, according to the relation word determination result, whether a relation exists between the two adjacent NPs.
In addition, according to another aspect of the invention, a storage medium is provided. The storage medium contains machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the above method according to the invention.
In addition, according to a further aspect of the invention, a program product is provided. The program product contains machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the above method according to the invention.
Brief description of the drawings
The above and other objects, features, and advantages of the invention will be more easily understood from the following description of embodiments of the invention with reference to the drawings. The components in the drawings are intended only to illustrate the principle of the invention. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals. In the drawings:
Fig. 1 shows a first method of training the relation extraction model;
Fig. 2 shows an example of structured information;
Fig. 3 shows a second method of training the relation extraction model;
Fig. 4 shows a flowchart of the information extraction method according to an embodiment of the invention;
Fig. 5 shows a structural block diagram of the information extraction equipment according to an embodiment of the invention;
Fig. 6 shows a schematic block diagram of a computer that can be used to implement the method and equipment according to embodiments of the invention.
Detailed description of the embodiments
Exemplary embodiments of the invention are described in detail below with reference to the drawings. For clarity and conciseness, not all features of an actual implementation are described in the specification. It should be understood, however, that many implementation-specific decisions must be made while developing any such actual implementation in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be understood that although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the invention with unnecessary detail, the drawings show only the device structures and/or processing steps closely related to the solution of the invention and omit other details of little relevance to the invention. In addition, elements and features described in one drawing or embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments.
The basic idea of the invention is as follows. Encyclopedias such as Baidupedia and Wikipedia, which contain many relational sentences and structured information, are used to obtain a large number of training sentences, so that the training corpus is obtained without supervision (manual annotation). Semantics are introduced through word vectors and a relation extraction model to improve scalability. Because neither syntactic analysis nor a syntactic parser is used, processing is fast.
The relation extraction model is introduced first.
The relation extraction model may be a sequence labeling model, a classification model, or the like. A typical example of a relation extraction model is the multilayer perceptron. A multilayer perceptron has at least three layers: an input layer, a hidden layer, and an output layer. Multilayer perceptrons are generally applied to labeling scenarios such as sequence labeling and part-of-speech tagging. In the following, the multilayer perceptron is used as the relation extraction model. However, other sequence labeling models, such as the linear-chain conditional random field (CRF), and classification models, such as the naive Bayes model, can also be used in the invention.
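The three-layer structure described above can be illustrated with a minimal forward pass. This is only a sketch under assumptions: the layer sizes, the tanh and softmax activations, the random untrained weights, and the names make_mlp and forward are illustrative choices and are not specified by the patent. The label set NP, REL, O follows the labels used later in the description.

```python
import math
import random

LABELS = ["NP", "REL", "O"]  # label names taken from the description


def make_mlp(n_in, n_hidden, n_out, seed=0):
    """Create random (untrained) weights for input -> hidden -> output."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W1": layer(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": layer(n_out, n_hidden), "b2": [0.0] * n_out}


def affine(W, b, x):
    """Compute W @ x + b with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(mlp, x):
    """Input layer -> tanh hidden layer -> softmax output layer."""
    hidden = [math.tanh(v) for v in affine(mlp["W1"], mlp["b1"], x)]
    return softmax(affine(mlp["W2"], mlp["b2"], hidden))


mlp = make_mlp(n_in=4, n_hidden=5, n_out=len(LABELS))
probs = forward(mlp, [0.1, -0.2, 0.3, 0.0])  # one probability per label
```

In practice the weights would be learned from the labeled corpus described below; the sketch only shows how an input feature vector flows through the three layers to a label distribution.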
Fig. 1 shows a first method of training the relation extraction model. As shown in Fig. 1, the method includes: obtaining an entity and an attribute value from structured information (step S1); in a relational sentence that contains the entity and the attribute value, labeling the entity and the attribute value as NP and labeling the words between them as relation words REL (step S2); and inputting the word vectors and labels of the words near each word labeled REL into the relation extraction model for training (step S3).
The relational sentences and structured information used by the invention can be obtained from sources that contain many relational sentences and structured information. Encyclopedias, for example Baidupedia or Wikipedia, are used as the example below. Encyclopedias contain many relational sentences that can serve as a training corpus, for example "Zhang Daqian was born on May 10th, 1899". Encyclopedias also contain a great deal of structured information, which is convenient for automatic extraction. Fig. 2 shows the structured information (InfoBox) about Zhang Daqian in Baidupedia.
In step S1, an entity and an attribute value are obtained from the structured information.
Because entities and attribute values are both noun phrases (NPs), and structured information has a highly regular structure, noun phrases can easily be extracted from the structured information as entities and attribute values, and it is possible to tell which noun phrase is the entity and which is the attribute value. Specifically, noun phrases can be extracted with existing tools to serve as entities and attribute values. Rules such as "noun + noun = noun phrase" can also be used to extract entities and attribute values. For example, the entity "Zhang Daqian" and the attribute value "May 10th, 1899" can be extracted from the structured information shown in Fig. 2.
In step S2, in a relational sentence that contains the above entity and attribute value, the entity and the attribute value are labeled as NP, and the words between the entity and the attribute value are labeled as relation words REL.
As described above, encyclopedias contain many relational sentences, such as "Zhang Daqian was born on May 10th, 1899". The entity "Zhang Daqian" and the attribute value "May 10th, 1899" are already known from step S1. Therefore, in the relational sentence "Zhang Daqian was born on May 10th, 1899", which contains the known entity and the known attribute value, the entity "Zhang Daqian" and the attribute value "May 10th, 1899" are labeled as NP, and the words "born" and "on" between them are labeled as relation words REL.
In addition, relational sentences in an encyclopedia may be repeated. When the encyclopedia contains multiple relational sentences with the same entity and the same attribute value, only the first relational sentence containing that entity and attribute value may be labeled in step S2.
Thus, by means of the encyclopedia, labeling is automatic and no manual annotation is needed. The method according to the invention is therefore unsupervised.
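The automatic labeling of step S2 can be sketched as follows. This is a hedged illustration, not the patent's implementation: it assumes sentences are already segmented into tokens, that the entity precedes the attribute value, and that tokens outside the two spans get the placeholder label O. The function names label_sentence and label_first_match are hypothetical, and the example tokens are a word-for-word English gloss.

```python
def label_sentence(tokens, entity, attribute_value):
    """Label entity and attribute-value tokens NP and the words between them REL.

    Returns None when the sentence does not contain the entity followed by
    the attribute value.
    """
    def find(span, start=0):
        n = len(span)
        for i in range(start, len(tokens) - n + 1):
            if tokens[i:i + n] == span:
                return i
        return -1

    e = find(entity)
    if e < 0:
        return None
    a = find(attribute_value, e + len(entity))
    if a < 0:
        return None

    labels = ["O"] * len(tokens)
    for i in range(e, e + len(entity)):
        labels[i] = "NP"
    for i in range(e + len(entity), a):
        labels[i] = "REL"
    for i in range(a, a + len(attribute_value)):
        labels[i] = "NP"
    return labels


def label_first_match(sentences, entity, attribute_value):
    """Label only the first sentence that contains both spans (duplicates are skipped)."""
    for tokens in sentences:
        labels = label_sentence(tokens, entity, attribute_value)
        if labels is not None:
            return tokens, labels
    return None


tokens = ["Zhang Daqian", "born", "on", "May 10th, 1899"]
labels = label_sentence(tokens, ["Zhang Daqian"], ["May 10th, 1899"])
```

Because the entity and attribute value come from the structured information, no human annotator is involved at any point in this sketch.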
In step S3, the word vectors and labels of the words near each word labeled REL are input into the relation extraction model for training.
For example, from the relational sentence "Zhang Daqian was born on May 10th, 1899", the following are extracted: the entity "Zhang Daqian" (labeled NP), the relation words "born" (labeled REL) and "on" (labeled REL), and the attribute value "May 10th, 1899" (labeled NP).
In this example, the words near the REL-labeled word "born" include "Zhang Daqian" and "on". The words near the REL-labeled word "on" include "born" and "1899". The range of nearby words can be set flexibly; for example, the words near the REL-labeled word "on" may also include "Zhang Daqian", "May", and so on.
The nearby words are converted into word vectors, and the word vectors and labels are all input into the relation extraction model for training. Once trained, the relation extraction model can judge from the words near a given word whether that word should be labeled REL, that is, whether it expresses a relation.
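One way to turn step S3 into model inputs is sketched below. The patent does not specify how vectors and labels are combined, so the concatenation of each neighbor's vector with a one-hot label, the zero padding at sentence edges, the window size, and the names context_features and training_examples are all assumptions made for illustration.

```python
LABELS = ["NP", "REL", "O"]


def one_hot(label):
    """Encode a label as a 0/1 list in the order of LABELS."""
    return [1.0 if label == name else 0.0 for name in LABELS]


def context_features(vectors, labels, index, window=1):
    """Concatenate (vector + one-hot label) of each neighbor of position index.

    Positions outside the sentence are padded with zeros.
    """
    dim = len(vectors[0])
    features = []
    for j in range(index - window, index + window + 1):
        if j == index:
            continue  # only the nearby words are used, not the word itself
        if 0 <= j < len(vectors):
            features += list(vectors[j]) + one_hot(labels[j])
        else:
            features += [0.0] * (dim + len(LABELS))
    return features


def training_examples(vectors, labels, window=1):
    """Build one (features, target) pair per word labeled REL."""
    return [(context_features(vectors, labels, i, window), label)
            for i, label in enumerate(labels) if label == "REL"]


# Toy 2-dimensional vectors for a four-word sentence labeled NP REL REL NP.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
labels = ["NP", "REL", "REL", "NP"]
examples = training_examples(vectors, labels)
```

Widening the window parameter corresponds to the flexible range of nearby words mentioned above.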
A conversion model can be used to convert words into word vectors. The conversion model is obtained by training on relational sentences. The relational sentences used to train the conversion model are not limited to those used to train the relation extraction model. For example, the relational sentences used to train the relation extraction model may involve only one wording of "born in", while those used to train the conversion model involve that wording, a synonymous wording of "born in", and "the birthday is". Word vectors introduce semantic information: when a word is converted into a word vector, its meaning is inferred from the words near it and a vector is assigned accordingly, so each element of the word vector carries some semantic information. The word vectors of semantically similar words are close to each other in the vector space, because the words near them are similar. For example, the word vectors of the two wordings of "born in" and of "the birthday is" are close to each other in the vector space. Semantic information favors extension. In the above example, the synonymous wording of "born in" in a later sentence such as "Qi Baishi was born on January 1st, 1864" can be converted into word vectors by the conversion model, and the relation extraction model can recognize that "born" and "on" should be labeled REL, because the word vector of this "born" is highly similar to the word vector of the "born" seen in training.
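The closeness of semantically similar word vectors is commonly measured with cosine similarity. The patent does not name a similarity measure, so cosine similarity is an assumption here, and the three-dimensional vectors below are made up purely for illustration; real vectors would come from the trained conversion model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy vectors: two wordings of "born" and one unrelated word.
toy_vectors = {
    "born (wording A)": [0.9, 0.2, 0.1],
    "born (wording B)": [0.8, 0.3, 0.1],
    "painted": [0.1, 0.2, 0.9],
}

sim_synonyms = cosine(toy_vectors["born (wording A)"], toy_vectors["born (wording B)"])
sim_unrelated = cosine(toy_vectors["born (wording A)"], toy_vectors["painted"])
```

With these toy values the two "born" wordings score much higher than the unrelated pair, which is the property that lets a model trained on one wording generalize to the other.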
In the second method of training the relation extraction model, shown in Fig. 3, steps S31 and S32 are identical to steps S1 and S2 of the first method shown in Fig. 1, respectively; the difference is that step S3 is replaced with step S33.
In step S33, the labeled relational sentence, in word vector form, is input into the relation extraction model for training.
With this method, the relation extraction model can perform mixed recognition. For example, suppose that in the relational sentences one word REL1 labeled REL and another word REL2 labeled REL are not adjacent in the sentence. With the method shown in Fig. 1, the relationship between REL1 and REL2 cannot be learned. When the relation extraction model is then used to label a corpus in which REL1 and REL2 are adjacent, it cannot judge correctly. If the relation extraction model is trained with the method shown in Fig. 3, it grasps the label information of the whole sentence and can therefore correctly judge the adjacent relation words REL1 and REL2 in this situation.
As an improvement to the methods shown in Fig. 1 and Fig. 3, a further step can be added: performing NP recognition and labeling on relational sentences that do not contain the above entity and attribute value. These labeled relational sentences, in word vector form, are then input into the relation extraction model for training. Note that words that are neither NP nor REL are also labeled, for example as O. The purpose is to let the relation extraction model learn negative examples as well as positive examples.
The relation extraction model trained as above can be used in the information extraction method according to the invention.
The flow of the information extraction method according to an embodiment of the invention is described below with reference to Fig. 4.
Fig. 4 shows a flowchart of the information extraction method according to an embodiment of the invention. As shown in Fig. 4, the information extraction method includes the following steps: performing noun phrase (NP) recognition and labeling on a sentence (step S41); converting each word in the sentence into a word vector (step S42); for a sentence having at least two NPs, inputting the word vectors and labels of the words near each word between two adjacent NPs into the relation extraction model, to determine whether the word between the two adjacent NPs is a relation word (step S43); and judging, according to the relation word determination result, whether a relation exists between the two adjacent NPs (step S44).
In step S41, noun phrase (NP) recognition and labeling are performed on the sentence.
As described above, noun phrases can be extracted with existing tools. Rules such as "noun + noun = noun phrase" can also be used to extract noun phrases.
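The "noun + noun = noun phrase" rule can be sketched as a simple grouping of consecutive nouns. The sketch assumes the sentence has already been part-of-speech tagged by some existing tool and that nouns carry the tag NOUN; treating a single noun as a noun phrase is also an assumption, and chunk_noun_phrases is a hypothetical name.

```python
def chunk_noun_phrases(tagged):
    """Group maximal runs of consecutive nouns into noun phrases.

    tagged is a list of (word, pos) pairs. Returns (start, end) token
    index spans with end exclusive.
    """
    spans = []
    start = None
    for i, (_, pos) in enumerate(tagged):
        if pos == "NOUN":
            if start is None:
                start = i  # a new run of nouns begins here
        elif start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(tagged)))
    return spans


tagged = [("information", "NOUN"), ("extraction", "NOUN"), ("uses", "VERB"),
          ("word", "NOUN"), ("vectors", "NOUN")]
np_spans = chunk_noun_phrases(tagged)
```

The returned spans identify the NPs; sentences with fewer than two spans would be skipped by the later steps.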
In step S42, each word in the sentence is converted into a word vector.
As described above, a conversion model can be used to convert each word in the sentence into a word vector; the conversion model is obtained by training on relational sentences.
In step S43, for a sentence having at least two NPs, the word vectors and labels of the words near each word between two adjacent NPs are input into the relation extraction model, to determine whether the word between the two adjacent NPs is a relation word.
Because the result of open information extraction takes the form of a triple {entity, relation, entity}, a relation involves at least two NPs. The words between NPs may be relation words REL. Because the training corpus includes at least the words (word vectors and labels) near the words labeled REL, what is input into the relation extraction model at application time is the words (word vectors and labels) near each word between two adjacent NPs. Of course, the entire labeled sentence can also be input here. The relation extraction model scans the input and judges whether the current word should be labeled REL according to the sum of the word vectors of the words near the current word within the scanning window.
The result returned by the relation extraction model indicates whether each word between two adjacent NPs should be labeled REL, that is, whether the word between the two adjacent NPs is a relation word.
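The scanning in step S43 can be sketched as follows. The trained relation extraction model is replaced by a stand-in callable, is_relation_word, which is purely hypothetical; the sketch also uses only the summed word vectors and leaves out the labels for brevity. The window size and the function names are assumptions.

```python
def window_sum(vectors, index, window=1):
    """Element-wise sum of the vectors of the words around position index."""
    total = [0.0] * len(vectors[0])
    lo = max(0, index - window)
    hi = min(len(vectors), index + window + 1)
    for j in range(lo, hi):
        if j != index:  # only the nearby words contribute
            total = [t + x for t, x in zip(total, vectors[j])]
    return total


def scan_between(vectors, np_spans, is_relation_word, window=1):
    """For each pair of adjacent NP spans, flag every word between them.

    np_spans holds (start, end) spans with end exclusive, in sentence order.
    Returns one list of booleans per adjacent pair.
    """
    results = []
    for (_, left_end), (right_start, _) in zip(np_spans, np_spans[1:]):
        flags = [bool(is_relation_word(window_sum(vectors, i, window)))
                 for i in range(left_end, right_start)]
        results.append(flags)
    return results


# Toy sentence: NP, word, word, NP with 2-dimensional vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
np_spans = [(0, 1), (3, 4)]
flags_per_pair = scan_between(vectors, np_spans, lambda s: s[0] > 1.5)
```

In a real system the lambda would be replaced by the trained model's prediction for the REL label.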
In step S44, whether a relation exists between the two adjacent NPs is judged according to the relation word determination result.
Specifically, if the proportion of relation words among the words between the two adjacent NPs is greater than a predetermined threshold, it is judged that a relation exists between the two adjacent NPs; if the proportion is less than or equal to the predetermined threshold, it is judged that no relation exists between the two adjacent NPs. The predetermined threshold is specified by those skilled in the art. For example, if there are 10 words between two adjacent NPs and only 2 of them are labeled REL, the two NPs probably have no relation. If there are 3 words between two adjacent NPs and 2 of them are labeled REL, the two NPs probably have a relation.
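The threshold test of step S44 reduces to a ratio comparison. In the sketch below the threshold value 0.5 is an arbitrary assumption (the patent leaves it to those skilled in the art), as is returning False when there are no words between the two NPs.

```python
def has_relation(flags, threshold=0.5):
    """Judge whether a relation exists from per-word REL flags.

    flags holds one boolean per word between the two adjacent NPs,
    True meaning the word was labeled REL.
    """
    if not flags:
        return False  # assumption: no words between the NPs means no relation
    ratio = sum(flags) / len(flags)
    return ratio > threshold


ten_words = [True, True] + [False] * 8   # 2 of 10 words labeled REL
three_words = [True, True, False]        # 2 of 3 words labeled REL
```

With this threshold, the two examples from the text come out as described: the 10-word case is rejected and the 3-word case is accepted.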
When it is judged that a relation exists between two adjacent NPs, the triple {the first of the two adjacent NPs; the words between the two adjacent NPs; the second of the two adjacent NPs} is extracted, constituting the information extraction result {entity, relation, entity}.
Alternatively, when it is judged that a relation exists between two adjacent NPs, the triple {the first of the two adjacent NPs; the relation words among the words between the two adjacent NPs; the second of the two adjacent NPs} is extracted, constituting the information extraction result {entity, relation, entity}.
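Both triple variants can be sketched with one function. The name build_triple, the relation_words_only switch, and joining tokens with spaces are illustrative assumptions; the example tokens are a word-for-word English gloss.

```python
def build_triple(tokens, left_span, right_span, flags, relation_words_only=False):
    """Build {entity, relation, entity} from two adjacent NP spans.

    Spans are (start, end) with end exclusive. flags holds one boolean per
    word between the spans. With relation_words_only=True, only the words
    flagged as relation words form the relation part.
    """
    between = tokens[left_span[1]:right_span[0]]
    if relation_words_only:
        between = [word for word, is_rel in zip(between, flags) if is_rel]
    first = " ".join(tokens[left_span[0]:left_span[1]])
    second = " ".join(tokens[right_span[0]:right_span[1]])
    return (first, " ".join(between), second)


tokens = ["Zhang Daqian", "born", "on", "May 10th, 1899"]
all_words = build_triple(tokens, (0, 1), (3, 4), [True, True])
rel_only = build_triple(tokens, (0, 1), (3, 4), [True, False], relation_words_only=True)
```

The first variant keeps every word between the NPs, while the second drops words the model did not flag as relation words.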
Next, the information extraction equipment according to an embodiment of the invention is described with reference to Fig. 5.
Fig. 5 shows a structural block diagram of the information extraction equipment according to an embodiment of the invention. As shown in Fig. 5, the information extraction equipment 500 according to the invention includes: a recognition and labeling device 51 configured to perform noun phrase (NP) recognition and labeling on a sentence; a word vector conversion device 52 configured to convert each word in the sentence into a word vector; a relation word determining device 53 configured to, for a sentence having at least two NPs, input the word vectors and labels of the words near each word between two adjacent NPs into a relation extraction model, to determine whether the word between the two adjacent NPs is a relation word; and a relation judging device 54 configured to judge, according to the relation word determination result, whether a relation exists between the two adjacent NPs.
In one embodiment, the word vector conversion device 52 includes a conversion model for converting each word in the sentence into a word vector; the conversion model is obtained by training on relational sentences.
In one embodiment, the information extraction equipment 500 further includes a relation extraction model training device configured to: obtain an entity and an attribute value from structured information; in a relational sentence that contains the entity and the attribute value, label the entity and the attribute value as NP and label the words between them as relation words REL; and input the word vectors and labels of the words near each word labeled REL into the relation extraction model for training.
In one embodiment, the information extraction equipment 500 further includes a relation extraction model training device configured to: obtain an entity and an attribute value from structured information; in a relational sentence that contains the entity and the attribute value, label the entity and the attribute value as NP and label the words between them as relation words REL; and input the labeled relational sentence, in word vector form, into the relation extraction model for training.
In one embodiment, the relation extraction model training device is further configured to: perform NP recognition and labeling on relational sentences that do not contain the above entity and attribute value; and input these labeled relational sentences, in word vector form, into the relation extraction model for training.
In one embodiment, the relation judging device 54 is further configured to: judge that a relation exists between the two adjacent NPs if the proportion of relation words among the words between them is greater than a predetermined threshold; and judge that no relation exists between the two adjacent NPs if the proportion is less than or equal to the predetermined threshold.
In one embodiment, the information extraction equipment 500 further includes a result generating device configured to: when it is judged that a relation exists between two adjacent NPs, extract the triple {the first of the two adjacent NPs; the words between the two adjacent NPs; the second of the two adjacent NPs}.
In one embodiment, the information extraction equipment 500 further includes a result generating device configured to: when it is judged that a relation exists between two adjacent NPs, extract the triple {the first of the two adjacent NPs; the relation words among the words between the two adjacent NPs; the second of the two adjacent NPs}.
In one embodiment, the relational sentences and structured information are obtained from an encyclopedia, and the relational sentence containing the above entity and attribute value is the first relational sentence in the encyclopedia that contains the entity and the attribute value.
In one embodiment, the relation extraction model includes a sequence labeling model or a classification model.
Because the processing in the devices and units included in the information extraction equipment 500 according to the invention is similar to the processing in the corresponding steps of the information extraction method described above, a detailed description of these devices and units is omitted here for brevity.
It should also be pointed out here that each constituent device and unit in the above equipment can be configured by software, firmware, hardware, or a combination thereof. The specific means or manner of configuration are well known to those skilled in the art and are not repeated here. When implemented by software or firmware, the program constituting the software is installed from a storage medium or a network onto a computer with a dedicated hardware structure (for example, the general-purpose computer 600 shown in Fig. 6), and the computer, when various programs are installed, is able to perform various functions.
Fig. 6 shows a schematic block diagram of a computer that can be used to implement the method and equipment according to embodiments of the invention.
In Fig. 6, a central processing unit (CPU) 601 performs various processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, the data required when the CPU 601 performs the various processing. The CPU 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are connected to the input/output interface 605: an input section 606 (including a keyboard, a mouse, etc.), an output section 607 (including a display such as a cathode-ray tube (CRT) or liquid crystal display (LCD), a loudspeaker, etc.), a storage section 608 (including a hard disk, etc.), and a communication section 609 (including a network interface card such as a LAN card, a modem, etc.). The communication section 609 performs communication processing via a network such as the Internet. A drive 610 can also be connected to the input/output interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory is mounted on the drive 610 as needed, so that a computer program read from it is installed into the storage section 608 as needed.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the equipment in order to provide the program to the user. Examples of the removable medium 611 include magnetic disks (including floppy disks (registered trademark)), optical discs (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)), and semiconductor memories. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, which stores the program and is distributed to the user together with the equipment containing it.
The present invention also proposes a kind of program product of instruction code for being stored with machine-readable.Described instruction code is by machine When device reads and executes, method that above-mentioned embodiment according to the present invention can be performed.
Correspondingly, it is also wrapped for carrying the storage medium of the program product of the above-mentioned instruction code for being stored with machine-readable It includes in disclosure of the invention.The storage medium includes but is not limited to floppy disk, CD, magneto-optic disk, storage card, memory stick etc. Deng.
In the description above to the specific embodiment of the invention, for the feature a kind of embodiment description and/or shown It can be used in one or more other embodiments in a manner of same or similar, with the feature in other embodiment It is combined, or the feature in substitution other embodiment.
It should be emphasized that term "comprises/comprising" refers to the presence of feature, element, step or component when using herein, but simultaneously It is not excluded for the presence or additional of one or more other features, element, step or component.
In addition, method of the invention be not limited to specifications described in time sequencing execute, can also according to it His time sequencing, concurrently or independently execute.Therefore, the execution sequence of method described in this specification is not to this hair Bright technical scope is construed as limiting.
Although the present invention has been disclosed above through the description of its specific embodiments, it should be understood that all of the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the present invention.
Notes
1. An information extraction method, comprising:
performing noun phrase (NP) identification and labeling on a sentence;
converting each word in the sentence into a word vector;
for a sentence having at least two NPs, inputting the word vectors and labels of the words near a word between two adjacent NPs into a relation extraction model, to determine whether the word between the two adjacent NPs is a relation word; and
judging, according to the relation word determination result, whether a relation exists between the two adjacent NPs.
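As a rough illustration of the third step of note 1, the following Python sketch gathers, for each word lying between two adjacent NPs, the word vectors and labels of the words in a small window around it. The window radius, the "O" and "PAD" labels, the zero padding vector, and the assumption that each NP has already been merged into a single token are illustrative choices, not details given in the text. The relation extraction model itself is not shown.

```python
from typing import List, Tuple

Vector = List[float]


def between_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of words between adjacent NPs."""
    np_positions = [i for i, label in enumerate(labels) if label == "NP"]
    spans = []
    for left, right in zip(np_positions, np_positions[1:]):
        if right - left > 1:  # skip NP pairs with no word between them
            spans.append((left + 1, right))
    return spans


def window_features(
    vectors: List[Vector], labels: List[str], index: int, radius: int = 1
) -> List[Tuple[Vector, str]]:
    """Collect (word vector, label) pairs for the words around position `index`.

    Positions outside the sentence are filled with a zero vector and the
    hypothetical label "PAD".
    """
    pad = [0.0] * len(vectors[0])
    features = []
    for j in range(index - radius, index + radius + 1):
        if 0 <= j < len(vectors):
            features.append((vectors[j], labels[j]))
        else:
            features.append((pad, "PAD"))
    return features


# Toy sentence: two NPs with three words between them (made-up 2-d vectors).
labels = ["NP", "O", "O", "O", "NP"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
for start, end in between_spans(labels):
    for i in range(start, end):
        # In a full system this would be passed to the relation extraction model.
        model_input = window_features(vectors, labels, i)
```

Each `model_input` holds the neighbourhood of one candidate word; a sequence labeling or classification model would then decide whether that word is a relation word.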
2. The method as described in note 1, wherein each word in the sentence is converted into a word vector using a conversion model, the conversion model being obtained by training on relational sentences.
3. The method as described in note 1, wherein the relation extraction model is obtained by training as follows:
obtaining an entity and an attribute value from structured information;
in a relational sentence containing the entity and the attribute value, labeling the entity and the attribute value as NP, and labeling the words between the entity and the attribute value as relation words (REL);
inputting the word vectors and labels of the words near each word labeled REL into the relation extraction model for training.
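The labeling step of note 3 can be sketched in Python as follows. This is a minimal illustration under several assumptions: the sentence is already tokenized, the entity and the attribute value each occupy exactly one token, the first occurrence of each is used, and the label "O" for all remaining words is a hypothetical choice. The example sentence and names are made up.

```python
from typing import List, Optional


def label_relational_sentence(
    tokens: List[str], entity: str, attribute_value: str
) -> Optional[List[str]]:
    """Label the entity and attribute value as NP and the words between them as REL.

    Returns None when the sentence does not contain both items, in which case
    it cannot be labeled this way.
    """
    if entity not in tokens or attribute_value not in tokens:
        return None
    e = tokens.index(entity)
    a = tokens.index(attribute_value)
    low, high = sorted((e, a))
    labels = ["O"] * len(tokens)
    labels[e] = "NP"
    labels[a] = "NP"
    for i in range(low + 1, high):
        labels[i] = "REL"
    return labels


# Made-up structured record (entity, attribute value) and matching sentence.
tokens = ["Acme", "is", "based", "in", "Springfield", "."]
labels = label_relational_sentence(tokens, "Acme", "Springfield")
```

The resulting label sequence, paired with the word vectors of the same tokens, would form one training example for the relation extraction model.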
4. The method as described in note 1, wherein the relation extraction model is obtained by training as follows:
obtaining an entity and an attribute value from structured information;
in a relational sentence containing the entity and the attribute value, labeling the entity and the attribute value as NP, and labeling the words between the entity and the attribute value as relation words (REL);
inputting the labeled relational sentence, in word vector form, into the relation extraction model for training.
5. The method as described in note 3 or 4, further comprising:
performing NP identification and labeling on relational sentences that do not contain the above entity and attribute value;
inputting the labeled relational sentences, in word vector form, into the relation extraction model for training.
6. The method as described in note 1, wherein judging, according to the relation word determination result, whether a relation exists between the two adjacent NPs comprises:
if the proportion of relation words among the words between the two adjacent NPs is greater than a predetermined threshold, judging that a relation exists between the two adjacent NPs;
if the proportion of relation words among the words between the two adjacent NPs is less than or equal to the predetermined threshold, judging that no relation exists between the two adjacent NPs.
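The threshold rule of note 6 amounts to a few lines of Python. In this sketch the label "REL" marks a word that the model judged to be a relation word, the default threshold of 0.5 is an arbitrary placeholder, and returning False when there are no words between the two NPs is an assumption, since the text does not address that case.

```python
from typing import List


def has_relation(between_labels: List[str], threshold: float = 0.5) -> bool:
    """Judge whether a relation exists between two adjacent NPs.

    `between_labels` holds the model's decision for each word between the
    NPs; a relation exists only if the proportion of "REL" labels is
    strictly greater than the threshold.
    """
    if not between_labels:  # assumption: no words between the NPs means no relation
        return False
    ratio = sum(1 for label in between_labels if label == "REL") / len(between_labels)
    return ratio > threshold


print(has_relation(["REL", "REL", "O"]))  # ratio 2/3 is above 0.5
print(has_relation(["REL", "O"]))         # ratio 1/2 is not above 0.5
```

Note that a ratio exactly equal to the threshold yields "no relation", matching the "less than or equal to" branch above.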
7. The method as described in note 1, further comprising:
when it is judged that a relation exists between the two adjacent NPs, extracting the triple {the first of the two adjacent NPs; the words between the two adjacent NPs; the second of the two adjacent NPs}.
8. The method as described in note 1, further comprising:
when it is judged that a relation exists between the two adjacent NPs, extracting the triple {the first of the two adjacent NPs; the relation words among the words between the two adjacent NPs; the second of the two adjacent NPs}.
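The two triple forms in notes 7 and 8 differ only in whether the middle element keeps every word between the NPs or only the relation words. A small Python sketch follows; the function name, the `relation_words_only` flag, and joining the words with spaces are illustrative assumptions (for unsegmented languages a different joiner would be appropriate), and the example words are made up.

```python
from typing import List, Tuple


def extract_triple(
    first_np: str,
    between_words: List[str],
    between_labels: List[str],
    second_np: str,
    relation_words_only: bool = False,
) -> Tuple[str, str, str]:
    """Build a (first NP, middle, second NP) triple.

    With relation_words_only=False the middle is all words between the NPs
    (note 7); with True it keeps only the words labeled "REL" (note 8).
    """
    if relation_words_only:
        middle = [w for w, label in zip(between_words, between_labels) if label == "REL"]
    else:
        middle = list(between_words)
    return (first_np, " ".join(middle), second_np)


words = ["is", "reportedly", "based", "in"]
labels = ["REL", "O", "REL", "REL"]
full = extract_triple("Acme", words, labels, "Springfield")
filtered = extract_triple("Acme", words, labels, "Springfield", relation_words_only=True)
```

This function would only be called for NP pairs that the threshold judgment has already accepted as related.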
9. The method as described in note 3, wherein the relational sentences and the structured information are obtained from an encyclopedia, and the relational sentence containing the above entity and attribute value is the first relational sentence in the encyclopedia that contains the above entity and attribute value.
10. The method as described in note 1, wherein the relation extraction model includes a sequence labeling model and a classification model.
11. Information extraction equipment, comprising:
an identification and labeling device, configured to perform noun phrase (NP) identification and labeling on a sentence;
a word vector conversion device, configured to convert each word in the sentence into a word vector;
a relation word determination device, configured to, for a sentence having at least two NPs, input the word vectors and labels of the words near a word between two adjacent NPs into a relation extraction model, to determine whether the word between the two adjacent NPs is a relation word; and
a relation judgment device, configured to judge, according to the relation word determination result, whether a relation exists between the two adjacent NPs.
12. The equipment as described in note 11, wherein the word vector conversion device includes a conversion model for converting each word in the sentence into a word vector, the conversion model being obtained by training on relational sentences.
13. The equipment as described in note 11, further comprising a relation extraction model training device, configured to:
obtain an entity and an attribute value from structured information;
in a relational sentence containing the entity and the attribute value, label the entity and the attribute value as NP, and label the words between the entity and the attribute value as relation words (REL);
input the word vectors and labels of the words near each word labeled REL into the relation extraction model for training.
14. The equipment as described in note 11, further comprising a relation extraction model training device, configured to:
obtain an entity and an attribute value from structured information;
in a relational sentence containing the entity and the attribute value, label the entity and the attribute value as NP, and label the words between the entity and the attribute value as relation words (REL);
input the labeled relational sentence, in word vector form, into the relation extraction model for training.
15. The equipment as described in note 13 or 14, wherein the relation extraction model training device is further configured to:
perform NP identification and labeling on relational sentences that do not contain the above entity and attribute value;
input the labeled relational sentences, in word vector form, into the relation extraction model for training.
16. The equipment as described in note 11, wherein the relation judgment device is further configured to:
if the proportion of relation words among the words between the two adjacent NPs is greater than a predetermined threshold, judge that a relation exists between the two adjacent NPs;
if the proportion of relation words among the words between the two adjacent NPs is less than or equal to the predetermined threshold, judge that no relation exists between the two adjacent NPs.
17. The equipment as described in note 11, further comprising a result generation device, configured to:
when it is judged that a relation exists between the two adjacent NPs, extract the triple {the first of the two adjacent NPs; the words between the two adjacent NPs; the second of the two adjacent NPs}.
18. The equipment as described in note 11, further comprising a result generation device, configured to:
when it is judged that a relation exists between the two adjacent NPs, extract the triple {the first of the two adjacent NPs; the relation words among the words between the two adjacent NPs; the second of the two adjacent NPs}.
19. The equipment as described in note 13, wherein the relational sentences and the structured information are obtained from an encyclopedia, and the relational sentence containing the above entity and attribute value is the first relational sentence in the encyclopedia that contains the above entity and attribute value.
20. The equipment as described in note 11, wherein the relation extraction model includes a sequence labeling model and a classification model.

Claims (10)

1. An information extraction method, comprising:
performing noun phrase (NP) identification and labeling on a sentence;
converting each word in the sentence into a word vector;
for a sentence having at least two NPs, inputting the word vectors and labels of the words near a word between two adjacent NPs into a relation extraction model, to determine whether the word between the two adjacent NPs is a relation word; and
judging, according to the relation word determination result, whether a relation exists between the two adjacent NPs.
2. The method of claim 1, wherein each word in the sentence is converted into a word vector using a conversion model, the conversion model being obtained by training on relational sentences.
3. The method of claim 1, wherein the relation extraction model is obtained by training as follows:
obtaining an entity and an attribute value from structured information;
in a relational sentence containing the entity and the attribute value, labeling the entity and the attribute value as NP, and labeling the words between the entity and the attribute value as relation words (REL);
inputting the word vectors and labels of the words near each word labeled REL into the relation extraction model for training.
4. The method of claim 1, wherein the relation extraction model is obtained by training as follows:
obtaining an entity and an attribute value from structured information;
in a relational sentence containing the entity and the attribute value, labeling the entity and the attribute value as NP, and labeling the words between the entity and the attribute value as relation words (REL);
inputting the labeled relational sentence, in word vector form, into the relation extraction model for training.
5. The method of claim 3 or 4, further comprising:
performing NP identification and labeling on relational sentences that do not contain the above entity and attribute value;
inputting the labeled relational sentences, in word vector form, into the relation extraction model for training.
6. The method of claim 1, wherein judging, according to the relation word determination result, whether a relation exists between the two adjacent NPs comprises:
if the proportion of relation words among the words between the two adjacent NPs is greater than a predetermined threshold, judging that a relation exists between the two adjacent NPs;
if the proportion of relation words among the words between the two adjacent NPs is less than or equal to the predetermined threshold, judging that no relation exists between the two adjacent NPs.
7. The method of claim 1, further comprising:
when it is judged that a relation exists between the two adjacent NPs, extracting the triple {the first of the two adjacent NPs; the words between the two adjacent NPs; the second of the two adjacent NPs}.
8. The method of claim 1, further comprising:
when it is judged that a relation exists between the two adjacent NPs, extracting the triple {the first of the two adjacent NPs; the relation words among the words between the two adjacent NPs; the second of the two adjacent NPs}.
9. The method of claim 3 or 4, wherein the relational sentences and the structured information are obtained from an encyclopedia.
10. Information extraction equipment, comprising:
an identification and labeling device, configured to perform noun phrase (NP) identification and labeling on a sentence;
a word vector conversion device, configured to convert each word in the sentence into a word vector;
a relation word determination device, configured to, for a sentence having at least two NPs, input the word vectors and labels of the words near a word between two adjacent NPs into a relation extraction model, to determine whether the word between the two adjacent NPs is a relation word; and
a relation judgment device, configured to judge, according to the relation word determination result, whether a relation exists between the two adjacent NPs.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710350902.1A CN108959286A (en) 2017-05-17 2017-05-17 Information extraction method and information extraction equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710350902.1A CN108959286A (en) 2017-05-17 2017-05-17 Information extraction method and information extraction equipment

Publications (1)

Publication Number Publication Date
CN108959286A true CN108959286A (en) 2018-12-07

Family

ID=64462756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710350902.1A Pending CN108959286A (en) 2017-05-17 2017-05-17 Information extraction method and information extraction equipment

Country Status (1)

Country Link
CN (1) CN108959286A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051365A (en) * 2020-12-10 2021-06-29 深圳证券信息有限公司 Industrial chain map construction method and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101446944A (en) * 2008-12-10 2009-06-03 苏州大学 Method for constructing and comparing semantic relation tree for natural language sentences
CN102207936A (en) * 2010-03-30 2011-10-05 国际商业机器公司 Method and system for indicating content change of electronic document
US20150261850A1 (en) * 2014-03-17 2015-09-17 NLPCore LLC Corpus search systems and methods
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network

Similar Documents

Publication Publication Date Title
Zhang et al. Automated information transformation for automated regulatory compliance checking in construction
CN106776936B (en) Intelligent interaction method and system
CN102262632B (en) Method and system for processing text
Nguyen et al. Supervised machine learning and active learning in classification of radiology reports
CN101236609B (en) Apparatus and method for analyzing and determining correlation of information in a document
CN107315737A (en) A kind of semantic logic processing method and system
US9224103B1 (en) Automatic annotation for training and evaluation of semantic analysis engines
CN105095665A (en) Natural language processing method and system for Chinese disease diagnosis information
CN110008309A (en) A kind of short phrase picking method and device
CN105138829A (en) Natural language processing method and system for Chinese diagnosis and treatment information
Loureiro et al. Medlinker: Medical entity linking with neural representations and dictionary matching
CN114997288A (en) Design resource association method
Zhang et al. Sequence-to-sequence pre-training with data augmentation for sentence rewriting
Kim et al. Automatic annotation of bibliographical references in digital humanities books, articles and blogs
CN111091009A (en) Document association auditing method based on semantic analysis
CN109190112B (en) Patent classification method, system and storage medium based on dual-channel feature fusion
D’Abrera et al. A formally verified cut-elimination procedure for linear nested sequents for tense logic
CN108959286A (en) Information extraction method and information extraction equipment
Aumiller et al. Online dateing: a web interface for temporal annotations
CN116127979B (en) Named entity name standardization method and device, electronic equipment and storage medium
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
Luca et al. Mapping across relational domains for transfer learning with word embeddings-based similarity
Hao et al. Product named entity recognition for Chinese query questions based on a skip-chain CRF model
Anthonysamy et al. Inferring semantic mapping between policies and code: the clue is in the language
Hokamp Deep interactive text prediction and quality estimation in translation interfaces

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181207
