CN109062894A - The automatic identification algorithm of Chinese natural language Entity Semantics relationship - Google Patents

Publication number
CN109062894A
Authority
CN
China
Prior art keywords
entity
text
relationship
natural language
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810796558.3A
Other languages
Chinese (zh)
Inventor
于立洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Yuanchenyuyi Software Technology Co Ltd
Original Assignee
Nanjing Yuanchenyuyi Software Technology Co Ltd
Application filed by Nanjing Yuanchenyuyi Software Technology Co Ltd filed Critical Nanjing Yuanchenyuyi Software Technology Co Ltd
Priority to CN201810796558.3A priority Critical patent/CN109062894A/en
Publication of CN109062894A publication Critical patent/CN109062894A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an algorithm for automatically recognizing semantic relations between entities in Chinese natural language. Training texts for "entity relationship" are first extracted from the input raw natural language text and stored in an "entity relationship" training text library. Texts are then read from this library; for each text, the entity set is extracted, related entity pairs are identified, and "entity relationship" sentences are constructed and stored in a training "entity relationship" sentence library. Each sentence in that library is manually annotated, and machine learning is applied to the annotated library to build the "entity relationship" recognition model. The invention also proposes an algorithm that uses this recognition algorithm to generate "entity relationship" triples from a given Chinese natural language text. Because relations between entities are recognized and constructed by machine learning, the invention overcomes the limitation that Chinese knowledge graphs can only draw on structured data.

Description

The automatic identification algorithm of Chinese natural language Entity Semantics relationship
Technical field
The invention belongs to the technical field of natural language recognition and machine learning, and in particular relates to an automatic recognition algorithm for entity semantic relations in Chinese natural language.
Background art
In recent years, with the development of the internet, online content has grown explosively. Because internet content is large in scale, heterogeneous and diverse, and loosely organized, obtaining information and knowledge effectively has become a challenge. With its strong semantic processing capability and open organization, the knowledge graph (Knowledge Graph) lays a foundation for knowledge-based organization and intelligent applications in the internet era.
Specifically, a knowledge graph aims to describe the entities (concepts) that exist in the real world and the relations among them, forming a huge semantic network graph in which nodes represent entities (concepts) and edges represent attributes or relations. The term knowledge graph is now used to refer to various large-scale knowledge bases.
Building large-scale knowledge graphs, as the starting point of knowledge graph work, has attracted considerable attention in academia and industry. Knowledge extraction is the first step of knowledge graph construction, and it usually requires extracting knowledge elements such as entities, relations, and attributes from publicly available unstructured text.
In building Chinese knowledge graphs, unstructured text usually takes the form of Chinese natural language text, so Chinese natural language understanding becomes an important tool for constructing Chinese knowledge graphs. Many results have been achieved in this area: automatic word segmentation, part-of-speech tagging, syntactic analysis, and entity extraction for Chinese are all supported by many software packages at home and abroad. Although these techniques have greatly strengthened the construction of Chinese knowledge graphs, recognizing the relations between entities remains an unsolved key problem in Chinese natural language understanding, and it is the key technology holding back Chinese knowledge graph construction.
To understand this key technology, it is first necessary to understand the concept of an entity in a knowledge graph. An entity can be something that actually exists, such as a person, a book, or a building. An entity can also be an abstract concept, such as Marxism. Chinese natural language processing tools can recognize entities in Chinese natural language text, including persons, times, places, and organizations. However, these tools cannot recognize the relations between the entities, and recognizing those relations is the key step in building a Chinese knowledge graph.
For example, from a Chinese natural language text, a natural language processing tool may pick out the two entities "US Open tennis championship" (event) and "New York" (place), yet it cannot tell how the two are associated. In fact, the US Open is held in New York. As another example, a Chinese natural language tool may recognize that Roger Federer is a person's name and that Shanghai is the name of a city, but it cannot recognize the relation between Roger Federer and Shanghai. In fact, the relation is that Roger Federer came to Shanghai to take part in the annual Masters Cup tennis tournament. So far, Chinese natural language understanding has not been able to identify such relations, even though they are exactly what matters most for building a knowledge graph.
In short, because the relations between entities cannot be recognized, application systems built on such a knowledge graph, such as artificial intelligence and automatic question answering systems, are severely constrained. If a user asks "In which cities has Roger Federer played?", the knowledge graph cannot answer: although it can tell that Roger Federer is related to Shanghai and New York, it has no way of identifying the specific reason he is associated with those cities.
Because of these difficulties and limitations, industry avoids entity relation extraction when building Chinese knowledge graphs. For example, the "Baidu knowledge graph" (created by Baidu) is built on data harvested by searching structured data and does not search unstructured data (natural language text). Another well-known example, the "Sogou knowledge graph", likewise searches only structured data and avoids unstructured data.
For English-based knowledge graphs, early entity relation extraction mainly identified relations through manually constructed semantic rules and templates. These methods require a great deal of manual intervention, are cumbersome, and are not flexible enough. Later, relational models between entities gradually replaced manually predefined grammars and rules, but the relation types between entities still had to be defined in advance. In recent years, open-domain information extraction (Open Information Extraction, OIE) has become the main research direction and has itself passed through stages such as open entity relation extraction and entity relation extraction based on joint inference. So far, however, this approach has proved unsuitable for extracting entity relations from Chinese natural language text.
In summary, extracting entity relations from Chinese natural language text is a problem that urgently needs to be solved in the field of Chinese knowledge graph construction.
Summary of the invention
The present invention aims to provide a novel algorithm that can effectively recognize the relations between entities in Chinese natural language text. The algorithm combines existing machine learning with the latest results in Chinese natural language understanding and provides reliable recognition, overcoming the limitation that Chinese knowledge graphs can only draw on structured data and thereby opening new possibilities for building Chinese knowledge graphs.
To achieve the above object, the technical solution adopted by the present invention is an automatic recognition algorithm for entity semantic relations in Chinese natural language, comprising the following steps:
S1: input the raw natural language text;
S2: extract "entity relationship" training texts from the raw natural language text and store them in the "entity relationship" training text library;
S3: read a text from the "entity relationship" training text library;
S4: extract the entity set from the text;
S5: identify related entity pairs, extract their relation sentences, and construct "entity relationship" sentences;
S6: store the constructed "entity relationship" sentences in the training "entity relationship" sentence library;
S7: if every text has been read, manually annotate each sentence in the "entity relationship" sentence library; otherwise return to step S3;
S8: perform machine learning on the annotated "entity relationship" sentence library and build the model;
S9: the "entity relationship" recognition model is established.
To improve efficiency, the entity set in step S4 may be extracted with existing Chinese natural language processing software, which extracts all Chinese entities from the text.
As a criterion for identifying related entity pairs in step S5, two entities can form a related entity pair only if both appear in the same sentence.
Constructing an "entity relationship" sentence in step S5 specifically means removing the entities and retaining all other content.
Preferably, the manual annotation in step S7 appends the manually annotated semantic relation to the end of each sentence.
Preferably, the machine learning in step S8 may use a Bayesian algorithm or an SVM.
The present invention also proposes an algorithm that uses the above automatic recognition algorithm to generate "entity relationship" triples from a given Chinese natural language text, comprising the following steps:
S21: input the raw natural language text;
S22: call the "text type" recognition model to recognize the text type and generate the text type triple;
S23: extract the entity set from the text;
S24: identify related entity pairs, extract their relation sentences, and construct "entity relationship" sentences;
S25: call the "entity relationship" recognition model to recognize the entity relations and generate the entity relation triples;
S26: collect all generated triple statements.
The "text type" recognition model that step S22 calls to recognize the text type and generate the text type triple is built by the following steps:
S31: input the raw natural language text set;
S32: extract "text type" training texts and store them in the "text type" training text library;
S33: manually annotate the type of each text;
S34: form the annotated training text set library;
S35: perform machine learning and build the model;
S36: the "text type" recognition model is complete.
Compared with the prior art, the present invention has the following advantages:
1. The present invention recognizes and constructs the relations between entities with an algorithm based on machine learning. The machine learning process builds a recognition model by analyzing all training "entity relationship" sentences. Because these sentences, once manually annotated, accurately express the semantic relations between entities, the classification model trained on them can, when faced with an entity pair it has never seen, judge the most likely semantic relation of that pair with a reliable degree of recognition.
2. Current experimental results confirm the effectiveness and scalability of the algorithm, which fills a gap in Chinese natural language entity relation extraction.
3. The proposed algorithm overcomes the limitation that Chinese knowledge graphs can only draw on structured data, thereby opening new possibilities for building Chinese knowledge graphs.
Brief description of the drawings
Fig. 1 is a schematic diagram of the generation process of the "text type" recognition model;
Fig. 2 is a schematic diagram of the generation process of the "entity relationship" recognition model;
Fig. 3 is a schematic diagram of the process of generating "entity relationship" triples from a given natural language text.
Specific embodiment
The present invention is now described in further detail with reference to the accompanying drawings.
The present invention provides an algorithm that can effectively recognize the relations between entities in Chinese natural language text. Its working principle is described below.
Algorithm input: a large number of Chinese natural language texts. Each text has only one topic: a text describing the building Tiananmen describes only Tiananmen, and a text describing a person describes only that person. Encyclopedia articles are texts that meet this condition.
Algorithm output: a large amount of structured data in the form of triples that conform to the international Semantic Web standard (Resource Description Framework, RDF). These triple statements describe the different entities and the relations between them. The ontology used when constructing the triples is schema.org, the most widely used internationally (adopted by Google and other companies), but the user may also specify a different one.
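As an illustration of the output format, the following minimal Python sketch represents a triple and serializes it as an N-Triples line. The `Triple` type, the helper names, the `http://example.org/entity/` namespace, and the choice of `schema:location` for the second triple are assumptions made for this sketch; the patent only states that the output is RDF triples built on schema.org.

```python
from typing import NamedTuple
from urllib.parse import quote

SCHEMA = "https://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ENTITY_BASE = "http://example.org/entity/"  # hypothetical namespace for entities


class Triple(NamedTuple):
    subject: str    # IRI of the subject entity
    predicate: str  # IRI of the relation (ontology property)
    obj: str        # IRI of the object entity or class


def entity_iri(name: str) -> str:
    """Build an IRI for an entity name, percent-encoding spaces and punctuation."""
    return ENTITY_BASE + quote(name)


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as an N-Triples line (all three terms are IRIs)."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# A text type triple: the text about Tiananmen is typed as a civic structure.
type_triple = Triple(entity_iri("Tiananmen"), RDF_TYPE, SCHEMA + "CivicStructure")
# An entity relation triple; the predicate is an illustrative choice.
relation_triple = Triple(entity_iri("US Open"), SCHEMA + "location", entity_iri("New York"))

for triple in (type_triple, relation_triple):
    print(to_ntriples(triple))
```

Using full IRIs keeps the lines valid N-Triples; a prefixed form such as `schema:location` is a shorthand for the same IRI.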
The design and logic of the algorithm are shown in Figs. 1, 2, and 3.
Fig. 1 describes the generation process of the "text type" recognition model. The input of the process is a set of raw natural language texts, each with a single topic; part of the set is extracted as "text type" training texts and stored in the "text type" training text library. Then, under expert guidance, the type of each training text in the library is manually annotated. When annotation is complete, the annotated training text set is formed. A machine learning method then reads the annotated training text set and learns from it, and the "text type" recognition model is thereby established.
Fig. 2 describes the generation process of the "entity relationship" recognition model. The input is likewise a set of raw natural language texts, each with a single topic; part of the set is extracted as "entity relationship" training texts and stored in the "entity relationship" training text library. The following operations are then carried out for each training text in the library:
One, read the current training text;
Two, extract all Chinese entities from the training text;
Three, from the extracted entity set, identify all related entity pairs. For each related entity pair, extract its relation sentence and construct the "entity relationship" sentence;
Four, store the constructed "entity relationship" sentences in the training "entity relationship" sentence library;
These operations are carried out on every text in the "entity relationship" training text library, producing a (huge) training "entity relationship" sentence library. Then, under expert guidance, the specific entity relation of each sentence in the library is manually annotated, forming the annotated "entity relationship" sentence library. A machine learning method then reads the annotated library and learns from it, and the "entity relationship" recognition model is thereby established.
At this point, two core models have been obtained that cover many text types and describe the entity relations in Chinese natural language: the "text type" recognition model and the "entity relationship" recognition model. For any given Chinese natural language text, these two core models allow a machine to extract the topic type of the text, all entities in the text, and, more importantly, the semantic relations between those entities. This process is described by the algorithm in Fig. 3.
Specifically, for any given Chinese natural language text, the first step shown in Fig. 3 is to call and run the "text type" recognition model to determine the basic semantic type of the topic of the text (it is worth repeating that each text has only one topic: a text describing the building Tiananmen describes only Tiananmen, and a text describing a person describes only that person). The basic semantic type returned by the model is recorded in the form of an RDF triple and held temporarily in memory.
The subsequent steps in Fig. 3 extract the entity relations in the text. First, the machine extracts all Chinese entities from the given text and identifies all related entity pairs among them. For each related entity pair, its relation sentence is extracted and the corresponding "entity relationship" sentence is constructed. This sentence is passed as input to the "entity relationship" recognition model, which is called and run to extract the semantic relation between the two entities. The result expresses the relation between the related entities; it is likewise recorded as an RDF triple statement and held temporarily in memory.
Finally, once the model has been called for every identified related entity pair and the corresponding semantic relations have been extracted, the algorithm in Fig. 3 performs its last step: it collects all generated triple statements and stores them in the relevant database.
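The flow of Fig. 3 can be sketched in Python as a single function that takes the two models and the helper steps as parameters. Every name here (`generate_triples`, the callable parameters, the explicit `topic` argument) is hypothetical, and the lambdas in the demo are toy stand-ins for the trained models and the entity extraction software, so this is an outline of the control flow under those assumptions rather than the patent's implementation.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Pair = Tuple[str, str, str]    # (entity_a, entity_b, sentence containing both)


def generate_triples(
    text: str,
    topic: str,
    classify_text_type: Callable[[str], str],
    extract_entities: Callable[[str], List[str]],
    find_related_pairs: Callable[[str, List[str]], List[Pair]],
    build_relation_sentence: Callable[[str, str, str], str],
    classify_relation: Callable[[str], str],
) -> List[Triple]:
    triples: List[Triple] = []
    # S22: run the "text type" model and record the type of the text's topic.
    triples.append((topic, "rdf:type", classify_text_type(text)))
    # S23: extract the entity set from the text.
    entities = extract_entities(text)
    # S24: for each related pair, build the "entity relationship" sentence.
    for a, b, sentence in find_related_pairs(text, entities):
        relation_sentence = build_relation_sentence(sentence, a, b)
        # S25: run the "entity relationship" model on that sentence.
        triples.append((a, classify_relation(relation_sentence), b))
    # S26: return every generated triple for storage.
    return triples


# Toy stand-ins so the sketch runs end to end; labels are illustrative only.
demo = generate_triples(
    text="The US Open is held in New York.",
    topic="US Open",
    classify_text_type=lambda text: "schema:Event",
    extract_entities=lambda text: ["US Open", "New York"],
    find_related_pairs=lambda text, ents: [(ents[0], ents[1], text)],
    build_relation_sentence=lambda s, a, b: s.replace(a, "").replace(b, "").strip(),
    classify_relation=lambda s: "schema:location",
)
print(demo)
```

Passing the models in as callables keeps the flow independent of which learning algorithm produced them.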
The extraction process in Fig. 3 is described for a single given natural language text. In practice, a large number of natural language texts serve as input, and the algorithm in Fig. 3 is applied to each input text in turn, generating a large number of RDF triple statements. These statements are the core building blocks of a knowledge graph; in this way, a knowledge graph that can express entities and the relations between them is successfully constructed.
The present invention fills a gap in Chinese natural language entity relation extraction and at the same time greatly facilitates building Chinese knowledge graphs, especially knowledge graphs based on Chinese natural language text.
As noted above, because an algorithm for Chinese natural language entity relation extraction has been lacking, the industry's basic approach when building Chinese knowledge graphs is to avoid extracting entity relations from natural language. For example, the "Baidu knowledge graph" is built on data harvested by searching structured data and does not search unstructured data (natural language). The "Sogou knowledge graph" likewise searches only structured data and avoids unstructured data.
As also noted above, English-based knowledge graphs have in recent years used open-domain information extraction (Open Information Extraction, OIE) to extract the semantic relations between entities, but so far this approach has not been suitable for extracting entity relations from Chinese natural language text.
Specific embodiment description:
In the following description, it is assumed that there is a collection of Chinese natural language texts, for example 10,000 texts in total, referred to as the "original text set". For ease of description, assume the original text set covers the following categories: person, event, building, and country. Further categories can be handled by analogy, and the description here applies equally.
The algorithm described in Fig. 1 may be implemented as follows:
One, randomly select from the original text set 100 texts about persons, 100 texts about events, 100 texts about buildings, and 100 texts about countries;
Two, the resulting 400 texts form the "text type" training text library;
Three, under expert guidance, manually annotate the type of each text in the "text type" training text library, specifically:
For each text about a person, manually label it as schema:Person
For each text about an event, manually label it as schema:Event
For each text about a building, manually label it as schema:CivicStructure
For each text about a country, manually label it as schema:Country;
Four, when the manual annotation is complete, the annotated training text set is formed;
Five, a machine learning method reads the annotated training text set and learns from it. Different learning algorithms can be chosen here, such as a Bayes classifier or an SVM;
Six, the result of machine learning is the "text type" recognition model, which is saved to persistent storage for later use.
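Steps one to four above can be sketched as a small sampling routine. In this sketch the category tag attached to each text stands in for the expert's manual annotation, and the function name, the `(text, category)` input shape, and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Category -> schema.org type, as listed in step three.
LABELS: Dict[str, str] = {
    "person": "schema:Person",
    "event": "schema:Event",
    "building": "schema:CivicStructure",
    "country": "schema:Country",
}


def sample_training_library(
    corpus: List[Tuple[str, str]],  # (text, category) pairs; the category is assumed known
    per_category: int = 100,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Randomly pick `per_category` texts for each category and attach the type label."""
    rng = random.Random(seed)
    by_category: Dict[str, List[str]] = defaultdict(list)
    for text, category in corpus:
        by_category[category].append(text)
    library: List[Tuple[str, str]] = []
    for category, label in LABELS.items():
        # random.sample raises ValueError if a category has too few texts.
        chosen = rng.sample(by_category[category], per_category)
        library.extend((text, label) for text in chosen)
    return library
```

With the default of 100 texts for each of the four categories, the returned library holds the 400 labelled texts described in step two.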
The algorithm described in Fig. 2 is the main content of the present invention; its details may be implemented as follows:
When describing the implementation of the algorithm in Fig. 2, the training text set above is used as an example (it is treated as equivalent to the "entity relationship" training text library), with person as the concrete type. Other types can be handled by analogy.
Suppose an article is now taken from the "entity relationship" training text library, and that it is about Deng Jiaxian (a person):
Deng Jiaxian (1924-1986), member of the Jiusan Society, academician of the Chinese Academy of Sciences, and renowned nuclear physicist, was a pioneer and founder of China's nuclear weapons development and made major contributions to the research and development of China's nuclear and atomic weapons. He was born in 1924 into a scholarly family in Huaining County, Anhui. In 1935 he was admitted to Zhicheng Middle School, where he was deeply influenced by the patriotic national salvation movement. After Beijing fell in 1937, he secretly took part in anti-Japanese gatherings. Later, as arranged by his father Deng Yizhe, he went to Kunming with his elder sister and in 1941 was admitted to the Department of Physics of National Southwestern Associated University. From 1948 to 1950 he studied at Purdue University in the United States and obtained his doctorate; in the year he graduated he resolutely returned to China. Deng Jiaxian was a principal organizer and leader of the research and development of China's atomic weapons. He was always on the front line of China's weapons development, leading many scholars and technical staff; he successfully designed China's atomic bomb and hydrogen bomb and brought China's national defense weapons to an internationally advanced level. Deng Jiaxian was exposed to nuclear radiation during an experiment and developed rectal cancer; he died in Beijing on July 29, 1986, aged 62.
The steps of the "entity relationship" recognition model generation process are as follows:
One, first extract all Chinese entities from the "Deng Jiaxian" article.
Existing Chinese natural language processing software can be used to extract all Chinese entities from this article about Deng Jiaxian. For example, the entities that can be extracted include: Deng Jiaxian (person), Huaining County, Anhui (place), Department of Physics of National Southwestern Associated University (organization), Purdue University (organization), Beijing (place), and so on.
Two, from the extracted entity set, identify all related entity pairs. For each related entity pair, extract its relation sentence and construct the "entity relationship" sentence.
As an example, the related entity pairs that can be identified include: Deng Jiaxian and Huaining County, Anhui; Deng Jiaxian and the Department of Physics of National Southwestern Associated University; Deng Jiaxian and Purdue University; Deng Jiaxian and Beijing. Two entities can form a related entity pair only if they appear in the same sentence. By this criterion, all of the above are related entity pairs. "Purdue University, Beijing" is not a related entity pair, because the two never appear together in one sentence.
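The same-sentence criterion can be sketched in Python as follows. The sentence splitter, the plain substring match, and the function names are simplifications assumed for this sketch; the toy text also repeats the subject in every sentence, so sentences whose subject is omitted would need extra handling that is not shown here.

```python
import re
from itertools import combinations
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter on Chinese and Western sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]


def find_related_pairs(text: str, entities: List[str]) -> List[Tuple[str, str, str]]:
    """Return (entity_a, entity_b, sentence) for every two entities sharing a sentence."""
    pairs = []
    for sentence in split_sentences(text):
        # Plain substring matching stands in for real entity mention spans.
        present = [e for e in entities if e in sentence]
        for a, b in combinations(present, 2):
            pairs.append((a, b, sentence))
    return pairs


toy_text = (
    "Deng Jiaxian was born in Huaining County. "
    "Deng Jiaxian studied at Purdue University. "
    "Deng Jiaxian died in Beijing."
)
toy_entities = ["Deng Jiaxian", "Huaining County", "Purdue University", "Beijing"]

for a, b, sentence in find_related_pairs(toy_text, toy_entities):
    print(a, "|", b, "|", sentence)
```

"Purdue University" and "Beijing" never share a sentence in the toy text, so they are not returned as a pair.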
Next, for each related entity pair, its relation sentence is extracted and the "entity relationship" sentence is constructed. Take the related entity pair "Deng Jiaxian, Huaining County, Anhui" as an example. The pair appears in the following sentence:
Born in 1924 into a scholarly family in Huaining County, Anhui
Removing the entities and retaining all other content gives the desired "entity relationship" sentence:
Born in 1924 into a scholarly family
The sentence above has the entity parts removed. After processing by existing Chinese natural language processing software (word segmentation), the "entity relationship" sentence is expressed as follows:
Nineteen twenty-four | birth | in | one | scholarly family | | family
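A minimal Python sketch of this construction step is given below. It assumes the segmentation software returns each entity as a single token, and the token list is an illustrative English gloss rather than the software's actual output; the function name is hypothetical.

```python
from typing import Iterable, List


def build_relation_sentence(tokens: List[str], entities: Iterable[str]) -> List[str]:
    """Drop every token that is an entity mention and keep all other tokens in order."""
    entity_set = set(entities)
    return [tok for tok in tokens if tok not in entity_set]


# English gloss of a segmented sentence; the entity is assumed to be one token.
tokens = ["1924", "born", "in", "Huaining County, Anhui", "a", "scholarly family", "household"]
pair = ("Deng Jiaxian", "Huaining County, Anhui")

relation_tokens = build_relation_sentence(tokens, pair)
print(" | ".join(relation_tokens))
```

Only the entity tokens are removed, so the remaining words that carry the relation (here "born in") stay available to the classifier.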
At this point, the "entity relationship" sentence corresponding to the related entity pair "Deng Jiaxian, Huaining County, Anhui" has been constructed. Next, the related entity pair "Deng Jiaxian, Department of Physics of National Southwestern Associated University" is used as a second example of how the corresponding "entity relationship" sentence is constructed.
The entity pair "Deng Jiaxian, Department of Physics of National Southwestern Associated University" appears in the following sentence:
and in 1941 was admitted to the Department of Physics of National Southwestern Associated University
Removing the entities and retaining all other content gives the desired "entity relationship" sentence:
and in 1941 was admitted to
The sentence above has the entity parts removed. After processing by existing Chinese natural language processing software (word segmentation), the "entity relationship" sentence is expressed as follows:
And | in | nineteen forty-one | it is admitted to
As the last example, consider the entity pair "Deng Jiaxian, Beijing". The other entity pairs can be analyzed with exactly the same steps and are not described in detail.
The entity pair "Deng Jiaxian, Beijing" appears in the following sentence:
passed away in Beijing on July 29, 1986
Removing the entities and retaining all other content gives the desired "entity relationship" sentence:
passed away on July 29, 1986
The sentence above has the entity parts removed. After processing by existing Chinese natural language processing software (word segmentation), the "entity relationship" sentence is expressed as follows:
In | 1986 | July | 29 days | | it is unfortunate | it dies
So far, three entity pairs have been analyzed, yielding the following three "entity relationship" sentences:
Nineteen twenty-four | birth | in | one | scholarly family | | family
And | in | nineteen forty-one | it is admitted to
In | 1986 | July | 29 days | | it is unfortunate | it dies
Three, store the constructed "entity relationship" sentences in the training "entity relationship" sentence library.
From the example above, the "entity relationship" sentence library contains at least the following sentences:
Nineteen twenty-four | birth | in | one | scholarly family | | family
And | in | nineteen forty-one | it is admitted to
In | 1986 | July | 29 days | | it is unfortunate | it dies
In practice, a single text can contain many entity pairs and can therefore produce many "entity relationship" sentences.
Four, repeat the above operations for every text in the "entity relationship" training text library, producing a huge training "entity relationship" sentence library.
Five, under expert guidance, manually annotate the specific entity relation of each sentence in the training "entity relationship" sentence library.
Taking the three sentences above as an example, the following annotations are obtained:
Nineteen twenty-four | birth | in | one | scholarly family | | family [schema:birthplace]
And | in | nineteen forty-one | it is admitted to [schema:alumniOf]
In | 1986 | July | 29 days | | it is unfortunate | die [schema:deathplace]
At the end of each sentence above, the semantic relation manually marked is added into.Here example is to use Schema.org is as ontology (ontology).In different applications, user, which can choose, is more suitable oneself ontology.
Six, after the completion of artificial mark as described above, " entity relationship " statement library after being marked.At this point, using machine The method of device study reads " entity relationship " statement library after mark, carries out machine learning, as a result, " entity relationship " recognizes The foundation of model.Here it is possible to select using bayesian algorithm, SVM can also be selected to carry out specific machine learning.
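As one hedged illustration of the Bayesian option mentioned above, the sketch below trains a small multinomial naive Bayes classifier on an annotated statement library. The class name `NaiveBayesRelationModel`, the bag-of-tokens features, and the add-one smoothing are all assumptions made for illustration; the text does not specify the features or the exact algorithm variant. The tokens are English glosses of the segmented Chinese sentences, and the labels are copied from the annotation example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple

LabeledSentence = Tuple[List[str], str]  # (segmented sentence, relation label)


class NaiveBayesRelationModel:
    """Multinomial naive Bayes over segmented "entity relationship" sentences."""

    def fit(self, library: List[LabeledSentence]) -> "NaiveBayesRelationModel":
        self.label_counts: Counter = Counter()
        self.token_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary: Set[str] = set()
        for tokens, label in library:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens: List[str]) -> Optional[str]:
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior of the relation
            label_total = sum(self.token_counts[label].values())
            for token in tokens:
                # Add-one (Laplace) smoothing so unseen tokens do not zero out.
                seen = self.token_counts[label][token]
                score += math.log((seen + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Annotated library built from the three example sentences (English glosses).
LIBRARY: List[LabeledSentence] = [
    (["1924", "born", "in", "a", "scholarly family", "family"], "schema:birthPlace"),
    (["and", "in", "1941", "admitted to"], "schema:alumniOf"),
    (["in", "1986", "July", "29th", "unfortunately", "died"], "schema:deathplace"),
]

model = NaiveBayesRelationModel().fit(LIBRARY)
print(model.predict(["1933", "May", "22nd", "born", "in"]))
```

With only three training sentences the scores are close together, which is why the text calls for a very large statement library before the model can be expected to be accurate.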
With the implementation above in place, relation extraction can now be performed on an unknown text (one taken from the original text collection but not among the training texts). This process is described in detail in Fig. 3; here a concrete example illustrates how the algorithm in Fig. 3 is carried out. The example is a natural language text describing a person; other kinds of text can be handled by analogy.
Suppose this text, which is not contained in the training text library, is about Chen Jingrun and reads as follows:
Chen Jingrun, born on May 22, 1933 in Fuzhou, Fujian, was a modern mathematician. In September 1953 he was assigned to teach at Beijing No. 4 Middle School. In February 1955, on the recommendation of Mr. Wang Yanan, then president of Xiamen University, he returned to his alma mater as a teaching assistant in the Department of Mathematics of Xiamen University. In October 1957, thanks to the appreciation of Professor Hua Luogeng, Chen Jingrun was transferred to the Institute of Mathematics of the Chinese Academy of Sciences. In 1973 he published a detailed proof of (1+2), which is recognized as a major contribution to research on Goldbach's conjecture. In March 1981 he was elected a member (academician) of the Chinese Academy of Sciences. He served as a member of the Mathematics Discipline Group of the State Scientific and Technological Commission. In 1992 he became editor-in-chief of "Mathematics Journal". At 1:10 p.m. on March 19, 1996, Chen Jingrun died at Beijing Hospital, aged only 63.
The goal now is for the machine to understand this text. First, the machine should determine that this is a text about a person. Second, the machine should identify the entities contained in the text and the relationships between those entities. All extracted content is expressed as RDF statements that conform to international semantic standards; these statements are also the basic building blocks for constructing a knowledge graph.
First, with this text as input, the "text type" identification model is called and run to determine the basic semantic type of the subject of the natural language text.
Here, if the "text type" identification model is accurate enough, it assigns the correct type to the text: this is a text about schema:Person, that is, a text about a person. At the same time it generates the following triple statement:
ex:ChenJingrun rdf:type schema:Person.
Second, the machine extracts the set of all Chinese entities from the "Chen Jingrun" text. It can use existing Chinese natural language processing software to do so. For example, the extracted entities include: Chen Jingrun (person), Fuzhou, Fujian (place), Institute of Mathematics of the Chinese Academy of Sciences (organization), Beijing (place), and so on.
Third, from the extracted entity set, the machine picks out all related entity pairs. For each related entity pair, it extracts the relational statement and constructs an "entity relationship" sentence.
Here, the related entity pairs that the machine can identify are: "Chen Jingrun / Fuzhou, Fujian", "Chen Jingrun / Institute of Mathematics of the Chinese Academy of Sciences", and "Chen Jingrun / Beijing". Next, the machine extracts the relational statement for each related entity pair and constructs an "entity relationship" sentence.
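The selection of related entity pairs can be sketched as below, using the condition stated in the claims that two entities form a related pair only if both appear in the same sentence. This is a simplified illustration: the function name `related_entity_pairs` is hypothetical, sentence splitting on final punctuation is an assumption, and the sample text is an English gloss. Entity extraction itself is left to existing software, so the entity list is supplied by hand here.

```python
import re
from itertools import combinations
from typing import List, Tuple


def related_entity_pairs(
    text: str, entities: List[str]
) -> List[Tuple[str, str, str]]:
    """Return (entity_a, entity_b, sentence) for entities sharing a sentence."""
    pairs = []
    # Split on Chinese and Western sentence-final punctuation (assumption).
    for sentence in re.split(r"[。！？.!?]", text):
        present = [entity for entity in entities if entity in sentence]
        for first, second in combinations(present, 2):
            pairs.append((first, second, sentence.strip()))
    return pairs


text = "Chen Jingrun was born in Fuzhou. Chen Jingrun died in Beijing."
for pair in related_entity_pairs(text, ["Chen Jingrun", "Fuzhou", "Beijing"]):
    print(pair)
```

Returning the sentence together with the pair means the next step can remove the two entities from exactly the sentence in which they co-occur.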
Take the related entity pair "Chen Jingrun / Fuzhou, Fujian" as an example. This entity pair appears in the following sentence:
Born on May 22, 1933 in Fuzhou, Fujian
Removing the entities and retaining all other content gives the desired "entity relationship" sentence:
Born on May 22, 1933 in
The entity parts have been removed from the sentence above. After processing by existing Chinese natural language processing software, this "entity relationship" sentence is expressed as follows:
1933 | May | 22nd | born | in
Next, take the related entity pair "Chen Jingrun / Institute of Mathematics of the Chinese Academy of Sciences". This entity pair appears in the following sentence:
Chen Jingrun was transferred to the Institute of Mathematics of the Chinese Academy of Sciences
Removing the entities and retaining all other content gives the desired "entity relationship" sentence:
was transferred to
The entity parts have been removed from the sentence above. After processing by existing Chinese natural language processing software, this "entity relationship" sentence is expressed as follows:
was | transferred to
Finally, take the related entity pair "Chen Jingrun / Beijing". This entity pair appears in the following sentence:
Chen Jingrun died at Beijing Hospital
Removing the entities and retaining all other content gives the desired "entity relationship" sentence:
died in hospital
The entity parts have been removed from the sentence above. After processing by existing Chinese natural language processing software, this "entity relationship" sentence is expressed as follows:
| hospital | died
Fourth, the machine has now generated the following "entity relationship" sentences:
Chen Jingrun / Fuzhou, Fujian: 1933 | May | 22nd | born | in
Chen Jingrun / Institute of Mathematics of the Chinese Academy of Sciences: was | transferred to
Chen Jingrun / Beijing: | hospital | died
The first "entity relationship" sentence is used as input, and the "entity relationship" identification model is called and run. If the model is accurate enough, it should identify the relationship of "Chen Jingrun / Fuzhou, Fujian" as the birthplace relationship. This relationship is recorded as an RDF triple statement as follows:
ex:ChenJingrun schema:birthPlace ex:FuzhouFujian
Likewise, the second "entity relationship" sentence is used as input, and the "entity relationship" identification model is called and run. The model should identify the relationship of "Chen Jingrun / Institute of Mathematics of the Chinese Academy of Sciences" as the place-of-work relationship. This relationship is recorded as an RDF triple statement as follows:
ex:ChenJingrun schema:workplace ex:InstituteOfMathematicsChineseAcademyOfSciences
Finally, the third "entity relationship" sentence is used as input, and the "entity relationship" identification model is called and run. The model should identify the relationship of "Chen Jingrun / Beijing" as his place of death. This relationship is recorded as an RDF triple statement as follows:
ex:ChenJingrun schema:deathplace ex:Beijing
In this way, for each entity pair the machine has identified, the semantic relation between the two entities is extracted.
Fifth, for this given unknown text, the machine obtains the following RDF triple statements:
ex:ChenJingrun rdf:type schema:Person.
ex:ChenJingrun schema:birthPlace ex:FuzhouFujian
ex:ChenJingrun schema:workplace ex:InstituteOfMathematicsChineseAcademyOfSciences
ex:ChenJingrun schema:deathplace ex:Beijing
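The collected triples above could be written out in a standard RDF syntax such as Turtle. The sketch below is illustrative only: the helper `to_turtle` is hypothetical, the `ex:` namespace IRI is a placeholder assumption, and the property names are copied from the example as given rather than checked against the Schema.org vocabulary.

```python
from typing import List, Mapping, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as prefixed names

PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "schema": "http://schema.org/",
    "ex": "http://example.org/",  # placeholder namespace (assumption)
}

# The four triples obtained for the example text, copied as given.
TRIPLES: List[Triple] = [
    ("ex:ChenJingrun", "rdf:type", "schema:Person"),
    ("ex:ChenJingrun", "schema:birthPlace", "ex:FuzhouFujian"),
    (
        "ex:ChenJingrun",
        "schema:workplace",
        "ex:InstituteOfMathematicsChineseAcademyOfSciences",
    ),
    ("ex:ChenJingrun", "schema:deathplace", "ex:Beijing"),
]


def to_turtle(triples: List[Triple], prefixes: Mapping[str, str] = PREFIXES) -> str:
    """Serialize prefixed-name triples as a small Turtle document."""
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()]
    lines.append("")  # blank line between the prefixes and the statements
    lines.extend(f"{s} {p} {o} ." for s, p, o in triples)
    return "\n".join(lines)


print(to_turtle(TRIPLES))
```

Emitting a standard serialization lets the triples be loaded directly into an RDF store when the knowledge graph is assembled.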
This completes the automatic machine extraction of entity relationships. As mentioned earlier, because algorithms for extracting entity relationships from Chinese natural language are currently lacking, the basic approach in industry when building a Chinese knowledge graph is to avoid extracting entity relationships from natural language at all. For example, the "Baidu knowledge graph" is built on data harvested by searching structured data, without searching unstructured data (natural language). The "Sogou knowledge graph" likewise searches structured data and avoids unstructured data.
As also described earlier, English-language knowledge graphs have in recent years used open-domain information extraction frameworks (Open Information Extraction, OIE) to extract the semantic relations between entities, but so far this approach has not been suitable for extracting entity relationships from Chinese natural language text. The present invention fills the gap in Chinese natural language entity relation extraction and at the same time greatly promotes the construction of Chinese knowledge graphs, especially knowledge graphs based on Chinese natural language text.
It should be noted that the above description of specific embodiments is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (8)

1. An automatic identification algorithm for Chinese natural language entity semantic relationships, characterized by comprising the following steps:
S1: input the original natural language texts;
S2: extract "entity relationship" training texts from the original natural language texts and store them in an "entity relationship" training text library;
S3: read a text from the "entity relationship" training text library;
S4: extract the entity set from the text;
S5: pick out related entity pairs, extract their relational statements, and construct "entity relationship" sentences;
S6: store the constructed "entity relationship" sentences in the training "entity relationship" statement library;
S7: if every text has been read, manually annotate each sentence in the "entity relationship" statement library; otherwise return to step S3;
S8: perform machine learning and modeling on the annotated "entity relationship" statement library;
S9: establish the "entity relationship" identification model.
2. The automatic identification algorithm for Chinese natural language entity semantic relationships according to claim 1, characterized in that the extraction of the entity set in step S4 can use existing Chinese natural language processing software to extract the set of all Chinese entities from the text.
3. The automatic identification algorithm for Chinese natural language entity semantic relationships according to claim 1, characterized in that, when related entity pairs are picked out in step S5, the condition for two entities to form a related entity pair is that both must appear in the same sentence.
4. The automatic identification algorithm for Chinese natural language entity semantic relationships according to claim 1, characterized in that constructing the "entity relationship" sentence in step S5 specifically means removing the entities and retaining all other content.
5. The automatic identification algorithm for Chinese natural language entity semantic relationships according to claim 1, characterized in that the manual annotation in step S7 consists of appending the manually annotated semantic relation at the end of each sentence.
6. The automatic identification algorithm for Chinese natural language entity semantic relationships according to claim 1, characterized in that the machine learning in step S8 can be carried out using a Bayesian algorithm or an SVM.
7. An algorithm that uses the automatic identification algorithm for Chinese natural language entity semantic relationships of claim 1 to generate "entity relationship" triples for a given Chinese natural language text, characterized by comprising the following steps:
S71: input the original natural language text;
S72: call the "text type" identification model to identify the text type and generate a text type triple;
S73: extract the entity set from the text;
S74: pick out related entity pairs, extract their relational statements, and construct "entity relationship" sentences;
S75: call the "entity relationship" identification model to identify the entity relationships and generate entity relationship triples;
S76: collect all generated triple statements.
8. The algorithm for generating "entity relationship" triples for a given Chinese natural language text according to claim 7, characterized in that step S72 specifically comprises the following steps:
S81: input the original natural language text collection;
S82: extract "text type" training texts and store them in a "text type" training text library;
S83: manually annotate the type of each text;
S84: form the annotated training text library;
S85: perform machine learning and modeling;
S86: complete the "text type" identification model.
CN201810796558.3A 2018-07-19 2018-07-19 The automatic identification algorithm of Chinese natural language Entity Semantics relationship Pending CN109062894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810796558.3A CN109062894A (en) 2018-07-19 2018-07-19 The automatic identification algorithm of Chinese natural language Entity Semantics relationship

Publications (1)

Publication Number Publication Date
CN109062894A true CN109062894A (en) 2018-12-21

Family

ID=64817342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810796558.3A Pending CN109062894A (en) 2018-07-19 2018-07-19 The automatic identification algorithm of Chinese natural language Entity Semantics relationship

Country Status (1)

Country Link
CN (1) CN109062894A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133848A (en) * 2014-07-01 2014-11-05 中央民族大学 Tibetan language entity knowledge information extraction method
CN104657750A (en) * 2015-03-23 2015-05-27 苏州大学张家港工业技术研究院 Method and device for extracting character relation
CN108052625A (en) * 2017-12-18 2018-05-18 清华大学 A kind of entity sophisticated category method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姜丽: "面向药品说明书的医疗实体关系抽取方法研究", 《万方数据库》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032649A (en) * 2019-04-12 2019-07-19 北京科技大学 Relation extraction method and device between a kind of entity of TCM Document
CN110032650A (en) * 2019-04-18 2019-07-19 腾讯科技(深圳)有限公司 A kind of generation method, device and the electronic equipment of training sample data
CN110688857A (en) * 2019-10-08 2020-01-14 北京金山数字娱乐科技有限公司 Article generation method and device
CN111538843A (en) * 2020-03-18 2020-08-14 广州多益网络股份有限公司 Knowledge graph relation matching method, model construction method and device in game field
CN111538843B (en) * 2020-03-18 2023-06-16 广州多益网络股份有限公司 Knowledge-graph relationship matching method and model building method and device in game field
CN111597812A (en) * 2020-05-09 2020-08-28 北京合众鼎成科技有限公司 Financial field multiple relation extraction method based on mask language model
WO2021253238A1 (en) * 2020-06-16 2021-12-23 Baidu.Com Times Technology (Beijing) Co., Ltd. Learning interpretable relationships between entities, relations, and concepts via bayesian structure learning on open domain facts
CN113486189A (en) * 2021-06-08 2021-10-08 广州数说故事信息科技有限公司 Open knowledge graph mining method and system
CN115827884A (en) * 2022-07-27 2023-03-21 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181221
