CN108520065A - Construction method, system, device and storage medium for a named entity recognition corpus - Google Patents

Publication number
CN108520065A
Authority
CN
China
Prior art keywords
entity
chinese
name
wiki
internal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810325492.XA
Other languages
Chinese (zh)
Other versions
CN108520065B (en)
Inventor
钱龙华
何云琪
李雁群
王红玲
周国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201810325492.XA
Publication of CN108520065A
Application granted
Publication of CN108520065B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a computer-based method for constructing a Chinese named entity recognition corpus that uses Chinese Wikipedia as its source material. By extracting features of Chinese Wikipedia entries, the method classifies the entries, identifies Chinese Wiki entity entries, and predicts the type of the named entity corresponding to each Chinese Wiki entity entry. Finally, a Chinese Wiki entity list containing the named entities is constructed from the types and the redirection information, and all named entities in the Chinese Wiki entity list constitute the Chinese named entity recognition corpus. The resulting corpus is rich in content and covers a wide range of domains. Moreover, with this method the corpus can be built automatically by computer, saving manpower and material resources. The invention also discloses a system and a device for constructing a Chinese named entity recognition corpus and a computer-readable storage medium, which have the same effects.

Description

Construction method, system, device and storage medium for a named entity recognition corpus
Technical Field
The present invention relates to the technical field of natural language processing, and in particular to a construction method, system, device and storage medium for a named entity recognition corpus.
Background
The purpose of information extraction is to extract entities and their relations from unstructured free text and convert them into a structured representation, thereby providing a data foundation for building knowledge bases.
In the prior art, research on Chinese named entity recognition mainly relies on high-quality manually annotated corpora, such as the January 1998 People's Daily corpus, the Microsoft Research Asia MSRA corpus, the City University of Hong Kong CityU corpus, and the ACE2005 Chinese corpus. These corpora differ in their named entity categories, annotation rules, and size. To ensure quality, all of them must be annotated by professionals, which not only limits the size and domain of the corpus but also consumes a great deal of manpower and material resources. For example, the January 1998 People's Daily corpus belongs to the news domain: its content is dated, and accuracy is low when it is applied to domains other than news.
Therefore, how to automatically build a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains is a technical problem that those skilled in the art currently need to solve.
Summary of the Invention
The object of the present invention is to provide a construction method, system, device and storage medium for a named entity recognition corpus, capable of automatically building a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains.
To solve the above technical problem, the present invention provides a computer-based method for constructing a Chinese named entity recognition corpus, including:
extracting features of Chinese Wikipedia entries, and predicting, according to the features, the type of the named entity corresponding to each Chinese Wiki entity entry;
building, based on the types and the redirection information of the Chinese Wiki entity entries, a Chinese Wiki entity list containing the named entities, so as to constitute the corpus;
wherein a Chinese Wiki entity entry is a Chinese Wikipedia entry that contains a named entity.
Preferably, extracting the features of the Chinese Wikipedia entries is specifically:
extracting the features from the infobox, category box and abstract of the Chinese Wiki entity entries.
Preferably, after building the Chinese Wiki entity list containing the named entities, the method further includes:
identifying internal entities in the Chinese Wiki entity list, and generating nested named entities that contain the internal entities;
adding the nested named entities to a Chinese nested entity list, and determining whether each internal entity in the Chinese nested entity list satisfies the nest relation;
removing the annotation of each first internal entity, which satisfies the nest relation, and deleting each nested named entity that contains a second internal entity, which does not satisfy the nest relation.
Preferably, identifying the internal entities in the Chinese Wiki entity list specifically includes:
determining the named entities without type ambiguity in the Chinese Wiki entity list;
using the Chinese Wiki entity list as a dictionary, identifying the internal entities contained in the named entities without type ambiguity by the longest-match principle.
Preferably, determining whether each internal entity satisfies the nest relation is specifically:
judging whether the Chinese Wiki entity entries that the internal entity points to intersect with the Chinese Wiki entity entries that the external entity corresponding to the internal entity points to;
if so, determining that the internal entity is a first internal entity, which satisfies the nest relation;
if not, determining that the internal entity is a second internal entity, which does not satisfy the nest relation.
Preferably, after removing the annotation of each first internal entity that satisfies the nest relation and deleting each nested named entity that contains a second internal entity that does not satisfy the nest relation, the method further includes:
judging whether a first external entity in the Chinese nested entity list is an internal entity of a second external entity;
if so, merging the nested structure of the first external entity into the second external entity.
To solve the above technical problem, the present invention also provides a computer-based system for constructing a Chinese named entity recognition corpus, including:
a prediction module, configured to extract features of Chinese Wikipedia entries and to predict, according to the features, the type of the named entity corresponding to each Chinese Wiki entity entry;
a building module, configured to build, based on the types and the redirection information of the Chinese Wiki entity entries, a Chinese Wiki entity list containing the named entities, so as to constitute the corpus;
wherein a Chinese Wiki entity entry is a Chinese Wikipedia entry that contains a named entity.
To solve the above technical problem, the present invention also provides a device for constructing a Chinese named entity recognition corpus, including:
a memory, configured to store a construction program;
a processor, configured to implement, when executing the construction program, the steps of any of the methods for constructing a Chinese named entity recognition corpus described above.
To solve the above technical problem, the present invention also provides a computer-readable storage medium on which a construction program is stored; when the construction program is executed by a processor, the steps of any of the methods for constructing a Chinese named entity recognition corpus described above are implemented.
Compared with the prior art, the computer-based method provided by the present invention uses Chinese Wikipedia as its source material. By extracting features of Chinese Wikipedia entries, it classifies the entries, identifies Chinese Wiki entity entries, and predicts the type of the named entity corresponding to each Chinese Wiki entity entry. Finally, it constructs a Chinese Wiki entity list containing the named entities from the types and the redirection information, and all named entities in that list constitute the Chinese named entity recognition corpus. Wikipedia is a free-content, openly edited, multilingual online encyclopedia collaboration project that covers a large number of named entities, so it is rich in content and covers a wide range of domains; a corpus built with this method therefore has the same advantages. Moreover, because the method is computer-based, a corresponding computer program can automatically extract the features of the Chinese Wikipedia entries, predict the named entity types, and build the Chinese Wiki entity list that constitutes the corpus, saving a great deal of manpower and material resources. The method can therefore automatically build a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains. The system, device and computer-readable storage medium also provided by the present invention have the same effects.
Brief Description of the Drawings
To describe the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention;
Fig. 3 is a flowchart of another method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the composition of a system for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the composition of a device for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a construction method, system, device and storage medium for a named entity recognition corpus, capable of automatically building a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains.
To enable those skilled in the art to better understand the technical solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of a method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention. The method provided by this embodiment is computer-based and, as shown in Fig. 1, includes:
S10: Extract features of Chinese Wikipedia entries, and predict, according to the features, the type of the named entity corresponding to each Chinese Wiki entity entry.
Here, a Chinese Wikipedia entry is a Wikipedia entry displayed in Chinese; a Chinese Wiki entity entry is a Chinese Wikipedia entry that corresponds to a named entity; a named entity refers to an object in the real world and usually consists of one or more consecutive words in the text; the type of a named entity can be person (nr), place name (ns), organization (nt), and so on. For example, "[Beijing] ns" indicates that Beijing is a place-name entity.
Wikipedia is a free-content, openly edited, multilingual online encyclopedia collaboration project. It covers a large number of named entities, is rich in content, and spans a wide range of domains. Its content is presented as individual entries, each with a corresponding Wiki page. The article on a Wiki page contains a wealth of structured, semi-structured and unstructured information, such as templates, infoboxes and page categories, which is of great value for natural language processing research.
The features of Chinese Wikipedia entries include effective features mined from Wikipedia as well as extended features and word-sense features added to reflect characteristics of Chinese, so that the Chinese Wikipedia entries can be classified according to these features and the type of the named entity corresponding to each Chinese Wiki entity entry can be predicted. In a preferred embodiment, extracting the features of the Chinese Wikipedia entries is specifically: extracting the features from the infobox, category box and abstract of the Chinese Wiki entity entries.
For example, suppose a Chinese Wiki entity entry contains the named entity "Ma Yun". Its infobox holds summary information about the subject such as "born: September 10, 1964", "residence: mainland China", "nationality: People's Republic of China", "alma mater: Hangzhou Normal University" and "occupation: chairman of the board of Alibaba Group", so field names such as "name", "born", "residence", "alma mater" and "occupation" can be extracted as bag-of-words features. Its category box contains "1964 births", "living people", "billionaires of the People's Republic of China", "Alibaba Group" and so on, so the head word of each category, such as "births", "people" and "billionaires", can be extracted as a feature. The defining sentence of its abstract is "Ma Yun (English name Jack Ma, September 10, 1964), entrepreneur of the People's Republic of China", so the head word "entrepreneur" can be extracted as a feature.
In a specific implementation, a batch of Chinese Wiki entity entries whose named entity types have been annotated in advance can be selected as training data for an existing classifier, and the classifier is trained to obtain a classification model. For step S10, multiple features can be extracted from the infobox, category box and abstract of a Chinese Wiki entity entry and combined into the feature vector of the named entity; the existing classifier and the pre-trained classification model are then used to classify the feature vector and obtain the type of the corresponding named entity.
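The feature-extraction part of step S10 can be illustrated with a minimal Python sketch. It assumes that the infobox field names, the category head words and the abstract head word have already been pulled out of the page (the text does not say how this is done). The function names `extract_features` and `to_vector`, the `source=value` naming of features, and the English feature strings are illustrative assumptions, not part of the described method.

```python
from collections import Counter

def extract_features(infobox_keys, category_heads, abstract_head):
    """Build a bag of features for one entry (hypothetical helper).

    All inputs are assumed to be pre-extracted: infobox field names,
    head words of the category labels, and the head word of the
    defining sentence in the abstract.
    """
    features = Counter()
    for key in infobox_keys:
        features["infobox=" + key] += 1
    for head in category_heads:
        features["category=" + head] += 1
    if abstract_head:
        features["abstract=" + abstract_head] += 1
    return features

def to_vector(features, vocabulary):
    """Lay the feature bag out in a fixed vocabulary order.

    Features missing from the entry become 0; features that are not in
    the vocabulary are ignored.
    """
    return [features.get(name, 0) for name in vocabulary]

# Illustrative English renderings of the "Ma Yun" example features.
ma_yun = extract_features(
    infobox_keys=["name", "born", "residence", "alma mater", "occupation"],
    category_heads=["births", "people", "billionaires"],
    abstract_head="entrepreneur",
)
vocabulary = sorted(ma_yun)          # in practice: built from all training entries
vector = to_vector(ma_yun, vocabulary)
```

The prefix on each feature keeps the three sources apart, so that a word appearing both as an infobox field and as a category head word yields two distinct features. The resulting vector would then be handed to whichever existing classifier is chosen; that choice is left open here, as it is in the text.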
S11: Based on the types and the redirection information of the Chinese Wiki entity entries, build a Chinese Wiki entity list containing the named entities, so as to constitute the corpus.
A named entity may have several names, including a title and aliases. The redirection information of the Chinese Wiki entity entries can therefore be used to find names that differ but denote the same named entity; together with the type of the named entity, this identifies a named entity. Because a named entity is determined by its name and type, the Chinese Wiki entity list built from the types and the redirection information should contain at least two items for each named entity: its name and its type. In the Chinese Wiki entity list, the name and type of the same named entity correspond to each other. Finally, all named entities written into the Chinese Wiki entity list constitute the corpus.
Named entities can be divided according to whether their names are ambiguous into named entities with name ambiguity and named entities without name ambiguity. Named entities with name ambiguity can be further divided according to whether their types are ambiguous into named entities with type ambiguity and named entities without type ambiguity. Name ambiguity means that the same name points to two or more Chinese Wiki entity entries; type ambiguity means that a named entity with name ambiguity has two or more types.
For example, a named entity is named "Bandung" and its corresponding Chinese Wiki entity entries are "22735" and "5044266", so the named entity has name ambiguity. If its type is "ORG" in Chinese Wiki entity entry "22735" and "PER" in Chinese Wiki entity entry "5044266", the named entity also has type ambiguity.
The Chinese Wiki entity list records the name and type of each named entity. Named entities fall into two kinds. The first kind has no type ambiguity and comprises named entities without name ambiguity and named entities with name ambiguity but without type ambiguity; the second kind has type ambiguity. In a specific implementation, for the first kind, the name of the named entity is added to the Chinese Wiki entity list together with its unique type; for the second kind, the name is added together with its several different types. Note that a named entity with type ambiguity necessarily has name ambiguity as well.
For example, if a named entity is named "Ke Lisidengnaijiate", its corresponding Chinese Wiki entity entry is "125" and its type is "PER", then "Ke Lisidengnaijiate PER" is added to the Chinese Wiki entity list directly. If a named entity is named "Bandung" and its corresponding Chinese Wiki entity entries are "22735" and "5044266", with type "ORG" in entry "22735" and type "PER" in entry "5044266", then "Bandung PER, ORG" is added to the Chinese Wiki entity list.
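As a rough sketch of how such a list might be assembled, the Python below groups (name, entry id) pairs, which would come from entry titles and redirects, and looks up the predicted type of each entry. The data layout and the helper names `build_entity_list`, `has_name_ambiguity`, `has_type_ambiguity` and `list_rows` are assumptions made for illustration; only the example names, entry ids and types come from the text.

```python
from collections import defaultdict

def build_entity_list(name_links, entry_types):
    """Group names by the entries they point to (hypothetical layout).

    name_links:  iterable of (name, entry_id) pairs from titles and redirects.
    entry_types: dict mapping entry_id to the predicted entity type.
    Returns a dict: name -> {entry_id: type}.
    """
    targets = defaultdict(dict)
    for name, entry_id in name_links:
        targets[name][entry_id] = entry_types[entry_id]
    return dict(targets)

def has_name_ambiguity(entries):
    # The same name points to two or more entries.
    return len(entries) >= 2

def has_type_ambiguity(entries):
    # A name with name ambiguity whose entries carry two or more types.
    return has_name_ambiguity(entries) and len(set(entries.values())) >= 2

def list_rows(entity_list):
    """One row per name: the name plus all of its distinct types."""
    return [(name, sorted(set(entries.values())))
            for name, entries in sorted(entity_list.items())]

name_links = [
    ("Bandung", "22735"),
    ("Bandung", "5044266"),
    ("Ke Lisidengnaijiate", "125"),
]
entry_types = {"22735": "ORG", "5044266": "PER", "125": "PER"}

entity_list = build_entity_list(name_links, entry_types)
rows = list_rows(entity_list)
```

Keeping the entry ids next to the types, rather than storing only the types, is a design choice of this sketch: the ids are what a later nest-relation check would compare.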
In conclusion the construction method of Chinese name Entity recognition corpus provided in an embodiment of the present invention, based on calculating Machine, can be to Chinese wikipedia by extracting the feature of Chinese wikipedia entry as language material using Chinese wikipedia Entry is classified, and determines Chinese Wiki entity entries, and predicts the class of the corresponding name entity of Chinese Wiki entity entries Type finally constructs the Chinese Wiki list of entities for including name entity based on type and redirection information, can be tieed up by Chinese All name entities in base list of entities constitute Chinese name Entity recognition corpus.Since wikipedia is in a freedom Hold, open editor and multilingual network encyclopedia collaboration items, cover a large amount of name entity, have it is abundant in content and The features such as Covering domain is wide, so, the corpus of the construction method structure of Entity recognition corpus is named using this Chinese, equally Have the advantages that abundant in content and neighborhood covering degree is wide.Moreover, this construction method is based on computer, corresponding calculating can be passed through It is real that machine program automatically extracts the feature of Chinese wikipedia entry, automatic Prediction name entity type and the Chinese Wiki of automatic structure Body list can save a large amount of manpower and materials to constitute corpus.Therefore, using this construction method, one can be built automatically Kind has many advantages, such as abundant in content and wide application field Chinese name Entity recognition corpus.
The name entity for including in Chinese Wiki list of entities described above includes nested name entity and non-nested life Name entity, and be not distinguish, so, the name entity for including in Chinese Wiki list of entities by mentioned earlier is constituted Corpus also without distinguishing nested name entity and non-nested name entity.And in practical applications, since nesting is named Entity contains the correlation between abundant entity information and entity, and its is complicated changeable, so nested name is real The identification of body is also to be worth one of the task of research in information extraction.
Fig. 2 is a flowchart of another method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention. As shown in Fig. 2, building on the above embodiment, in a preferred embodiment the method further includes, after building the Chinese Wiki entity list containing the named entities:
S20: Identify the internal entities in the Chinese Wiki entity list, and generate nested named entities that contain the internal entities.
S21: Add the nested named entities to a Chinese nested entity list, and determine whether each internal entity in the Chinese nested entity list satisfies the nest relation.
S22: Remove the annotation of each first internal entity, which satisfies the nest relation, and delete each nested named entity that contains a second internal entity, which does not satisfy the nest relation.
Note that a nested named entity is a named entity that has one or more named entities nested inside it; an internal entity is a named entity nested inside a nested named entity; an external entity is the outermost named entity of a nested named entity; a first internal entity is an internal entity in the Chinese nested entity list that satisfies the nest relation, and a second internal entity is one that does not.
In this preferred embodiment, after the internal entities in the Chinese Wiki entity list are identified, nested named entities containing them are generated and added to the Chinese nested entity list. Each internal entity in the Chinese nested entity list is then checked against the nest relation to determine which internal entities satisfy it and which do not; finally, the nested named entities that satisfy the nest relation are retained and those that do not are deleted. For a first internal entity, which satisfies the nest relation, its annotation is simply removed. For a second internal entity, which does not satisfy the nest relation, it cannot be determined whether the nested named entity containing it is a genuine nested named entity, possibly because internal links are missing in Wikipedia; the nested named entity containing the second internal entity therefore needs to be deleted from the Chinese Wiki entity list. Finally, the nested named entities in the Chinese nested entity list constitute the Chinese nested named entity recognition corpus.
Building on the above embodiment, in a preferred embodiment, identifying the internal entities in the Chinese Wiki entity list specifically includes:
determining the named entities without type ambiguity in the Chinese Wiki entity list;
using the Chinese Wiki entity list as a dictionary, identifying the internal entities contained in the named entities without type ambiguity by the longest-match principle.
In this preferred embodiment, using the longest-match principle to identify the internal entities contained in named entities without type ambiguity can improve recognition accuracy. Specifically, the longest-match principle can be applied from left to right to identify the internal entities contained in a named entity without type ambiguity.
For example, if the named entity is "[Shanghai Jiao Tong University Xuhui Campus] ns" and the dictionary contains the two named entities "[Shanghai Jiao Tong University] nt" and "[Xuhui] ns", the nested named entity "[[Shanghai Jiao Tong University] nt [Xuhui] ns Campus] ns" is obtained directly.
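A left-to-right longest-match scan of this kind can be sketched in Python as follows. The dictionary is assumed to map each unambiguous name to its single type; the function names `find_internal_entities` and `annotate`, and the Chinese strings (assumed back-translations of the university example), are illustrative rather than taken from the patent.

```python
def find_internal_entities(name, dictionary):
    """Scan `name` left to right; at each position take the longest
    dictionary entry that is a proper substring of `name`.

    Returns a list of (start, end, text, type) spans.
    """
    spans = []
    i = 0
    while i < len(name):
        match = None
        # Try the longest candidate first, then shrink from the right.
        for j in range(len(name), i, -1):
            candidate = name[i:j]
            if candidate != name and candidate in dictionary:
                match = (i, j, candidate, dictionary[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]      # continue after the matched span
        else:
            i += 1            # no entity starts here; move one character on
    return spans

def annotate(name, entity_type, spans):
    """Render the name in bracket notation with its internal entities marked."""
    pieces, pos = [], 0
    for start, end, text, inner_type in spans:
        pieces.append(name[pos:start])
        pieces.append("[%s]%s" % (text, inner_type))
        pos = end
    pieces.append(name[pos:])
    return "[%s]%s" % ("".join(pieces), entity_type)

# Assumed Chinese forms: Shanghai Jiao Tong University (nt), Xuhui (ns), Shanghai (ns).
dictionary = {"上海交通大学": "nt", "徐汇": "ns", "上海": "ns"}
name = "上海交通大学徐汇校区"   # "Shanghai Jiao Tong University Xuhui Campus"
spans = find_internal_entities(name, dictionary)
nested = annotate(name, "ns", spans)
```

Because the longest candidate is tried first, the scan selects the whole university name rather than the shorter "上海" (Shanghai) that also starts at position 0; the shorter entity is not marked inside it at this stage.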
Building on the above embodiment, in a preferred embodiment, determining whether each internal entity satisfies the nest relation is specifically:
judging whether the Chinese Wiki entity entries that the internal entity points to intersect with the Chinese Wiki entity entries that the external entity corresponding to the internal entity points to;
if so, determining that the internal entity is a first internal entity, which satisfies the nest relation;
if not, determining that the internal entity is a second internal entity, which does not satisfy the nest relation.
The Chinese Wiki entity entries that the internal entity points to intersect with those that its corresponding external entity points to in the following cases:
(1) The internal entity has no name ambiguity, and the Chinese Wiki entity entry it points to is identical to the Chinese Wiki entity entry that its corresponding external entity points to.
(2) The internal entity has name ambiguity, and one of the Chinese Wiki entity entries it points to is identical to the Chinese Wiki entity entry that its corresponding external entity points to.
The Chinese Wiki entity entries that the internal entity points to do not intersect with those that its corresponding external entity points to in the following cases:
(1) The internal entity has no name ambiguity, and the Chinese Wiki entity entry it points to does not appear in the internal link list of the Chinese Wiki page of its corresponding external entity.
(2) The internal entity has name ambiguity, and none of the Chinese Wiki entity entries it points to appears in the internal link list of the Chinese Wiki page of its corresponding external entity.
Note that the internal link list of a Chinese Wiki page is the set of all links in that page that point to other Wiki pages; Wiki pages correspond to Wikipedia entries.
For example, the named entities "[Tibet Autonomous Region] ns" and "[Tibet] ns" point to the same Chinese Wiki entity entry, so the latter cannot serve as an internal entity of the former. In fact, "[Tibet Autonomous Region] ns" is an indivisible whole, so the annotation of "[Tibet] ns" is removed.
"Hong Kong" in the named entity "[Hongkong] ns" has name ambiguity, and one of the Chinese Wiki entity entries it points to is identical to the Chinese Wiki entity entry that the external entity points to. In fact, however, "[Hongkong] ns" is an indivisible whole, so the annotation of "[Hong Kong] ns" is removed.
The internal link list of the named entity "[Ao Leierweier clarkes] ns" contains no reference to the named entity "[Ao Lei] ns", so the nest relation does not hold and "[Ao Leierweier clarkes] ns" is removed from the Chinese nested named entity list.
"China" in the named entity "[Chinatown station] nt" has name ambiguity, and none of the Wiki pages it points to appears in the internal link list of the external entity's Wiki page, so "China" is not an internal entity. "[Chinatown station] nt" is therefore removed from the Chinese nested named entity list.
It should be pointed out that the Wiki page that the internal entity "[73 Army] nt" in the named entity "[73 Army War of Resistance Martyrs' Cemetery] ns" points to does not appear in the internal link list of the external entity's Wiki page. The nested annotation is therefore removed from the Chinese nested named entity list, even though "[73 Army] nt" really is nested in it; this is caused by missing internal links in Wikipedia.
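The case lists and examples above can be read as a three-way decision, sketched below in Python. This reading is an assumption: the text explicitly describes only the "same entry" outcome (remove the inner annotation) and the "not in the link list" outcome (delete the nested named entity); treating everything else as a nesting that is kept is an interpretation made for this sketch. The function names, labels and entry ids (`e1`, `e2`, ...) are hypothetical.

```python
def classify_internal(internal_entries, external_entries, external_links):
    """Classify one internal entity against its external entity.

    internal_entries: entries the internal entity's name points to.
    external_entries: entries the external entity's name points to.
    external_links:   entries linked from the external entity's Wiki page.
    """
    internal = set(internal_entries)
    if internal & set(external_entries):
        # Inner and outer names denote the same entry (the Tibet case):
        # the outer entity is indivisible, so the inner mark is dropped.
        return "same_entry"
    if not (internal & set(external_links)):
        # No entry of the inner name is linked from the outer page
        # (the Chinatown station case): the nesting cannot be confirmed.
        return "unlinked"
    # Assumption of this sketch: a linked, distinct entry is a real nesting.
    return "linked"

def resolve_nested(labels):
    """Decide the fate of one nested named entity.

    labels: dict mapping each internal entity text to its label.
    Returns None if the nested entity should be deleted, otherwise the
    list of internal entities whose annotation is kept.
    """
    if "unlinked" in labels.values():
        return None
    return [text for text, label in labels.items() if label == "linked"]

# Hypothetical entry ids, chosen only to exercise the three branches.
tibet = classify_internal({"e1"}, {"e1"}, {"e2", "e3"})
china = classify_internal({"e4", "e5"}, {"e6"}, {"e7"})
campus = classify_internal({"e8"}, {"e9"}, {"e8", "e10"})
```

With these labels, `resolve_nested({"Tibet": tibet})` returns an empty list (the outer entity stays, with no internal marks), while `resolve_nested({"China": china})` returns `None` (the nested entity is dropped).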
Fig. 3 is a flowchart of another method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention. As shown in Fig. 3, in order to annotate as many of the internal entities of nested named entities as possible, thereby further refining the internal structure of nested named entities and improving the accuracy of nested named entities in the Chinese named entity recognition corpus, on the basis of the above embodiment and as a preferred embodiment, the method further includes, after step S22:
S30: Judge whether a first external entity in the Chinese nested entity list is an internal entity of a second external entity. If so, proceed to step S31; if not, continue by judging whether the next first external entity in the Chinese nested entity list is an internal entity of a second external entity, until all external entities in the Chinese nested entity list have been judged.
S31: Merge the nested structure of the first external entity into the second external entity.
In this way, a first external entity that is an internal entity of a second external entity can be merged into that second external entity. That is, in the nested named entity corresponding to the second external entity, the first external entity is annotated as an internal entity. This refines the internal structure of nested named entities and thereby improves the accuracy of nested named entities in the Chinese named entity recognition corpus. It will also be understood that merging a first external entity that is itself an internal entity of the second external entity makes the data in the Chinese nested entity list more concise. Finally, the nested named entities in the Chinese nested entity list constitute the Chinese nested named entity recognition corpus.
For example, the named entity "[[Shanghai] ns Communications University] nt" appears inside "[[Shanghai Communications University] nt [Xuhui] ns Campus] ns", so the two can be merged into the single nested named entity "[[[Shanghai] ns Communications University] nt [Xuhui] ns Campus] ns".
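Steps S30 and S31 can be illustrated with a small tree representation. This is a sketch under stated assumptions rather than the patent's own data structure: the `Entity` class, `merge_nested`, and `render` are hypothetical names, an internal entity is matched to an external entity by identical text and label, and a merged external entity is dropped from the returned list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    text: str                      # surface form of the entity
    label: str                     # type tag such as "ns" or "nt"
    children: List["Entity"] = field(default_factory=list)


def merge_nested(entities: List[Entity]) -> List[Entity]:
    """Merge each external entity into any other entity that contains it (S30/S31)."""
    by_text = {e.text: e for e in entities if e.children}
    merged = set()

    def refine(node: Entity) -> None:
        for i, child in enumerate(node.children):
            richer = by_text.get(child.text)
            if (richer is not None and richer is not child
                    and not child.children
                    and richer.label == child.label
                    and child.text != node.text):
                node.children[i] = richer      # S31: adopt the nested structure
                merged.add(richer.text)
            refine(node.children[i])

    for e in entities:
        refine(e)
    return [e for e in entities if e.text not in merged]


def render(e: Entity) -> str:
    """Render an entity in the bracket notation used in the description."""
    out, pos = "", 0
    for c in e.children:
        start = e.text.index(c.text, pos)
        out += e.text[pos:start] + render(c)
        pos = start + len(c.text)
    return f"[{out}{e.text[pos:]}] {e.label}"


university = Entity("Shanghai Communications University", "nt", [Entity("Shanghai", "ns")])
campus = Entity(
    "Shanghai Communications University Xuhui Campus", "ns",
    [Entity("Shanghai Communications University", "nt"), Entity("Xuhui", "ns")],
)
result = merge_nested([university, campus])
```

With the toy data above, `result` holds only the campus entity, and `render(result[0])` yields the merged bracket form given in the example.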
The embodiments of the method for constructing a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a construction system corresponding to the construction method. Since the embodiments of the system correspond to the embodiments of the method, reference may be made to the description of the method embodiments for the system embodiments, and identical parts are not repeated here.
Fig. 4 is a schematic diagram of the composition of a system for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention. The construction system provided in this embodiment is computer-based and, as shown in Fig. 4, includes:
A prediction module 40, configured to extract features of Chinese Wikipedia entries and to predict, according to the features, the type of the named entity corresponding to each Chinese Wiki entity entry.
A construction module 41, configured to build, based on the type and on the redirect information of the Chinese Wiki entity entries, a Chinese Wiki entity list containing the named entities, so as to constitute the corpus.
Here, a Chinese Wiki entity entry is a Chinese Wikipedia entry that contains a named entity.
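The cooperation of the two modules can be sketched as follows. This is only an illustration: `build_entity_list`, the shape of the `entries` and `redirects` dictionaries, and the toy rule `classify_by_category` are assumptions, and the toy rule merely stands in for whatever classifier the prediction module actually uses. Redirects are assumed to be given as a mapping from an entry title to the alias titles that redirect to it.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Set


def build_entity_list(
    entries: Dict[str, dict],
    redirects: Dict[str, List[str]],
    classify: Callable[[dict], Optional[str]],
) -> Dict[str, Set[str]]:
    """Map every entity name (title or redirect alias) to its predicted types."""
    entity_list: Dict[str, Set[str]] = defaultdict(set)
    for title, features in entries.items():
        label = classify(features)          # prediction module
        if label is None:                   # entry is not a named entity
            continue
        entity_list[title].add(label)       # construction module
        for alias in redirects.get(title, []):
            entity_list[alias].add(label)   # aliases inherit the entry's type
    return dict(entity_list)


def classify_by_category(features: dict) -> Optional[str]:
    """Toy stand-in classifier: looks only at category names."""
    categories = features.get("categories", [])
    if any("Universities" in c for c in categories):
        return "nt"
    if any("Cities" in c for c in categories):
        return "ns"
    return None


# Toy data, invented for illustration only.
entries = {
    "Suzhou University": {"categories": ["Universities in Jiangsu"], "abstract": "", "infobox": {}},
    "Suzhou": {"categories": ["Cities in Jiangsu"], "abstract": "", "infobox": {}},
    "Photosynthesis": {"categories": ["Biology"], "abstract": "", "infobox": {}},
}
redirects = {"Suzhou University": ["Soochow University"]}
entity_list = build_entity_list(entries, redirects, classify_by_category)
```

Storing a set of types per name makes type ambiguity explicit: a name whose set holds more than one type can be excluded when only unambiguous named entities are wanted.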
Since the system for constructing a Chinese named entity recognition corpus provided in this embodiment corresponds to the construction method described above, it has the same beneficial effects as that method, which are not repeated here.
The embodiments of the method for constructing a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a construction device corresponding to the construction method. Since the embodiments of the device correspond to the embodiments of the method, reference may be made to the description of the method embodiments for the device embodiments, and identical parts are not repeated here.
Fig. 5 is a schematic diagram of the composition of a device for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention. As shown in Fig. 5, the construction device provided in this embodiment includes:
A memory 50, configured to store a construction program;
A processor 51, configured to implement, when executing the construction program, the steps of any of the methods for constructing a Chinese named entity recognition corpus described above.
Since the processor in the construction device provided in this embodiment can implement the steps of any of the construction methods described above when it calls the construction program stored in the memory, the construction device has the same beneficial effects as the construction method described above, which are not repeated here.
The embodiments of the method for constructing a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a computer-readable storage medium corresponding to the construction method. Since the embodiments of the computer-readable storage medium correspond to the embodiments of the method, reference may be made to the description of the method embodiments, and identical parts are not repeated here.
The computer-readable storage medium provided in this embodiment stores a construction program which, when executed by a processor, implements the steps of any of the methods for constructing a Chinese named entity recognition corpus described above.
Since the construction program stored on the computer-readable storage medium provided in this embodiment can implement the steps of any of the construction methods described above when called by a processor, the computer-readable storage medium has the same beneficial effects as the construction method described above, which are not repeated here.
The method, system, device, and storage medium for constructing a named entity recognition corpus provided by the present invention have been described in detail above. The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.
It should be pointed out that those of ordinary skill in the art may make several improvements and modifications to the present invention without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

Claims (9)

1. A method for constructing a Chinese named entity recognition corpus, characterized in that it is computer-based and comprises:
extracting features of Chinese Wikipedia entries, and predicting, according to the features, the type of the named entity corresponding to each Chinese Wiki entity entry;
building, based on the type and on the redirect information of the Chinese Wiki entity entries, a Chinese Wiki entity list containing the named entities, so as to constitute the corpus;
wherein a Chinese Wiki entity entry is a Chinese Wikipedia entry that contains a named entity.
2. The method for constructing a Chinese named entity recognition corpus according to claim 1, characterized in that the extracting features of Chinese Wikipedia entries is specifically:
extracting the features from the infobox, the category box, and the abstract of the Chinese Wiki entity entries.
3. The method for constructing a Chinese named entity recognition corpus according to claim 2, characterized in that, after the building of the Chinese Wiki entity list containing the named entities, the method further comprises:
identifying internal entities in the Chinese Wiki entity list, and generating nested named entities that contain the internal entities;
adding the nested named entities to a Chinese nested entity list, and determining whether each internal entity in the Chinese nested entity list satisfies a nesting relation;
removing the annotation of a first internal entity that satisfies the nesting relation, and deleting any nested named entity that contains a second internal entity that does not satisfy the nesting relation.
4. The method for constructing a Chinese named entity recognition corpus according to claim 3, characterized in that the identifying internal entities in the Chinese Wiki entity list specifically comprises:
determining the named entities without type ambiguity in the Chinese Wiki entity list;
using the Chinese Wiki entity list as a dictionary, identifying, by the longest-match principle, the internal entities contained in the named entities without type ambiguity.
5. The method for constructing a Chinese named entity recognition corpus according to claim 4, characterized in that the determining whether each internal entity satisfies a nesting relation specifically comprises:
judging whether the Chinese Wiki entity entries pointed to by the internal entity and the Chinese Wiki entity entries pointed to by the external entity corresponding to the internal entity have an intersection;
if so, determining that the internal entity is a first internal entity that satisfies the nesting relation;
if not, determining that the internal entity is a second internal entity that does not satisfy the nesting relation.
6. The method for constructing a Chinese named entity recognition corpus according to any one of claims 3-5, characterized in that, after the removing of the annotation of the first internal entity that satisfies the nesting relation and the deleting of any nested named entity that contains a second internal entity that does not satisfy the nesting relation, the method further comprises:
judging whether a first external entity in the Chinese nested entity list is an internal entity of a second external entity;
if so, merging the nested structure of the first external entity into the second external entity.
7. A system for constructing a Chinese named entity recognition corpus, characterized in that it is computer-based and comprises:
a prediction module, configured to extract features of Chinese Wikipedia entries and to predict, according to the features, the type of the named entity corresponding to each Chinese Wiki entity entry;
a construction module, configured to build, based on the type and on the redirect information of the Chinese Wiki entity entries, a Chinese Wiki entity list containing the named entities, so as to constitute the corpus;
wherein a Chinese Wiki entity entry is a Chinese Wikipedia entry that contains a named entity.
8. A device for constructing a Chinese named entity recognition corpus, characterized in that it comprises:
a memory, configured to store a construction program;
a processor, configured to implement, when executing the construction program, the steps of the method for constructing a Chinese named entity recognition corpus according to any one of claims 1-6.
9. A computer-readable storage medium, characterized in that a construction program is stored on the computer-readable storage medium, and when the construction program is executed by a processor, the steps of the method for constructing a Chinese named entity recognition corpus according to any one of claims 1-6 are implemented.
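
The longest-match identification of internal entities recited in claim 4 might look like the following forward maximum matching sketch. The function name `find_internal_entities`, the character-level scan, and the toy dictionary are assumptions made for illustration; the claims do not fix these details. The dictionary is assumed to contain only names without type ambiguity, mapped to a single type tag.

```python
from typing import Dict, List, Tuple


def find_internal_entities(name: str, entity_types: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans of dictionary entries found inside `name`.

    At each position the longest dictionary entry is preferred; the full
    name itself is skipped so that only proper internal entities are returned.
    """
    max_len = max((len(k) for k in entity_types), default=0)
    found: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(name):
        match = None
        # Try the longest candidate first and shrink it one character at a time.
        for j in range(min(len(name), i + max_len), i, -1):
            candidate = name[i:j]
            if candidate != name and candidate in entity_types:
                match = (i, j, entity_types[candidate])
                break
        if match:
            found.append(match)
            i = match[1]           # continue after the matched span
        else:
            i += 1
    return found


# Toy dictionary, invented for illustration only.
outer = "Shanghai Communications University Xuhui Campus"
dictionary = {
    "Shanghai": "ns",
    "Shanghai Communications University": "nt",
    "Xuhui": "ns",
    outer: "ns",
}
spans = find_internal_entities(outer, dictionary)
```

Because the longest candidate wins, "Shanghai Communications University" is returned rather than the shorter "Shanghai"; the shorter name is recovered later if the nested structures of external entities are merged as in claim 6.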
CN201810325492.XA 2018-04-12 2018-04-12 Method, system, equipment and storage medium for constructing named entity recognition corpus Active CN108520065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810325492.XA CN108520065B (en) 2018-04-12 2018-04-12 Method, system, equipment and storage medium for constructing named entity recognition corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810325492.XA CN108520065B (en) 2018-04-12 2018-04-12 Method, system, equipment and storage medium for constructing named entity recognition corpus

Publications (2)

Publication Number Publication Date
CN108520065A CN108520065A (en) 2018-09-11
CN108520065B CN108520065B (en) 2022-04-12

Family

ID=63432233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810325492.XA Active CN108520065B (en) 2018-04-12 2018-04-12 Method, system, equipment and storage medium for constructing named entity recognition corpus

Country Status (1)

Country Link
CN (1) CN108520065B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399452A (en) * 2019-07-23 2019-11-01 福建奇点时空数字科技有限公司 A kind of name list of entities generation method of Case-based Reasoning feature modeling
CN111950288A (en) * 2020-08-25 2020-11-17 海信视像科技股份有限公司 Entity labeling method in named entity recognition and intelligent equipment
CN112182204A (en) * 2020-08-19 2021-01-05 广东汇银贸易有限公司 Method and device for constructing corpus labeled by Chinese named entities
CN113065353A (en) * 2021-03-16 2021-07-02 北京金堤征信服务有限公司 Entity identification method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649272A (en) * 2016-12-23 2017-05-10 东北大学 Named entity recognizing method based on mixed model
CN107239481A (en) * 2017-04-12 2017-10-10 北京大学 A kind of construction of knowledge base method towards multi-source network encyclopaedia

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LONGHUA QIAN et al.: "Tree Kernel-Based Semantic Relation Extraction Using Unified Dynamic Relation Tree", 《INTERNATIONAL CONFERENCE ON ADVANCED LANGUAGE PROCESSING AND WEB INFORMATION TECHNOLOGY》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399452A (en) * 2019-07-23 2019-11-01 福建奇点时空数字科技有限公司 A kind of name list of entities generation method of Case-based Reasoning feature modeling
CN112182204A (en) * 2020-08-19 2021-01-05 广东汇银贸易有限公司 Method and device for constructing corpus labeled by Chinese named entities
CN111950288A (en) * 2020-08-25 2020-11-17 海信视像科技股份有限公司 Entity labeling method in named entity recognition and intelligent equipment
CN111950288B (en) * 2020-08-25 2024-02-23 海信视像科技股份有限公司 Entity labeling method in named entity recognition and intelligent device
CN113065353A (en) * 2021-03-16 2021-07-02 北京金堤征信服务有限公司 Entity identification method and device
CN113065353B (en) * 2021-03-16 2024-04-02 北京金堤征信服务有限公司 Entity identification method and device

Also Published As

Publication number Publication date
CN108520065B (en) 2022-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant