CN108520065A - Method, system, device and storage medium for constructing a named entity recognition corpus - Google Patents
Method, system, device and storage medium for constructing a named entity recognition corpus
- Publication number
- CN108520065A CN201810325492.XA
- Authority
- CN
- China
- Prior art keywords
- entity
- chinese
- name
- wiki
- internal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a computer-based method for constructing a Chinese named entity recognition corpus that uses Chinese Wikipedia as source material. By extracting features of Chinese Wikipedia entries, the method classifies the entries, determines the Chinese wiki entity entries, and predicts the type of the named entity corresponding to each Chinese wiki entity entry. Finally, a Chinese wiki entity list containing the named entities is constructed based on the types and the redirect information, and the Chinese named entity recognition corpus is formed from all named entities in the Chinese wiki entity list. The resulting corpus is rich in content and covers a wide range of domains. Moreover, the corpus can be built automatically by computer, saving manpower and material resources. The invention also discloses a system and a device for constructing a Chinese named entity recognition corpus and a computer-readable storage medium, which achieve the same effects.
Description
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a method, system, device and storage medium for constructing a named entity recognition corpus.
Background
The purpose of information extraction is to extract entities and their relationships from unstructured free text and convert them into a structured representation, thereby providing a data basis for the construction of knowledge bases.
In the prior art, research on Chinese named entity recognition mainly uses high-quality, manually annotated corpora, such as the January 1998 People's Daily corpus, the Microsoft Research Asia (MSRA) corpus, the City University of Hong Kong (CityU) corpus and the ACE2005 Chinese corpus. The named entity categories, annotation rules and corpus sizes differ between these corpora. To ensure quality, these corpora must be annotated by professionals, which not only limits their size and domain but also consumes a large amount of manpower and material resources. For example, the January 1998 People's Daily corpus, which belongs to the news domain, is outdated in content, and its accuracy is low when it is applied to domains other than news.
Therefore, how to automatically build a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains is a technical problem that those skilled in the art currently need to solve.
Summary of the invention
The object of the present invention is to provide a method, system, device and storage medium for constructing a named entity recognition corpus, which can automatically build a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains.
To solve the above technical problem, the present invention provides a computer-based method for constructing a Chinese named entity recognition corpus, comprising:
extracting features of Chinese Wikipedia entries, and predicting, according to the features, the type of the named entity corresponding to each Chinese wiki entity entry;
constructing, based on the types and the redirect information of the Chinese wiki entity entries, a Chinese wiki entity list containing the named entities, so as to form the corpus;
wherein the Chinese wiki entity entries are the Chinese Wikipedia entries that contain the named entities.
Preferably, extracting the features of Chinese Wikipedia entries specifically comprises:
extracting the features from the infobox, category box and abstract of the Chinese wiki entity entries.
Preferably, after the Chinese wiki entity list containing the named entities is constructed, the method further comprises:
identifying the internal entities in the Chinese wiki entity list, and generating nested named entities containing the internal entities;
adding the nested named entities to a Chinese nested entity list, and determining whether each internal entity in the Chinese nested entity list satisfies the nesting relation;
removing the marks of the first internal entities, which satisfy the nesting relation, and deleting the nested named entities that contain second internal entities, which do not satisfy the nesting relation.
Preferably, identifying the internal entities in the Chinese wiki entity list specifically comprises:
determining the named entities without type ambiguity in the Chinese wiki entity list;
using the Chinese wiki entity list as a dictionary, and identifying, by the longest-match principle, the internal entities contained in the named entities without type ambiguity.
Preferably, determining whether each internal entity satisfies the nesting relation specifically comprises:
determining whether the Chinese wiki entity entries pointed to by the internal entity intersect with the Chinese wiki entity entries pointed to by the external entity corresponding to the internal entity;
if so, determining that the internal entity is a first internal entity, which satisfies the nesting relation;
if not, determining that the internal entity is a second internal entity, which does not satisfy the nesting relation.
Preferably, after removing the marks of the first internal entities and deleting the nested named entities that contain second internal entities, the method further comprises:
determining whether a first external entity in the Chinese nested entity list is an internal entity of a second external entity;
if so, merging the nested structure of the first external entity into the second external entity.
To solve the above technical problem, the present invention also provides a computer-based system for constructing a Chinese named entity recognition corpus, comprising:
a prediction module, configured to extract features of Chinese Wikipedia entries and to predict, according to the features, the type of the named entity corresponding to each Chinese wiki entity entry;
a construction module, configured to construct, based on the types and the redirect information of the Chinese wiki entity entries, a Chinese wiki entity list containing the named entities, so as to form the corpus;
wherein the Chinese wiki entity entries are the Chinese Wikipedia entries that contain the named entities.
To solve the above technical problem, the present invention also provides a device for constructing a Chinese named entity recognition corpus, comprising:
a memory, configured to store a construction program;
a processor, configured to implement, when executing the construction program, the steps of any of the methods for constructing a Chinese named entity recognition corpus described above.
To solve the above technical problem, the present invention also provides a computer-readable storage medium on which a construction program is stored; when the construction program is executed by a processor, the steps of any of the methods for constructing a Chinese named entity recognition corpus described above are implemented.
Compared with the prior art, the computer-based method for constructing a Chinese named entity recognition corpus provided by the present invention uses Chinese Wikipedia as source material. By extracting features of Chinese Wikipedia entries, it classifies the entries, determines the Chinese wiki entity entries, and predicts the type of the named entity corresponding to each Chinese wiki entity entry. Finally, a Chinese wiki entity list containing the named entities is constructed based on the types and the redirect information, and the Chinese named entity recognition corpus is formed from all named entities in the list. Because Wikipedia is a free-content, openly editable, multilingual online encyclopedia collaboration project that covers a large number of named entities, is rich in content and spans a wide range of domains, a corpus built with this method is likewise rich in content and broad in domain coverage. Moreover, because the method is computer-based, a corresponding computer program can automatically extract the features of Chinese Wikipedia entries, predict the named entity types and build the Chinese wiki entity list that forms the corpus, which saves a large amount of manpower and material resources. The method can therefore automatically build a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains. The present invention also provides a system and a device for constructing a Chinese named entity recognition corpus and a computer-readable storage medium, which achieve the same effects.
Brief description of the drawings
To illustrate the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention;
Fig. 2 is a flow chart of another method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention;
Fig. 3 is a flow chart of yet another method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the composition of a system for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the composition of a device for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The object of the present invention is to provide a method, system, device and storage medium for constructing a named entity recognition corpus, which can automatically build a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains.
To help those skilled in the art better understand the technical solution, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of a method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention. The method provided in this embodiment is computer-based and, as shown in Fig. 1, comprises:
S10: Extract features of Chinese Wikipedia entries, and predict, according to the features, the type of the named entity corresponding to each Chinese wiki entity entry.
Here, a Chinese Wikipedia entry is a Wikipedia entry displayed in Chinese; a Chinese wiki entity entry is a Chinese Wikipedia entry that corresponds to a named entity; a named entity refers to an object in the real world and, in text, usually consists of one or more consecutive words; the type of a named entity can be person (nr), place name (ns), organization (nt) and so on. For example, "[Beijing]ns" indicates that Beijing is a place-name entity.
Wikipedia is a free-content, openly editable, multilingual online encyclopedia collaboration project. It covers a large number of named entities, is rich in content and spans a wide range of domains. Its content is presented as individual entries, and each entry has a corresponding wiki page. The articles in wiki pages contain abundant structured, semi-structured and unstructured information, such as templates, infoboxes and page categories, which is of great value for research in natural language processing.
The features of a Chinese Wikipedia entry include effective features mined from Wikipedia, as well as extended features and word-sense features added to account for the characteristics of Chinese. These features are used to classify the Chinese Wikipedia entries and to predict the type of the named entity corresponding to each Chinese wiki entity entry. In a preferred embodiment, extracting the features of Chinese Wikipedia entries specifically means extracting the features from the infobox, category box and abstract of the Chinese wiki entity entries.
For example, suppose a Chinese wiki entity entry contains the named entity "Ma Yun". The infobox of this entry holds summary information about the subject, such as "birth: September 10, 1964", "residence: Mainland China", "nationality: People's Republic of China", "alma mater: Hangzhou Normal University" and "occupation: chairman of the board of Alibaba Group". Field names such as "name", "birth", "residence", "alma mater" and "occupation" can then be extracted as bag-of-words features. The category box of the entry contains "Born in 1964", "Living people", "Billionaires of the People's Republic of China", "Alibaba Group" and so on, so the head word of each category, such as "birth", "person" and "billionaire", can be extracted as a feature. The definition sentence in the abstract of the entry is "Ma Yun (English name Jack Ma, born September 10, 1964) is an entrepreneur of the People's Republic of China", so the head word "entrepreneur" can be extracted as a feature.
In a specific implementation, a batch of Chinese wiki entity entries whose named entity types have already been labeled can be selected in advance as training data for an existing classifier, and the classifier is trained to obtain a classification model. For step S10, multiple features can be extracted from the infobox, category box and abstract of a Chinese wiki entity entry and combined into the feature vector of the named entity. The existing classifier and the pre-trained classification model are then used to make a prediction on the feature vector, yielding the type of the corresponding named entity.
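The feature extraction in step S10 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the record layout (`infobox_fields`, `category_heads`, `abstract_head`), the function names and the feature prefixes are all assumptions, and head-word extraction is assumed to have been done beforehand.

```python
from collections import Counter


def extract_features(entry):
    """Collect string features from the infobox, category box and abstract."""
    features = Counter()
    # Infobox field names serve as bag-of-words features.
    for field in entry.get("infobox_fields", []):
        features["infobox:" + field] += 1
    # Only the head word of each category is kept (assumed precomputed).
    for head in entry.get("category_heads", []):
        features["category:" + head] += 1
    # Head word of the definition sentence in the abstract, if any.
    if entry.get("abstract_head"):
        features["abstract:" + entry["abstract_head"]] += 1
    return features


def to_vector(features, vocabulary):
    """Project a feature Counter onto a fixed vocabulary order."""
    return [features.get(name, 0) for name in vocabulary]


# Hypothetical record mirroring the worked example in the text.
entry = {
    "infobox_fields": ["name", "birth", "residence", "alma mater", "occupation"],
    "category_heads": ["birth", "person", "billionaire"],
    "abstract_head": "entrepreneur",
}
features = extract_features(entry)
vocabulary = sorted(features)
vector = to_vector(features, vocabulary)
print(vocabulary)
print(vector)
```

The resulting vector could then be passed to whatever classifier is used; the text only says "existing classifier" and does not name one.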
S11: Based on the types and the redirect information of the Chinese wiki entity entries, construct a Chinese wiki entity list containing the named entities, so as to form the corpus.
A named entity may have multiple names, including a title and aliases. The redirect information of the Chinese wiki entity entries can therefore be used to identify names that differ but denote the same named entity. Together with the type of the named entity, this determines a named entity. Because a named entity is determined by its name and type, the Chinese wiki entity list built from the types and the redirect information should contain at least two fields: the name and the type of each named entity. It should be understood that, in the Chinese wiki entity list, the name and type of the same named entity correspond to each other. Finally, all named entities written into the Chinese wiki entity list form the corpus.
Named entities can be divided according to whether the entity name is ambiguous into named entities with name ambiguity and named entities without name ambiguity. Named entities with name ambiguity can be further divided according to whether the type is ambiguous into named entities with type ambiguity and named entities without type ambiguity. Name ambiguity means that the same entity name points to two or more Chinese wiki entity entries; type ambiguity means that a named entity with name ambiguity has two or more types.
For example, the named entity with the name "Bandung" corresponds to the Chinese wiki entity entries "22735" and "5044266", so this named entity has name ambiguity. If the type of the named entity is "ORG" in entry "22735" and "PER" in entry "5044266", the named entity also has type ambiguity.
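The two kinds of ambiguity can be checked with a few lines of code. The sketch below assumes each entity name comes with a list of (entry id, type) pairs; the function name and data layout are illustrative only.

```python
def classify_ambiguity(readings):
    """readings: list of (entry_id, entity_type) pairs sharing one entity name.

    Returns (name_ambiguous, type_ambiguous).
    """
    entry_ids = {entry_id for entry_id, _ in readings}
    types = {entity_type for _, entity_type in readings}
    # Name ambiguity: the name points to two or more entries.
    name_ambiguous = len(entry_ids) >= 2
    # Type ambiguity only applies to names that are already name-ambiguous.
    type_ambiguous = name_ambiguous and len(types) >= 2
    return name_ambiguous, type_ambiguous


bandung = [("22735", "ORG"), ("5044266", "PER")]
print(classify_ambiguity(bandung))  # (True, True)
```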
When the Chinese wiki entity list is built, the name and type of each named entity need to be written into the list. Named entities fall into two groups. The first group is named entities without type ambiguity, which includes named entities without name ambiguity and named entities that have name ambiguity but no type ambiguity. The second group is named entities with type ambiguity. In a specific implementation, for the first group, the name of the named entity and its unique type can be added to the Chinese wiki entity list together; for the second group, the name of the named entity and its multiple different types can be added to the list together. It should be understood that a named entity with type ambiguity necessarily has name ambiguity as well.
For example, if the name of a named entity is "Ke Lisidengnaijiate", the corresponding Chinese wiki entity entry is "125" and the type is "PER", then "Ke Lisidengnaijiate PER" is added directly to the Chinese wiki entity list. If the name of a named entity is "Bandung", and the corresponding Chinese wiki entity entries are "22735" and "5044266", where the type is "ORG" in entry "22735" and "PER" in entry "5044266", then "Bandung PER, ORG" is added to the Chinese wiki entity list.
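Step S11 can be sketched as a small aggregation over typed entries and redirects. The data layout, the function name and the alias used below are assumptions made for illustration; in particular, the redirect "Ke Lisideng" is a hypothetical alias, not one taken from the patent. Types are sorted alphabetically here, which is an arbitrary choice.

```python
from collections import defaultdict


def build_entity_list(typed_entries, redirects):
    """Build {name: sorted list of types}.

    typed_entries: {entry_id: (title, predicted_type)}
    redirects:     {alias: entry_id}, taken from redirect pages
    """
    name_types = defaultdict(set)
    # Each entry title contributes its predicted type.
    for title, entity_type in typed_entries.values():
        name_types[title].add(entity_type)
    # Each alias inherits the type of the entry it redirects to.
    for alias, entry_id in redirects.items():
        if entry_id in typed_entries:
            name_types[alias].add(typed_entries[entry_id][1])
    return {name: sorted(types) for name, types in name_types.items()}


typed_entries = {
    "125": ("Ke Lisidengnaijiate", "PER"),
    "22735": ("Bandung", "ORG"),
    "5044266": ("Bandung", "PER"),
}
redirects = {"Ke Lisideng": "125"}  # hypothetical alias
entity_list = build_entity_list(typed_entries, redirects)
for name, types in entity_list.items():
    print(name, ", ".join(types))
```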
In summary, the computer-based method for constructing a Chinese named entity recognition corpus provided by this embodiment of the present invention uses Chinese Wikipedia as source material. By extracting features of Chinese Wikipedia entries, it classifies the entries, determines the Chinese wiki entity entries, and predicts the type of the named entity corresponding to each Chinese wiki entity entry. Finally, a Chinese wiki entity list containing the named entities is constructed based on the types and the redirect information, and the Chinese named entity recognition corpus is formed from all named entities in the list. Because Wikipedia is a free-content, openly editable, multilingual online encyclopedia collaboration project that covers a large number of named entities, is rich in content and spans a wide range of domains, a corpus built with this method is likewise rich in content and broad in domain coverage. Moreover, because the method is computer-based, a corresponding computer program can automatically extract the features of Chinese Wikipedia entries, predict the named entity types and build the Chinese wiki entity list that forms the corpus, which saves a large amount of manpower and material resources. The method can therefore automatically build a Chinese named entity recognition corpus that is rich in content and applicable to a wide range of domains.
The named entities in the Chinese wiki entity list described above include both nested and non-nested named entities, and no distinction is made between them. Consequently, the corpus formed from these named entities does not distinguish nested named entities from non-nested ones either. In practical applications, however, nested named entities carry rich entity information and relationships between entities, and their structure is complex and variable, so the recognition of nested named entities is also a task worth studying in information extraction.
Fig. 2 is a flow chart of another method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention. As shown in Fig. 2, on the basis of the above embodiment, in a preferred embodiment, after the Chinese wiki entity list containing the named entities is constructed, the method further comprises:
S20: Identify the internal entities in the Chinese wiki entity list, and generate nested named entities containing the internal entities.
S21: Add the nested named entities to a Chinese nested entity list, and determine whether each internal entity in the Chinese nested entity list satisfies the nesting relation.
S22: Remove the marks of the first internal entities, which satisfy the nesting relation, and delete the nested named entities that contain second internal entities, which do not satisfy the nesting relation.
It should be noted that a nested named entity is a named entity that has one or more named entities nested inside it; an internal entity is a named entity nested inside a nested named entity; an external entity is the outermost named entity of a nested named entity; a first internal entity is an internal entity in the Chinese nested entity list that satisfies the nesting relation; and a second internal entity is an internal entity in the Chinese nested entity list that does not satisfy the nesting relation.
In this preferred embodiment, after the internal entities in the Chinese wiki entity list are identified, nested named entities containing the internal entities can be generated and added to the Chinese nested entity list. Each internal entity in the Chinese nested entity list is then checked against the nesting relation to determine which internal entities satisfy it and which do not. Finally, the nested named entities that satisfy the nesting relation are retained and those that do not are deleted. For a first internal entity, which satisfies the nesting relation, its mark is simply removed. For a second internal entity, which does not satisfy the nesting relation, it cannot currently be determined whether the nested named entity containing it is truly a nested named entity, possibly because internal links are missing in Wikipedia. In this case, the nested named entity containing the second internal entity needs to be deleted from the Chinese wiki entity list. Finally, the nested named entities in the Chinese nested entity list form the Chinese nested named entity recognition corpus.
On the basis of the above embodiment, in a preferred embodiment, identifying the internal entities in the Chinese wiki entity list specifically comprises:
determining the named entities without type ambiguity in the Chinese wiki entity list;
using the Chinese wiki entity list as a dictionary, and identifying, by the longest-match principle, the internal entities contained in the named entities without type ambiguity.
In this preferred embodiment, using the longest-match principle to identify the internal entities contained in named entities without type ambiguity can improve recognition accuracy. Specifically, the longest-match principle can be applied from left to right to identify the internal entities contained in a named entity without type ambiguity.
For example, if the named entity is "[Shanghai Jiao Tong University Xuhui Campus]ns" and the dictionary contains the two named entities "[Shanghai Jiao Tong University]nt" and "[Xuhui]ns", the nested named entity "[[Shanghai Jiao Tong University]nt [Xuhui]ns Campus]ns" is obtained directly.
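Left-to-right longest matching against the entity list can be sketched as below. This is an illustrative simplification: English word tokens stand in for the Chinese characters of the original names, the dictionary is a plain name-to-type mapping, and the whole entity is excluded from matching itself (an assumption, since the entity list used as the dictionary also contains the outer entity).

```python
def find_internal_entities(tokens, dictionary):
    """Return (start, end, type) spans found by left-to-right longest match."""
    found = []
    n = len(tokens)
    i = 0
    while i < n:
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(n, i, -1):
            if j - i == n:
                continue  # skip the outer entity itself
            candidate = " ".join(tokens[i:j])
            if candidate in dictionary:
                match = (i, j, dictionary[candidate])
                break
        if match:
            found.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return found


def render(tokens, spans, outer_type):
    """Write the nested entity in the bracket notation used in the text."""
    parts, i = [], 0
    for start, end, entity_type in spans:
        parts.extend(tokens[i:start])
        parts.append("[" + " ".join(tokens[start:end]) + "]" + entity_type)
        i = end
    parts.extend(tokens[i:])
    return "[" + " ".join(parts) + "]" + outer_type


dictionary = {
    "Shanghai Jiao Tong University Xuhui Campus": "ns",
    "Shanghai Jiao Tong University": "nt",
    "Shanghai": "ns",
    "Xuhui": "ns",
}
tokens = "Shanghai Jiao Tong University Xuhui Campus".split()
spans = find_internal_entities(tokens, dictionary)
nested = render(tokens, spans, "ns")
print(nested)  # [[Shanghai Jiao Tong University]nt [Xuhui]ns Campus]ns
```

Because the longest candidate wins, "Shanghai" alone is not marked at this stage even though it is in the dictionary.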
On the basis of the above embodiment, in a preferred embodiment, determining whether each internal entity satisfies the nesting relation specifically comprises:
determining whether the Chinese wiki entity entries pointed to by the internal entity intersect with the Chinese wiki entity entries pointed to by the external entity corresponding to the internal entity;
if so, determining that the internal entity is a first internal entity, which satisfies the nesting relation;
if not, determining that the internal entity is a second internal entity, which does not satisfy the nesting relation.
The Chinese wiki entity entries pointed to by the internal entity intersect with those pointed to by the corresponding external entity in the following cases:
(1) The internal entity has no name ambiguity, and the Chinese wiki entity entry it points to is the same as the Chinese wiki entity entry pointed to by the corresponding external entity.
(2) The internal entity has name ambiguity, and one of the Chinese wiki entity entries it points to is the same as the Chinese wiki entity entry pointed to by the corresponding external entity.
The Chinese wiki entity entries pointed to by the internal entity do not intersect with those pointed to by the corresponding external entity in the following cases:
(1) The internal entity has no name ambiguity, and the Chinese wiki entity entry it points to does not appear in the internal link list of the Chinese wiki page of the corresponding external entity.
(2) The internal entity has name ambiguity, and none of the Chinese wiki entity entries it points to appears in the internal link list of the Chinese wiki page of the corresponding external entity.
It should be noted that the internal link list of a Chinese wiki page refers to all links in that page that point to other wiki pages; wiki pages correspond to Wikipedia entries.
For example, the named entities "[Tibet Autonomous Region]ns" and "[Tibet]ns" point to the same Chinese wiki entity entry, so the latter cannot serve as an internal entity of the former. In fact, "[Tibet Autonomous Region]ns" is an indivisible whole, so the mark of the named entity "[Tibet]ns" is removed.
"Hong Kong" in the named entity "[Hongkong]ns" has name ambiguity, and one of the Chinese wiki entity entries it points to is the same as the entry pointed to by the external entity. In fact, however, "[Hongkong]ns" is an indivisible whole, so the mark of "[Hong Kong]ns" is removed.
The internal link list of the named entity "[Ao Leierweier clarkes]ns" contains no reference to the named entity "[Ao Lei]ns", so the nesting relation does not hold, and "[Ao Leierweier clarkes]ns" is removed from the Chinese nested named entity list.
"China" in the named entity "[Chinatown station]nt" has name ambiguity, and none of the wiki pages it points to appears in the internal link list of the external entity's wiki page, so "China" is not an internal entity. "[Chinatown station]nt" is removed from the Chinese nested named entity list.
It should be pointed out that the wiki page pointed to by the internal entity "[73 Army]nt" in the named entity "[70 army's war of resistance bed of honour]ns" does not appear in the internal link list of the external entity's wiki page. "[70 army]nt" is therefore removed from the Chinese nested named entity list, even though "[70 army]nt" is a genuine nested named entity; this is caused by missing internal links in Wikipedia.
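One possible reading of the checks above is sketched below. The function name, the three return labels and all entry ids are hypothetical. The "keep" branch, for an internal entity whose entry differs from the external entity's entry but is linked from the external entity's page, is an interpretation; the text does not spell that case out.

```python
def check_internal_entity(inner_entries, outer_entries, outer_page_links):
    """Decide what to do with one internal entity of a nested named entity.

    inner_entries:    entry ids the internal entity points to
    outer_entries:    entry ids the external entity points to
    outer_page_links: entry ids linked from the external entity's wiki page
    """
    inner = set(inner_entries)
    if inner & set(outer_entries):
        # Same entry as the external entity: remove the internal mark.
        return "remove_mark"
    if not inner & set(outer_page_links):
        # No link from the external page: delete the nested named entity.
        return "delete_nested_entity"
    return "keep"


# Hypothetical ids illustrating the three outcomes.
same_entry = check_internal_entity({"e1"}, {"e1"}, {"e7"})
unlinked = check_internal_entity({"c1", "c2"}, {"s1"}, {"x1"})
linked = check_internal_entity({"u1"}, {"k1"}, {"u1", "d1"})
print(same_entry, unlinked, linked)
```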
Fig. 3 is a flow chart of yet another method for constructing a Chinese named entity recognition corpus provided by an embodiment of the present invention. As shown in Fig. 3, in order to mark out as many internal entities in the nested named entities as possible, further refine the internal structure of the nested named entities, and improve the accuracy of the nested named entities in the Chinese named entity recognition corpus, on the basis of the above embodiment, in a preferred embodiment, after step S22, the method further comprises:
S30: Determine whether a first external entity in the Chinese nested entity list is an internal entity of a second external entity. If so, proceed to step S31; if not, continue to check whether the next first external entity in the Chinese nested entity list is an internal entity of a second external entity, until all external entities in the Chinese nested entity list have been checked.
S31: Merge the nested structure of the first external entity into the second external entity.
In this way, a first external entity that is an internal entity of a second external entity is merged into the second external entity. That is, within the nested named entity corresponding to the second external entity, the first external entity is marked as an internal entity together with its own nested structure. This refines the internal structure of the nested named entity and thereby improves the accuracy of the nested named entities in the Chinese named entity recognition corpus. It should also be understood that merging such first external entities into the second external entity makes the data in the Chinese nested entity list more concise. Finally, the nested named entities in the Chinese nested entity list form the Chinese nested named entity recognition corpus.
For example, the named entity "[[Shanghai] ns Jiao Tong University] nt" appears inside "[[Shanghai Jiao Tong University] nt [Xuhui] ns Campus] ns", so the two can be merged into the single nested named entity "[[[Shanghai] ns Jiao Tong University] nt [Xuhui] ns Campus] ns".
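Steps S30 and S31 can be sketched on the Shanghai example as follows. This is a hedged illustration only: the `Entity` class, the helper names `is_internal_of` and `merge_into`, and the English renderings of the names are assumptions introduced here, and matching an inner entity by identical text and label is one possible reading of "is an internal entity of".

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Entity:
    label: str                         # entity type tag such as "ns" or "nt"
    parts: List[Union[str, "Entity"]]  # plain text and nested entities, in order

    def text(self) -> str:
        return "".join(p if isinstance(p, str) else p.text() for p in self.parts)

    def render(self) -> str:
        inner = "".join(p if isinstance(p, str) else p.render() for p in self.parts)
        return f"[{inner}]{self.label}"


def same_entity(a: Entity, b: Entity) -> bool:
    """Two annotations denote the same entity if both text and label agree."""
    return (a.label, a.text()) == (b.label, b.text())


def is_internal_of(first: Entity, second: Entity) -> bool:
    """S30: does `first` occur as an entity somewhere inside `second`?"""
    return any(
        isinstance(p, Entity) and (same_entity(p, first) or is_internal_of(first, p))
        for p in second.parts
    )


def merge_into(first: Entity, second: Entity) -> Entity:
    """S31: copy `second`, giving matching inner entities the structure of `first`."""
    merged = []
    for part in second.parts:
        if isinstance(part, str):
            merged.append(part)
        elif same_entity(part, first):
            merged.append(first)
        else:
            merged.append(merge_into(first, part))
    return Entity(second.label, merged)


first = Entity("nt", [Entity("ns", ["Shanghai"]), " Jiao Tong University"])
second = Entity(
    "ns",
    [Entity("nt", ["Shanghai Jiao Tong University"]), " ", Entity("ns", ["Xuhui"]), " Campus"],
)
merged = merge_into(first, second) if is_internal_of(first, second) else second
print(merged.render())  # [[[Shanghai]ns Jiao Tong University]nt [Xuhui]ns Campus]ns
```

Returning a new `Entity` rather than mutating `second` keeps the original list entry intact until the caller decides to replace it.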
The embodiments of the construction method for a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a construction system corresponding to the construction method. Because the embodiments of the system correspond to the embodiments of the method, refer to the description of the method embodiments for the system embodiments; the common parts are not repeated here.
Fig. 4 is a schematic diagram of the composition of a construction system for a Chinese named entity recognition corpus provided by an embodiment of the present invention. The construction system provided in this embodiment is computer-based and, as shown in Fig. 4, includes:
A prediction module 40, configured to extract features of Chinese Wikipedia entries and to predict, according to the features, the type of the named entity corresponding to each Chinese Wiki entity entry.
A construction module 41, configured to construct, based on the type and the redirection information of the Chinese Wiki entity entries, a Chinese Wiki entity list containing the named entities, so as to form the corpus.
Here, a Chinese Wiki entity entry is a Chinese Wikipedia entry that contains a named entity.
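The role of the construction module can be illustrated with a small sketch. It assumes that the prediction module has already produced a mapping from entity-entry titles to type tags (non-entity entries are simply absent) and that redirection information is available as an alias-to-target mapping; the function name `build_entity_list`, both dictionaries, and the sample titles are hypothetical. Each name is mapped to a set of types so that names with more than one type can later be treated as ambiguous.

```python
from collections import defaultdict
from typing import Dict, Set


def build_entity_list(
    predicted_types: Dict[str, str],
    redirects: Dict[str, str],
) -> Dict[str, Set[str]]:
    """Map every entity name (entry title or redirect alias) to the types it can take."""
    entity_list: Dict[str, Set[str]] = defaultdict(set)
    # Each entity entry contributes its own title.
    for title, entity_type in predicted_types.items():
        entity_list[title].add(entity_type)
    # A redirect alias inherits the type of its target, if the target is an entity entry.
    for alias, target in redirects.items():
        if target in predicted_types:
            entity_list[alias].add(predicted_types[target])
    return dict(entity_list)


# Illustrative inputs only (assumed).
predicted_types = {"Shanghai": "ns", "Shanghai Jiao Tong University": "nt"}
redirects = {
    "SJTU": "Shanghai Jiao Tong University",
    "Some list page": "List of universities",  # target is not an entity entry
}

entity_list = build_entity_list(predicted_types, redirects)
print(entity_list)
```

In this sketch an alias whose target was not classified as an entity entry is left out of the list.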
Because the construction system for a Chinese named entity recognition corpus provided in this embodiment corresponds to the construction method described above, it has the same beneficial effects as that method, which are not repeated here.
The embodiments of the construction method for a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a construction device corresponding to the construction method. Because the embodiments of the device correspond to the embodiments of the method, refer to the description of the method embodiments for the device embodiments; the common parts are not repeated here.
Fig. 5 is a schematic diagram of the composition of a construction device for a Chinese named entity recognition corpus provided by an embodiment of the present invention. As shown in Fig. 5, the construction device provided in this embodiment includes:
A memory 50, configured to store a construction program;
A processor 51, configured to implement, when executing the construction program, the steps of any of the construction methods for a Chinese named entity recognition corpus described above.
Because the processor in the construction device provided in this embodiment can implement the steps of any of the construction methods described above when it calls the construction program stored in the memory, the construction device has the same beneficial effects as the construction method described above, which are not repeated here.
The embodiments of the construction method for a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a computer-readable storage medium corresponding to the construction method. Because the embodiments of the computer-readable storage medium correspond to the embodiments of the method, refer to the description of the method embodiments for the storage medium embodiments; the common parts are not repeated here.
This embodiment provides a computer-readable storage medium storing a construction program; when the construction program is executed by a processor, the steps of any of the construction methods for a Chinese named entity recognition corpus described above are implemented.
Because the construction program stored on the computer-readable storage medium provided in this embodiment can implement the steps of any of the construction methods described above when called by a processor, the computer-readable storage medium has the same beneficial effects as the construction method described above, which are not repeated here.
The construction method, system, device, and storage medium for a named entity recognition corpus provided by the present invention have been described in detail above. The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments may be referred to one another.
It should be pointed out that those of ordinary skill in the art may make several improvements and modifications to the present invention without departing from its principle, and these improvements and modifications also fall within the protection scope of the claims of the present invention.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Claims (9)
1. A construction method for a Chinese named entity recognition corpus, characterized in that it is computer-based and comprises:
extracting features of Chinese Wikipedia entries, and predicting, according to the features, the type of the named entity corresponding to a Chinese Wiki entity entry;
constructing, based on the type and the redirection information of the Chinese Wiki entity entry, a Chinese Wiki entity list containing the named entity, so as to form a corpus;
wherein the Chinese Wiki entity entry is a Chinese Wikipedia entry that contains the named entity.
2. The construction method for a Chinese named entity recognition corpus according to claim 1, characterized in that extracting the features of the Chinese Wikipedia entries is specifically:
extracting the features from the infobox, category box, and abstract of the Chinese Wiki entity entry.
3. The construction method for a Chinese named entity recognition corpus according to claim 2, characterized in that, after constructing the Chinese Wiki entity list containing the named entity, the method further comprises:
identifying internal entities in the Chinese Wiki entity list, and generating nested named entities containing the internal entities;
adding the nested named entities to a Chinese nested entity list, and determining whether each internal entity in the Chinese nested entity list satisfies a nesting relation;
removing the mark of a first internal entity that satisfies the nesting relation, and deleting a nested named entity that contains a second internal entity that does not satisfy the nesting relation.
4. The construction method for a Chinese named entity recognition corpus according to claim 3, characterized in that identifying the internal entities in the Chinese Wiki entity list specifically comprises:
determining the named entities without type ambiguity in the Chinese Wiki entity list;
using the Chinese Wiki entity list as a dictionary, identifying, by the longest-match principle, the internal entities contained in the named entities without type ambiguity.
5. The construction method for a Chinese named entity recognition corpus according to claim 4, characterized in that determining whether each internal entity satisfies the nesting relation specifically comprises:
judging whether the Chinese Wiki entity entries pointed to by the internal entity and the Chinese Wiki entity entries pointed to by the external entity corresponding to the internal entity have an intersection;
if so, determining that the internal entity is a first internal entity that satisfies the nesting relation;
if not, determining that the internal entity is a second internal entity that does not satisfy the nesting relation.
6. The construction method for a Chinese named entity recognition corpus according to any one of claims 3 to 5, characterized in that, after removing the mark of the first internal entity that satisfies the nesting relation and deleting the nested named entity that contains the second internal entity that does not satisfy the nesting relation, the method further comprises:
judging whether a first external entity in the Chinese nested entity list is an internal entity of a second external entity;
if so, merging the nested structure of the first external entity into the second external entity.
7. A construction system for a Chinese named entity recognition corpus, characterized in that it is computer-based and comprises:
a prediction module, configured to extract features of Chinese Wikipedia entries and to predict, according to the features, the type of the named entity corresponding to a Chinese Wiki entity entry;
a construction module, configured to construct, based on the type and the redirection information of the Chinese Wiki entity entry, a Chinese Wiki entity list containing the named entity, so as to form a corpus;
wherein the Chinese Wiki entity entry is a Chinese Wikipedia entry that contains the named entity.
8. A construction device for a Chinese named entity recognition corpus, characterized in that it comprises:
a memory, configured to store a construction program;
a processor, configured to implement, when executing the construction program, the steps of the construction method for a Chinese named entity recognition corpus according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that a construction program is stored on the computer-readable storage medium, and when the construction program is executed by a processor, the steps of the construction method for a Chinese named entity recognition corpus according to any one of claims 1 to 6 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810325492.XA CN108520065B (en) | 2018-04-12 | 2018-04-12 | Method, system, equipment and storage medium for constructing named entity recognition corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108520065A | 2018-09-11 |
CN108520065B | 2022-04-12 |
Family
ID=63432233
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201810325492.XA | Active | CN108520065B (en) | 2018-04-12 | 2018-04-12 | Method, system, equipment and storage medium for constructing named entity recognition corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108520065B (en) |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |