CN112182204A - Method and device for constructing corpus labeled by Chinese named entities - Google Patents

Info

Publication number
CN112182204A
Authority
CN
China
Prior art keywords
entity
article
corpus
chinese
named
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010838001.9A
Other languages
Chinese (zh)
Inventor
马汶棋
姜春涛
张桐阁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hu Lin
Original Assignee
Guangdong Huiyin Trading Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Huiyin Trading Co ltd filed Critical Guangdong Huiyin Trading Co ltd
Priority to CN202010838001.9A priority Critical patent/CN112182204A/en
Publication of CN112182204A publication Critical patent/CN112182204A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for constructing a corpus annotated with Chinese named entities. The method comprises the following steps. S101: obtaining articles of Chinese Wikipedia and classifying the articles into different entity types, wherein the classification method comprises a combination of classification based on heuristic rules, classification based on machine learning, and classification based on the DBpedia knowledge base ontology. S102: dividing each article into sentences to form a sentence set, and labeling the named entities of each sentence according to the link targets of the article. S103: screening the sentences to form a corpus. The method can construct a corpus annotated with Chinese named entities from external knowledge sources such as the Chinese Wikipedia knowledge base and the DBpedia knowledge base, without manually constructing and annotating text data, so it takes little time and costs little. Because the Chinese Wikipedia knowledge base covers a wide range of fields, a named entity recognition model trained on the corpus depends less on domain data, which reduces the difficulty of adapting and migrating the named entity recognition model.

Description

Method and device for constructing corpus labeled by Chinese named entities
Technical Field
The invention relates to the field of natural language processing, in particular to a method and a device for constructing a corpus labeled by Chinese named entities.
Background
Named entity recognition (NER), one of the most important subtasks of information extraction, identifies the most prominent and most informative units in a text, such as person names, company names, locations, currencies, and times. The identified named entities play a very important role in natural language processing tasks such as text classification, question answering, and event extraction. The purpose of named entity recognition is to determine the boundaries of named entities and their types from unstructured text. Named entity types are defined differently in different application scenarios, for example product names and attribute names in industrial production, or compound names in biochemistry. Generally, the most common named entity types are people, organizations, and places.
Early named entity recognition systems [1] were based primarily on linguistic tools (or manual rules) and dictionary lists, and were limited to the common entities present in the dictionary and the entities that manual rules could recognize. Such systems are domain-oriented and difficult to develop. To overcome these defects, named entity recognition using machine learning methods, such as maximum entropy and AdaBoost, became the mainstream. A supervised machine learning method infers patterns related to named entities from a labeled training corpus, based on morphological, lexical, syntactic, and contextual features. Supervised machine learning is currently the mainstream approach to named entity recognition; however, its prerequisite is that a large amount of text data must be constructed and labeled manually, which is time-consuming, labor-intensive, and requires skill. The manually annotated data currently used by machine learning methods is limited to a few public news corpora based on generic named entity types. If a new named entity recognition model is to be trained for a different language or for different predefined named entity types, a correspondingly labeled corpus is a necessity. In addition, when a named entity recognition model trained in a specific domain is applied to a domain it was not trained on, its performance is lower than in the original domain. Because machine-learning-based named entity recognition depends on domain data, adapting and migrating existing named entity recognition systems to new domains is quite difficult.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a device for constructing a corpus annotated with Chinese named entities. The corpus can be constructed from external knowledge sources such as the Chinese Wikipedia knowledge base and the DBpedia knowledge base, so text data does not need to be constructed and labeled manually, which takes little time and costs little. Because the Chinese Wikipedia knowledge base covers a wide range of fields, a named entity recognition model trained on the corpus depends less on domain data, which reduces the difficulty of adapting and migrating the named entity recognition model.
In order to solve the above problems, the present invention adopts the following technical solution: a method for constructing a corpus annotated with Chinese named entities, the method comprising: S101: obtaining an article of Chinese Wikipedia and classifying the article into different entity types, wherein the classification method comprises a combination of classification based on heuristic rules, classification based on machine learning, and classification based on the DBpedia knowledge base ontology; S102: dividing the article into sentences to form a sentence set, and labeling the named entities of each sentence according to the link targets of the article; S103: screening the sentences to form a corpus.
Further, the heuristic rule-based classification determines the classification of the articles through wikipedia classification, attribute box templates, cross-language links, and article titles.
Further, the classification based on machine learning takes the first sentence, the structured information, the category label and the title of the article as the basic feature set, and the article is classified according to the basic feature set.
Further, the classification based on the DBpedia knowledge base ontology performs entity disambiguation on the entities described in the article using DBpedia knowledge base data.
Further, the step of labeling the named entities of each sentence according to the link targets of the article specifically includes: acquiring all outgoing links of the article, and constructing a candidate alias set according to those links; and constructing a trie from the candidate alias set, searching the trie for the marks corresponding to each sentence in the sentence set, and labeling the named entities of the sentence according to those marks.
Further, the step of screening the sentences to form a corpus specifically includes: screening the sentences to form the corpus according to an input instruction and/or a named entity recognition tool.
Further, the step of screening the sentences to form the corpus according to the named entity recognition tool specifically includes: identifying the labels of a sentence with two different named entity recognition tools, and deciding, according to the recognition results, whether to modify the labels of the sentence and whether to place the sentence into the corpus.
Further, the step of labeling the named entity for each sentence according to the link target of the article further comprises: extending an entity dictionary based on the article.
Further, the step of expanding the entity dictionary based on the articles specifically includes: retrieving articles corresponding to each seed entity based on predefined categories of the entity dictionary; marking the seed entity according to the fine-grained category of the article; and identifying potential named entities according to links contained in the articles corresponding to the seed entities, judging the categories of the potential named entities through a matching function, and determining whether the potential named entities can be used as entries of the entity dictionary according to a judgment result.
Based on the same inventive concept, the application also provides a device for constructing a corpus annotated with Chinese named entities, wherein the device comprises a processor and a memory, and the processor is coupled with the memory; the memory stores program data that the processor executes to implement the method for constructing a corpus annotated with Chinese named entities described above.
Compared with the prior art, the invention has the following beneficial effects: the corpus annotated with Chinese named entities can be constructed from external knowledge sources such as the Chinese Wikipedia knowledge base and the DBpedia knowledge base without manually constructing and labeling text data, so it takes little time and costs little; and because the Chinese Wikipedia knowledge base covers a wide range of fields, a named entity recognition model trained on the corpus depends less on domain data, which reduces the difficulty of adapting and migrating the named entity recognition model.
Drawings
FIG. 1 is a flowchart of an embodiment of a method for constructing a corpus annotated with Chinese named entities according to the present invention;
FIG. 2 is a flowchart illustrating an embodiment of a method for constructing a corpus annotated with Chinese named entities according to the present invention;
FIG. 3 is a diagram illustrating an embodiment of an apparatus for constructing a corpus labeled with named entities according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
Referring to FIGS. 1-2, FIG. 1 is a flowchart of an embodiment of a method for constructing a corpus annotated with Chinese named entities according to the present invention; FIG. 2 is a flowchart illustrating an embodiment of a method for constructing a corpus annotated with Chinese named entities according to the present invention. The method is described in detail below with reference to FIGS. 1-2.
Since about 74% of Wikipedia articles describe entities of various types, many of the links in Wikipedia amount to annotations of the corresponding entities. For each sentence in a Chinese Wikipedia article: if the sentence contains an explicit mark for a mentioned entity, the type of the entity is determined by performing entity classification on the explicit mark, and the entity is tagged with that type; if the sentence contains an implicit mark for a mentioned entity, the entity is tagged by identifying the boundary of the implicit mark and the entity type to which it belongs.
In this embodiment, the method for constructing a corpus annotated by a chinese named entity includes:
S101: obtaining an article of Chinese Wikipedia, and classifying the article into different entity types, wherein the classification method comprises a combination of classification based on heuristic rules, classification based on machine learning, and classification based on the DBpedia knowledge base ontology.
In this embodiment, the device that executes the method for constructing the corpus annotated with Chinese named entities may be a computer, a server, a cloud server, a tablet computer, a mobile phone, or another physical intelligent device or virtual terminal.
In this embodiment, classification based on heuristic rules, classification based on machine learning, and classification based on the DBpedia knowledge base ontology may be freely combined according to the actual needs of users and then used for article classification. The possible combinations include:
1) a combination of heuristic rule-based classification and machine learning-based classification; 2) DBpedia knowledge base ontology-based classification alone; 3) other combinations of heuristic rule-based classification, machine learning-based classification, and DBpedia knowledge base ontology-based classification.
In the present embodiment, articles are divided into four entity types; see Table 1, which gives example rules for the four named entity types.
Entity type | Category | Category patterns
PER | Occupation or status | [*]Actor, [*]Singer, [*]Writer
PER | Specific time | [BC]<TIME> births, [BC]<TIME> deaths
PER | Common concept | [+]Person, [+]Scientist
LOC | Region | [+]Country, [+]City, [+]Town
LOC | Natural places | [+]River, [+]Mountain, [*]Mountain range
LOC | Public places | [+]Square, [+]Museum, [+]Station
ORG | Companies | [+]Company, [+]Institution, [+]Enterprise
ORG | Political groups | [+]Political party
ORG | Other organizations | [+]University, [+]Club, [*]Publishing company
OTH |  | [*]Plant, [*]Animal, [*]Film
TABLE 1 Example rules for the four named entity types
In this embodiment, the classification based on heuristic rules determines the category of the article through wikipedia category, attribute box template, cross-language link, article title.
Wikipedia categories: the list of category patterns is built from predefined expressions. Four custom expressions are used in the category patterns: '[X]' denotes an optional character 'X'; '[*]' denotes any characters, occurring zero or more times; '[+]' denotes any characters, occurring at least once; and '<TIME>' denotes a time expression such as '1998' or '21st century'. Table 1 shows some representative example rules for Person (PER), Location (LOC), Organization (ORG), and Other (OTH, i.e., non-entity types).
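The four custom expressions map naturally onto regular expressions. The following is a minimal sketch, not the patent's implementation: the function names, the English example patterns in RULES, and the simplified TIME expression are all assumptions made for illustration.

```python
import re

# Assumed, simplified time expression: a year or "<n>th century".
TIME = r"(?:\d{1,4}|\d{1,2}(?:st|nd|rd|th) century)"

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a category pattern ([*], [+], [X], <TIME>) into a regex."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("[*]", i):        # any characters, zero or more
            out.append(".*"); i += 3
        elif pattern.startswith("[+]", i):      # any characters, at least one
            out.append(".+"); i += 3
        elif pattern.startswith("<TIME>", i):   # time expression
            out.append(TIME); i += 6
        elif pattern[i] == "[" and "]" in pattern[i:]:
            end = pattern.index("]", i)         # [X]: optional literal X
            out.append("(?:" + re.escape(pattern[i + 1:end]) + ")?")
            i = end + 1
        else:
            out.append(re.escape(pattern[i])); i += 1
    return re.compile("".join(out))

# Hypothetical rules in the spirit of Table 1 (English stand-ins).
RULES = [("PER", "[*]actors"), ("LOC", "[+]rivers"), ("PER", "[BC ]<TIME> births")]

def classify_category(category: str):
    """Return the entity type of the first rule that fully matches, else None."""
    for etype, pat in RULES:
        if pattern_to_regex(pat).fullmatch(category):
            return etype
    return None
```

Rules are tried in order and the first full match wins; a real rule list would need an explicit policy for categories that match several patterns.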
Attribute box (Infobox) template: an Infobox template is a special template that usually contains a condensed set of important facts about the entity represented by the article. Wikipedia editors generally use the same attribute box template for similar articles. For example, the attribute box template 'Infobox football biography' is very likely to appear in the Wikipedia article describing the person 'Lionel Messi'. Accordingly, Wikipedia articles that use the same attribute box template are determined to express the same entity type.
Cross-language links: Wikipedia is a multilingual public knowledge base covering up to 200 languages, and versions of the same Wikipedia article written in different languages are linked by special Wiki tags. Thus, a general rule can be defined as follows: for a Chinese Wikipedia article, find the corresponding English Wikipedia article, and determine the named entity type of the Chinese article from the named entity type expressed by the corresponding English article.
In one particular embodiment, Chinese articles of a non-entity type (i.e., entity type OTH) are identified through the capitalization conventions of the associated English articles, according to the following criteria: 1) if the title of the associated English article contains a temporal expression, the corresponding Chinese article is marked as OTH; 2) if, in the title of the associated English article, the number of words whose first character is lower case is greater than or equal to the number of words whose first character is upper case, the corresponding Chinese article is marked as OTH.
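The two criteria can be sketched as a small predicate. This is an illustrative sketch only: the function name and the temporal-expression regex (a three- or four-digit number, or the word "century") are assumptions, since the text does not say how temporal expressions are detected.

```python
import re

# Assumed cue for a temporal expression in an English title.
TIME_RE = re.compile(r"\b\d{3,4}\b|\bcentury\b", re.IGNORECASE)

def is_non_entity_title(english_title: str) -> bool:
    """Return True if the linked English title suggests the OTH type."""
    # Criterion 1: the title contains a temporal expression.
    if TIME_RE.search(english_title):
        return True
    # Criterion 2: lower-case-initial words are at least as many
    # as upper-case-initial words.
    words = [w for w in english_title.split() if w[0].isalpha()]
    lower = sum(1 for w in words if w[0].islower())
    upper = sum(1 for w in words if w[0].isupper())
    return lower >= upper
```

Because Wikipedia capitalizes the first letter of every title, a practical variant might ignore the first word; the sketch follows the criteria exactly as stated.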
Article title: the title of a Wikipedia article is a brief and unique phrase corresponding to a particular web page. In Chinese Wikipedia, the common noun at the end of an article title can serve as a good indicator of the named entity type, as in '[+] railway station' or '[+] bank'. In addition, about 10% of Chinese Wikipedia articles carry a bracketed expression in the title to resolve ambiguity between articles with the same name, such as "Beautiful Girls (1987 movie)"; such articles are classified by their titles.
In this example, C_X denotes the degree of membership in each of the four named entity classes (PER, ORG, LOC, OTH): if a Wikipedia article matches the relevant rule, its C_X value is set to 1. The total membership C_a of the article is calculated as the weighted sum over the four types of heuristic rules, and finally the entity type with the highest total membership value is selected as the category of the article.
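The weighted vote can be sketched as follows. The weights are hypothetical placeholders (the text does not give values), and the rule-kind names are labels chosen here for the four heuristic rule types.

```python
ENTITY_TYPES = ("PER", "ORG", "LOC", "OTH")

# Hypothetical weights for the four kinds of heuristic rule.
WEIGHTS = {"category": 0.4, "infobox": 0.3, "interlanguage": 0.2, "title": 0.1}

def total_membership(matches):
    """matches maps each rule kind to the entity type it matched, or None.

    A matched rule contributes C_X = 1 for its type, scaled by the rule weight.
    """
    totals = {t: 0.0 for t in ENTITY_TYPES}
    for kind, etype in matches.items():
        if etype is not None:
            totals[etype] += WEIGHTS[kind] * 1.0
    return totals

def classify(matches):
    """Pick the entity type with the highest total membership C_a."""
    totals = total_membership(matches)
    return max(totals, key=totals.get)
```

For example, an article whose category and infobox rules both indicate PER while the cross-language rule indicates ORG is classified as PER under these weights. Ties are broken by the order of ENTITY_TYPES, which is an arbitrary choice of this sketch.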
Classifying Wikipedia articles with heuristic rules produces a large number of misjudgments; for example, the article about 'smiths' is recognized as a person name because of its category 'smiths'. The classification results obtained from the heuristic rules are therefore reclassified with a machine learning method to reduce the misjudgments.
In this embodiment, the classification based on machine learning uses the first sentence, the structured information, the category label, and the title of the article as the basic feature set, and classifies the article according to the basic feature set.
First sentence of the article: the category of an article can often be inferred from its first sentence, for example: "Huo Qubing (140 BC-117 BC), a native of Pingyang County, Hedong Commandery, nephew of Empress Wei Zifu and of Wei Qing, elder half-brother of Huo Guang, was a famous general of the Western Han under Emperor Wu of Han who fought the Xiongnu." Based on this information, the TF-IDF (term frequency-inverse document frequency) value of each word in the article is calculated, and the top 2000 words, ranked from high to low by information gain, are retained as features.
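The TF-IDF weighting mentioned above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: documents are already tokenized (Chinese text would first need word segmentation), the function name is illustrative, and the subsequent information-gain ranking that keeps the top 2000 words is not shown.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {token: tf-idf} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return scores
```

A token that appears in every document gets weight 0, so only words that discriminate between articles carry signal.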
Structured information: this mainly comprises the section titles of the article, the type of the attribute box (Infobox), and the types of the relations and attributes of the triples in the attribute box. For example, section titles that support classification: "Biography", "Life and deeds"; attribute box type: "Infobox China County"; relation or attribute type in a triple: "founder".
Category labels: category information satisfying the following three conditions is selected as a feature: (1) it occurs more than twice; (2) the number of categories of articles comprising the category is less than or equal to 1; (3) among all categories, it ranks in the top 200 when sorted from high to low by variance over the corpus.
Title: the title of each article in Chinese Wikipedia provides the most direct information for classification, which is mainly captured by the following 6 types of features:
whether the title begins with a Chinese surname, which provides a clue for identifying Chinese person names;
whether the length is between 2 and 4, which provides a clue for identifying Chinese person names;
whether the title contains '·', which provides a clue for identifying foreign person names;
whether the title contains a GPE (geo-political entity), which provides a clue for identifying organization names;
whether the last word carries direct information, as in '[+] mountain';
the words in parentheses, such as "(movie)" or "(athlete)", where characters denoting a time, a number, or a GPE are converted to the corresponding symbols, e.g., "(Chinese capital)" becomes "(<GPE> capital)".
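The six title features could be extracted along these lines. Everything concrete here is an assumption for illustration: the tiny SURNAMES and GPE lists, the feature names, the use of the last character as a stand-in for the last word (no word segmentation is performed), and the `<NUM>` symbol used for digits.

```python
import re

# Tiny illustrative lists; a real system would use full gazetteers.
SURNAMES = {"王", "李", "张", "霍"}
GPE = {"中国", "北京", "广东"}

def title_features(title: str) -> dict:
    """Extract simple classification features from a Chinese Wikipedia title."""
    # Split off a trailing bracketed disambiguator, e.g. "X (Y)" or "X（Y）".
    m = re.match(r"^(.*?)\s*[（(]([^）)]*)[）)]$", title)
    base, bracket = (m.group(1), m.group(2)) if m else (title, None)
    if bracket is not None:
        for g in GPE:                       # normalize GPE mentions
            bracket = bracket.replace(g, "<GPE>")
        bracket = re.sub(r"\d+", "<NUM>", bracket)  # normalize numbers
    return {
        "has_surname": base[:1] in SURNAMES,
        "len_2_to_4": 2 <= len(base) <= 4,
        "has_middle_dot": "·" in base,
        "has_gpe": any(g in base for g in GPE),
        "last_char": base[-1:],
        "bracket": bracket,
    }
```

For instance, the title "北京 (中国首都)" yields the bracket feature "<GPE>首都", mirroring the "(<GPE> capital)" conversion described above.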
In this embodiment, the classification based on the DBpedia knowledge base ontology performs entity disambiguation on the entities described in the article using DBpedia knowledge base data.
In this embodiment, because existing work on classification based on the DBpedia knowledge base ontology (i.e., DBpedia entity disambiguation) for named entity recognition is limited to certain resource-rich languages such as English, entity disambiguation is performed on the entities described in Chinese Wikipedia using DBpedia knowledge base data.
DBpedia is a universal knowledge base that contains all the entity sets extracted from wikipedia articles. The DBpedia repository allows users to query the attributes and relationships of wikipedia resources in a semantic-based manner.
Each entity can be classified into an entity category in the DBpedia ontology by mapping between the attribute box of the Wikipedia article and the DBpedia knowledge base ontology class; in the wiki knowledge base, the linked entity terms in the sentence directly point to the corresponding target articles in the wiki knowledge base, and each target article corresponds to one entity in the DBpedia knowledge base.
DBpedia expresses the extracted information in RDF, where each piece of factual information is represented as a triple 'S (subject), P (predicate), O (object)', and the triple data can be accessed with the SQL-like query language SPARQL.
In a specific embodiment, to find the type of the entity 'Li Bai', the corresponding SPARQL query result is:
S: <http://dbpedia.org/resource/Li_Bai>
P: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
O: <http://www.w3.org/2002/07/owl#Thing>
<http://dbpedia.org/ontology/Agent>
<http://dbpedia.org/ontology/Person>
<http://dbpedia.org/ontology/Writer>
In this example, the entity 'Li Bai' is classified as, for example, 'Writer', 'Person' and 'Agent'. Thus, linked entities can be classified using the following SPARQL query:
SELECT DISTINCT ?t WHERE
{ <http://dbpedia.org/resource/Li_Bai>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?t }
Because the DBpedia knowledge base has no DBpedia ontology class data expressed in Chinese, a Chinese entity is labeled by following it to the corresponding entity in a DBpedia knowledge base of another language and taking the entity type found there. For example, the entity 'Spain' points to the corresponding entity in the German DBpedia knowledge base, which can be represented by the following RDF statement:
<http://zh.dbpedia.org/resource/Spain>
<http://www.w3.org/2002/07/owl#sameAs> <http://de.dbpedia.org/resource/Spanien>
In the German DBpedia knowledge base, the entity type of <http://de.dbpedia.org/resource/Spanien> is <http://dbpedia.org/ontology/Country>, so the type of the entity 'Spain' is 'Country'.
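The lookup with a cross-language fallback can be illustrated on a small in-memory list of triples. This is a sketch, not a DBpedia client: the function names and the TRIPLES list are assumptions, and a real system would send SPARQL queries to a DBpedia endpoint instead of scanning a Python list.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

# Toy triple store holding the Spain example from the text.
TRIPLES = [
    ("http://zh.dbpedia.org/resource/Spain", OWL_SAMEAS,
     "http://de.dbpedia.org/resource/Spanien"),
    ("http://de.dbpedia.org/resource/Spanien", RDF_TYPE,
     "http://dbpedia.org/ontology/Country"),
]

def objects(subject, predicate, triples=TRIPLES):
    """All objects O such that (subject, predicate, O) is in the store."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def entity_types(resource, triples=TRIPLES):
    """Types of a resource; fall back to owl:sameAs targets if it has none."""
    types = objects(resource, RDF_TYPE, triples)
    if types:
        return types
    found = []
    for other in objects(resource, OWL_SAMEAS, triples):
        found.extend(objects(other, RDF_TYPE, triples))
    return found
```

Calling entity_types on the Chinese resource returns the Country class through the German entry, which is the one-hop fallback described in the text.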
S102: the method comprises the steps of dividing an article into sentences to form sentence subsets, and labeling named entities of each sentence according to a link target of the article.
In this embodiment, the content of the article is divided into sentences by punctuation marks of the article to form a sentence subset.
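Splitting on sentence-final punctuation can be sketched in a few lines. The set of punctuation marks is an assumption (the text only says "punctuation marks"), and splitting on a zero-width lookbehind requires Python 3.7 or later.

```python
import re

def split_sentences(text: str):
    """Split text after Chinese or Western sentence-final punctuation."""
    parts = re.split(r"(?<=[。！？!?])", text)
    return [p.strip() for p in parts if p.strip()]
```

The punctuation mark stays attached to its sentence, which keeps character offsets within each sentence usable for later labeling.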
In this embodiment, the specific steps of labeling the named entities of each sentence according to the link targets of the article include: acquiring all outgoing links of the article, and constructing a candidate alias set according to those links; and constructing a trie (prefix tree) from the candidate alias set, searching the trie for all possible marks for the entities contained in each sentence, selecting the best mark, and labeling the named entity.
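A trie-based labeler can be sketched as follows. The text does not define how the "best" mark is selected; this sketch assumes leftmost-longest matching, and the function names and the "$type" sentinel key are illustrative choices.

```python
def build_trie(aliases):
    """aliases: {surface string: entity type}. Returns a nested-dict trie."""
    root = {}
    for alias, etype in aliases.items():
        node = root
        for ch in alias:
            node = node.setdefault(ch, {})
        node["$type"] = etype  # sentinel key marks the end of an alias
    return root

def label(sentence, trie):
    """Return (start, end, type) spans using leftmost-longest matching."""
    spans, i = [], 0
    while i < len(sentence):
        node, best = trie, None
        for j in range(i, len(sentence)):
            node = node.get(sentence[j])
            if node is None:
                break
            if "$type" in node:          # remember the longest match so far
                best = (i, j + 1, node["$type"])
        if best:
            spans.append(best)
            i = best[1]                  # continue after the matched alias
        else:
            i += 1
    return spans
```

Because the sentinel key is longer than one character, it cannot collide with the single-character keys used for trie edges.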
For a sentence in Wikipedia, if the article pointed to by a link string contained in the sentence has a determined entity type label, the link string can be labeled with the corresponding entity type (this case is called explicit labeling). Take the example sentence: "Wang Bo (650-676), courtesy name Zi'an, was a native of Longmen, Jiangzhou." The sentence contains the link strings "Jiangzhou" and "Longmen", each of which points to a place name entity, so "Jiangzhou" and "Longmen" can be labeled as "LOC (i.e., location)". However, Wikipedia only establishes a link for the first occurrence of an entity in an article, and does not add link marks for the same entity appearing at other positions in the article (this case is called implicit labeling). In order to make effective use of the text resources, the actual labeling process labels not only the first occurrence of an entity but also its occurrences at other positions, including aliases. Following this idea, an alias set is first constructed for the entity represented by each article, according to the following principles:
For a given article A, the titles of all articles that redirect to A are taken as aliases of A. Some entities in the Wikipedia knowledge base have no article of their own and are simply redirected to an article under another entity name; this mechanism is called "redirection". Redirect titles of Wikipedia articles are typically aliases expressing the same entity or concept.
In a specific embodiment, all strings that link to A more than 5 times are also aliases of A.
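The two alias principles can be sketched as one function. The data shapes (a redirect map and a list of anchor-text/target pairs) and the function name are assumptions, as is including the article's own title in the alias set.

```python
from collections import Counter

def build_aliases(article, redirects, anchor_links, min_links=5):
    """Collect aliases of `article`.

    redirects: {redirect title: target title}
    anchor_links: iterable of (anchor text, target title) pairs
    """
    aliases = {article}  # assumption: the title itself counts as an alias
    # Principle 1: every redirect title pointing at the article.
    aliases.update(t for t, target in redirects.items() if target == article)
    # Principle 2: anchor strings that link to the article more than min_links times.
    counts = Counter(text for text, target in anchor_links if target == article)
    aliases.update(text for text, c in counts.items() if c > min_links)
    return aliases
```

The resulting set can be fed, together with the article's entity type, into a trie for sentence labeling.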
Because the Wikipedia knowledge base carries structure and other label information, it can be used not only to automatically construct a corpus for named entity recognition, but also to automatically generate entity dictionaries of different categories or to expand existing entity dictionaries. The hyperlinks in Wikipedia articles connect the article of an input entity with the articles of other related entities, so that by following these links a large number of entities related to the input entity (i.e., the seed entity) can be obtained.
Thus, the entity dictionary can be extended according to the article of wikipedia in chinese.
In this embodiment, the step of performing named entity tagging on each sentence according to a link target of the article further includes: the entity dictionary is extended based on the articles.
The step of expanding the entity dictionary based on the article specifically comprises the following steps: retrieving articles corresponding to each seed entity based on predefined categories of the entity dictionary; marking the seed entity according to the fine-grained category of the article; and identifying potential named entities according to links contained in the articles corresponding to the seed entities, judging the categories of the potential named entities through a matching function, and determining whether the potential named entities can be used as entries of the entity dictionary according to a judgment result.
In a specific embodiment, the specific steps are as follows:
Retrieving the articles corresponding to the seed entities: based on an existing named entity dictionary of predefined categories (abbreviated 'pdc'), each seed entity in the dictionary is searched for in the Chinese Wikipedia knowledge base, and the article describing the entity is returned. This search may result in the following three cases:
(1) An article describing a single entity without any ambiguity is returned; in this case, the article is matched to the corresponding seed entity.
(2) No result is returned; in this case, fuzzy matching is performed with a 'leftmost longest match' rule, and the Wikipedia article closest to the superordinate concept is selected.
(3) For a polysemous seed entity, a disambiguation page is returned that lists the entities with different meanings and links to the corresponding article pages; in this case, the seed entity is discarded directly and does not proceed to the subsequent step of marking the seed entity.
Marking the seed entity: once the Wikipedia articles describing the seed entities are retrieved, the fine-grained categories extracted from the content of these articles, i.e., the superordinate concepts (abbreviated 'hpc'), are used to label the seed entities. Given the characteristics of the Wikipedia knowledge base text, reliable hpc can be obtained with the following two methods.
First sentence tagging method: for the Chinese Wikipedia knowledge base, the corresponding category label can be extracted from the first sentence of the Wikipedia article and used as the fine-grained category (i.e., hpc) of the entity; in addition, noun phrases connected by relational links are also considered hpc of the entity.
Category label method: Wikipedia articles are labeled with one or more category labels, and the category structure of the Wikipedia knowledge base is a set of generalized concepts organized as a hierarchy, usually a category tree, that expresses broader and narrower relationships between categories. Similar Wikipedia articles are labeled with the same category. Under this marking strategy, the labels of a Wikipedia article can be extracted and filtered, and the most suitable label selected as the hpc of the corresponding entity. The category label method covers the following three cases: (1) the category label of some Wikipedia articles is the same as the article title, in which case the category tree is traversed and the category label one level above is used as the label; (2) Wikipedia has labels assigned for administrative purposes, such as 'Wikipedia templates', which introduce noise and are filtered out with a manually constructed stop word list; (3) to further reduce noise, only hpc that label at least two seed entities are retained.
The foregoing classification step may generate a pool of hpcs, where each hpc is a generic concept of an input entity or a potential generic concept of a pdc, which may serve as a control vocabulary to guide the expansion of similarly named entities.
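The construction of the hpc pool, including the stop list and the "at least two seed entities" filter, might look like the sketch below. The function name `build_hpc_pool`, the stop list contents, and the sample data are assumptions for illustration.

```python
from collections import Counter

# Hypothetical manually built stop list of administrative labels.
STOP_LABELS = {"Wikipedia templates"}


def build_hpc_pool(seed_hpcs, min_seeds=2):
    """Keep only hpc that label at least `min_seeds` distinct seed entities."""
    counts = Counter()
    for hpcs in seed_hpcs.values():
        # Count each hpc once per seed entity and drop stop-listed labels.
        counts.update(set(hpcs) - STOP_LABELS)
    return {hpc for hpc, n in counts.items() if n >= min_seeds}


seed_hpcs = {
    "Guangzhou": ["city", "provincial capital", "Wikipedia templates"],
    "Shenzhen": ["city"],
    "Kunming": ["provincial capital", "Wikipedia templates"],
    "Foshan": ["prefecture-level city"],
}
pool = build_hpc_pool(seed_hpcs)
```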
Identifying potential named entities: potential named entities are identified from the links contained in the Chinese Wikipedia articles retrieved for the seed entities; that is, for each candidate named entity, it is determined whether its category belongs to the pdc. To this end, the method of the foregoing classification step is used to obtain the hpc of the candidate named entity, and each extracted hpc is compared with the hpc pool generated by the classification step by means of a matching function. If the match succeeds, the candidate entity, i.e., the title of the corresponding Chinese Wikipedia article, is added as an entry of the extended entity dictionary. The matching function can be defined simply as checking whether the hpc of the candidate entity exists in the hpc pool extracted from the seed dictionary.
The rationale of the extension step is as follows: if a candidate entity and a source entity share the same superordinate concept, then any concept above that concept, and ultimately the category of the source entity, should also apply to the candidate entity. In addition, in the Wikipedia knowledge base all redirect titles point to the same Wikipedia article; a redirect title is usually an alias of the entity or concept expressed, so each alias is also selected as an entity and added to the extended entity dictionary.
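A minimal sketch of the matching function and of the dictionary extension with redirect aliases is given below. The data structures (a mapping from candidate titles to their hpc, and a mapping from titles to redirect aliases) and all names are hypothetical.

```python
def matches_pool(candidate_hpcs, hpc_pool):
    """Matching function: does any hpc of the candidate exist in the hpc pool?"""
    return any(hpc in hpc_pool for hpc in candidate_hpcs)


def extend_dictionary(candidates, hpc_pool, redirects, pdc):
    """Add matching candidates and their redirect aliases under category `pdc`."""
    entries = {}
    for title, hpcs in candidates.items():
        if matches_pool(hpcs, hpc_pool):
            entries[title] = pdc
            for alias in redirects.get(title, []):
                entries[alias] = pdc  # redirect titles are treated as aliases
    return entries


entries = extend_dictionary(
    candidates={"Guangdong": ["province"], "Peony": ["flower"]},
    hpc_pool={"province", "city"},
    redirects={"Guangdong": ["Guangdong Province"]},
    pdc="LOC",
)
```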
S103: the sentences are screened to form a corpus.
Through the foregoing steps, sentences labeled with named entities are obtained; however, the quality of the labeled sentences varies, mainly for the following reasons:
Articles may be misclassified: for example, if an article whose entity concept is "revolutionary" is classified as a person name, then in any article containing an outgoing link to "revolutionary", all occurrences of the string "revolutionary" will be labeled as person names.
Links in articles are incomplete: suppose the article whose entity concept is "White Cloud Mountain" is correctly classified as 'LOC', but the editors of some other article did not create a link from the string "White Cloud Mountain" to that article; the occurrences of "White Cloud Mountain" in that other article then cannot be labeled as a location.
The articles in the knowledge base are incomplete: the Chinese Wikipedia knowledge base lacks descriptions of many entity concepts. For example, if "Black Cloud Mountain" is an entity concept that is not included in the Wikipedia knowledge base, then the string "Black Cloud Mountain" in the corpus can never be labeled as a location.
Therefore, the entity-labeled sentences need to be filtered to improve the quality of the sentences in the corpus.
In this embodiment, the step of screening the sentences to form a corpus specifically includes: screening the sentences to form the corpus according to an input instruction and/or a named entity recognition tool.
In this embodiment, the step of screening the sentences according to a named entity recognition tool specifically includes: checking the labels of each sentence with two different named entity recognition tools, and deciding, according to the recognition results, whether to modify the labels of the sentence and whether to place the sentence in the corpus.
The invention is further illustrated by the following specific examples.
Embodiment 1, classification of Chinese Wikipedia articles. When heuristic rules are applied on their own, data coverage is low and many results are wrong. This embodiment therefore uses a machine learning method combined with the classification results of the heuristic rules to rebuild the training set (the strategy is abbreviated 'HybridC'), in order to reduce the noise in the training set to a certain extent. The strategy is as follows: a heuristic-rule-based classifier and a machine-learning-based classifier are each applied to classify the Chinese Wikipedia articles, and an article is selected into the new sample set if and only if the two classification results agree and the predicted probability of the machine learning classifier is greater than 99%. Furthermore, the redirect information of the Chinese Wikipedia knowledge base is used to assign the corresponding label to the corresponding articles.
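The selection rule of the HybridC strategy can be sketched as a simple filter. The function name and the input shapes (a mapping from titles to rule labels, and a mapping from titles to a `(label, probability)` pair from the machine learning classifier) are assumptions for illustration.

```python
def select_hybridc_samples(rule_labels, ml_predictions, threshold=0.99):
    """Keep an article iff both classifiers agree and the ML probability exceeds the threshold."""
    selected = {}
    for title, rule_label in rule_labels.items():
        ml_label, probability = ml_predictions.get(title, (None, 0.0))
        if rule_label == ml_label and probability > threshold:
            selected[title] = rule_label
    return selected


samples = select_hybridc_samples(
    rule_labels={"A": "PER", "B": "LOC", "C": "ORG"},
    ml_predictions={"A": ("PER", 0.995), "B": ("LOC", 0.90), "C": ("PER", 0.999)},
)
```

Here only article "A" survives: "B" fails the probability threshold and "C" fails the agreement check.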
More specifically, in the present embodiment, an SVM (support vector machine) algorithm is used as the machine learning classifier, and the following numbers of samples are randomly selected as a training set from the articles labeled by the heuristic-rule classification method (see Table 2). To evaluate the performance of the SVM classifier, 444 articles were randomly selected from the Wikipedia articles and manually labeled as a test set.
Entity type   PER    LOC    ORG    OTH
Samples       5000   5000   5000   5000
TABLE 2 Distribution of training data for the SVM classifier
The results obtained on the test set using the SVM implementation in the scikit-learn library with default parameters are shown in Table 3, where 'Precision', 'Recall', and 'F1' (the harmonic mean of precision and recall) serve as the evaluation metrics for the performance of the classification model.
PER LOC ORG OTH TOTAL
Precision(%) 86.1 93.5 94.0 69.7 85.8
Recall(%) 90.3 76.3 81.7 89.2 84.4
F1(%) 88.2 84.1 87.4 78.3 84.5
TABLE 3 Classification result List of SVM classifiers
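As a quick check of how F1 relates to the other two rows, the harmonic mean can be recomputed from the precision and recall values in Table 3. The helper name `f1_score` below is illustrative and is not the scikit-learn function of the same name; the recomputed values agree with the F1 row of the table to within about 0.1, the difference being attributable to rounding of the published inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (%) per class, copied from Table 3.
TABLE_3 = {
    "PER": (86.1, 90.3),
    "LOC": (93.5, 76.3),
    "ORG": (94.0, 81.7),
    "OTH": (69.7, 89.2),
}
f1_by_class = {label: f1_score(p, r) for label, (p, r) in TABLE_3.items()}
```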
The results obtained with the HybridC classification strategy on the same training and test data are shown in Table 4. Comparing the results in Tables 3 and 4, it can be seen that the quality of the training samples obtained with the HybridC strategy is higher than that of the heuristic rule classifier alone. All Chinese Wikipedia articles are classified with the HybridC strategy, and the final sample set contains 430,000 articles.
PER LOC ORG OTH TOTAL
Precision(%) 85.0 94.8 93.1 74.2 86.8
Recall(%) 93.2 79.8 82.6 88.3 86.0
F1(%) 88.9 86.7 87.6 80.7 85.9
TABLE 4 classification results of HybridC-based Chinese Wikipedia articles
Embodiment 2, automatic auditing of the sentences labeled with named entities.
In the process of constructing the named-entity-labeled corpus, the quality of the sentence set obtained after named entity labeling is uneven, and manual review is usually required. However, manual auditing is expensive, so the present embodiment proposes a method of automatic auditing by machine.
Named entity recognition tools are used in place of human auditors. Two commonly used tools are selected in this implementation: the HIT LTP language cloud (https://www.ltp-cloud.com) and Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/). The following is used as an example sentence: "Michael Jordan, born on February 17, 1963 in Brooklyn, New York, is a former American professional basketball player who played shooting guard and is nicknamed 'flying man'." For convenience of description, the tools are hereinafter referred to as 'auditors'. The auditing process for the example sentence is as follows, where the first column is the generated label (i.e., the label to be audited), the second column is the result of the HIT LTP language cloud (auditor No. 1 for short), and the third column is the result of Stanford CoreNLP (auditor No. 2 for short):
If the target entity to be audited is labeled 'PER' (person), 'LOC' (location), or 'ORG' (organization): if either auditor supports the label (i.e., the label given by that auditor is the same as the label in the sentence), the label in the first column is considered correct; otherwise, the label in the first column is considered wrong and the sentence is discarded. For example, the labeling and auditing result for the string 'Michael' is "michael B _ persphere", where auditor No. 1 supports the label, so the label in the first column is considered correct even though auditor No. 2 does not support it.
If the target entity to be audited is labeled 'OTH', three cases apply: if the labels given by the two auditors are the same, the original label is replaced with that label; if the labels given by the two auditors differ, the sentence is considered too ambiguous to serve as corpus material and is discarded; for the newly added entity types (time, number, and unit of measurement), the entity is labeled with the new type as long as an auditor labels it with a new type, and if the labels of the two auditors conflict, the label of auditor No. 2 prevails.
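The auditing rules can be sketched as a single decision function. The source does not fully specify the precedence among the 'OTH' cases, so the sketch assumes that the rule for the newly added types (TIME, NUM, UNT) is checked first and that auditor No. 2 wins only when both auditors propose different new types; the function name `audit` and the label strings are illustrative.

```python
CORE_TYPES = {"PER", "LOC", "ORG"}
NEW_TYPES = {"TIME", "NUM", "UNT"}


def audit(generated, auditor1, auditor2):
    """Return the final label for the entity, or None to discard the sentence."""
    if generated in CORE_TYPES:
        # Support from either auditor is enough to keep the generated label.
        return generated if generated in (auditor1, auditor2) else None
    # The generated label is 'OTH' from here on.
    # Assumed precedence: new entity types first, auditor No. 2 preferred.
    for label in (auditor2, auditor1):
        if label in NEW_TYPES:
            return label
    if auditor1 == auditor2:
        return auditor1  # both auditors agree: adopt their label
    return None          # auditors disagree: sentence too ambiguous
```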
Through the above process, 850,000 high-quality sentences were screened out of the more than 4 million named-entity-labeled sentences obtained from the sample set of Embodiment 1. From these 850,000 sentences, the first 300,000 were selected as the corpus, and a named entity recognition model was trained with 10-fold cross-validation; the results are shown in Table 5:
Entity classes Precision(%) Recall(%) F1(%)
PER 93.92 92.04 92.97
ORG 87.72 84.02 85.31
LOC 96.43 96.00 96.22
Total 92.69 90.69 91.67
TABLE 5 Results of the named entity recognition model based on 10-fold cross-validation
Embodiment 3, classification of Chinese Wikipedia articles using the DBpedia ontology.
The languages with the most entity type data in the DBpedia knowledge base are shown in Table 6:
Figure BDA0002640396730000101
TABLE 6 The eight languages with the largest numbers of entities in DBpedia
The articles of the Chinese Wikipedia are classified as follows. Collecting entity type data: the collected data include the entity type data of the 8 languages shown in Table 6, the entity type data contained in Wikidata (https://www.wikidata.org/wiki/Wikidata:Main_Page), and the entity type data in Wikidata that point to the GeoNames (http://www.geonames.org/) database.
Collecting the RDF data in the Chinese DBpedia knowledge base that point to the DBpedia of other languages and to Wikidata, for example:
<http://zh.dbpedia.org/resource/Morgan.
<http://www.w3.org/2002/07/owl#sameAs>
<http://www.wikidata.org/entity/Q48337>
<http://wikidata.dbpedia.org/entity/Q48337>
<http://dbpedia.org/resource/Morgan_Freeman>
<http://es.dbpedia.org/resource/Morgan_Freeman>
For each entity type, all of its subtypes are found according to the relationships among the entity types and labeled accordingly. The Chinese DBpedia knowledge base has 1215474 Chinese entities with external links, of which 667345 have entity type labels, covering 653539 Chinese Wikipedia articles.
For each Chinese entity with external links, all the tags of the linked entities are added to it and converted into the corresponding entity type tags. Because a Chinese Wikipedia article may correspond to DBpedia knowledge bases in multiple languages, multiple entity type tags may be generated. Taking the entity 'juneberry' as an example, the corresponding entity types are labeled as follows:
"juneberry": Counter({"<http://dbpedia.org/ontology/OfficeHolder>": 2,
"<http://dbpedia.org/ontology/Person>": 2,
"<http://dbpedia.org/ontology/Politician>": 2,
"<http://dbpedia.org/ontology/President>": 1,
"<http://dbpedia.org/ontology/Scientist>": 2})
Then, by mapping to the entity types to be labeled (PER, LOC, ORG), the type of the entity becomes "juneberry": Counter({PER: 9}). For a Chinese Wikipedia article with several different entity types, this embodiment takes the most frequent one as its entity type label. In addition, entities of type http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing are directly labeled 'LOC'. All types other than the target entity types are labeled 'OTH'. Among the total of 653539 Chinese Wikipedia articles, the four entity types are distributed as follows: [(OTH: 210412), (PER: 162267), (LOC: 254498), (ORG: 26362)].
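The mapping and majority vote can be illustrated with `collections.Counter`. The mapping table below is a small hypothetical excerpt covering only the classes in the example, and the treatment of `SpatialThing` as an immediate override is one reading of "directly labeled 'LOC'"; tie-breaking between equally frequent labels is not specified in the source.

```python
from collections import Counter

DBO = "<http://dbpedia.org/ontology/"
SPATIAL_THING = "<http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing>"

# Hypothetical excerpt of the class-to-label mapping (illustration only).
TYPE_TO_LABEL = {
    DBO + "OfficeHolder>": "PER",
    DBO + "Person>": "PER",
    DBO + "Politician>": "PER",
    DBO + "President>": "PER",
    DBO + "Scientist>": "PER",
}


def vote(type_counts):
    """Aggregate type counts into target labels; unmapped types count as OTH."""
    votes = Counter()
    for uri, count in type_counts.items():
        votes[TYPE_TO_LABEL.get(uri, "OTH")] += count
    return votes


def final_label(type_counts):
    """SpatialThing is labeled LOC directly; otherwise the most frequent label wins."""
    if SPATIAL_THING in type_counts:
        return "LOC"
    votes = vote(type_counts)
    return votes.most_common(1)[0][0] if votes else "OTH"


example = Counter({
    DBO + "OfficeHolder>": 2, DBO + "Person>": 2, DBO + "Politician>": 2,
    DBO + "President>": 1, DBO + "Scientist>": 2,
})
```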
Through this embodiment, a training corpus of 1.37 million sentences is finally obtained. The first 300,000 sentences are selected from it, and a named entity recognition model is trained with 10-fold cross-validation; the results are shown in Table 7:
Precision(%) Recall(%) F1(%)
PER 93.75 91.94 92.84
ORG 87.41 83.41 85.34
LOC 96.61 96.17 96.39
Total 92.59 90.51 91.52
TABLE 7 result List of named entity recognition model applying DBpedia ontology
Compared with the corpus construction method used in Embodiment 2 (i.e., heuristic-rule-based classification + machine-learning-based classification + automatic auditing), this embodiment uses only the DBpedia knowledge base ontology, and the performance of the finally trained named entity recognition model differs little (compare Table 5 and Table 7). However, the corpus generated by this embodiment is much larger than that generated in Embodiment 2, which indicates to a certain extent that the method of this embodiment yields a corpus with wider coverage and can recall more sentences.
Embodiment 4, named entity labeling is performed on sentences in the sentence set.
In the process of constructing the named-entity-labeled corpus, the Wikipedia sentences to be labeled may contain no links. The processing is illustrated with the Wikipedia article page 'Jay Chou' (周杰伦), which describes the entity 'Jay Chou':
Find all outgoing links of the 'Jay Chou' article page and construct an alias set (including each entity's own name) of the strings that may be labeled,
such as {"China": ["People's Republic of China", "China", ...], "Taiwan": ["Taiwan", "Taiwan Province", ...]}.
In this example, if a Wikipedia sentence contains the string 'People's Republic of China', the string is mapped to the 'China' article page in Wikipedia.
A trie is built from the above alias set; this embodiment uses pyahocorasick (https://pyahocorasick.readthedocs.io/en/latest/).
The document is split into sentences, traditional Chinese characters are converted into simplified Chinese characters, and sentences longer than 6 characters are retained.
Query all possible labels with the trie, for example:
"Jay Chou was born in Taiwan Province, China." (周杰伦出生于中国台湾省。)
The example sentence above contains no links. Through the trie, 'Jay Chou', 'China', and 'Taiwan Province' can be found, where the two aliases 'Taiwan' and 'Taiwan Province' are synonymous; in this embodiment, the longest string is selected as the final label. Finally, the 'BIESO' coding scheme is applied to the example sentence ('B' marks the first character of an entity string, 'I' a middle character, 'E' the last character, 'S' a single-character entity, and 'O' a non-entity character), and the labeling result is as follows:
周 B_PER
杰 I_PER
伦 E_PER
出 O
生 O
于 O
中 B_LOC
国 E_LOC
台 B_LOC
湾 I_LOC
省 E_LOC
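The longest-match labeling above can be sketched in plain Python. This is a minimal illustration that scans a dictionary of aliases instead of using the pyahocorasick trie named in the text; the function name `label_bieso`, the alias-to-type mapping, and the Chinese form of the example sentence are assumptions made for illustration.

```python
def label_bieso(sentence, alias_types):
    """Label each character with BIESO tags, preferring the longest alias at each position."""
    labels = ["O"] * len(sentence)
    max_len = max(map(len, alias_types), default=0)
    i = 0
    while i < len(sentence):
        match = None
        # Try the longest candidate first so that '台湾省' wins over '台湾'.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + length] in alias_types:
                match = sentence[i:i + length]
                break
        if match is None:
            i += 1
            continue
        entity_type, n = alias_types[match], len(match)
        if n == 1:
            labels[i] = f"S_{entity_type}"
        else:
            labels[i] = f"B_{entity_type}"
            for k in range(i + 1, i + n - 1):
                labels[k] = f"I_{entity_type}"
            labels[i + n - 1] = f"E_{entity_type}"
        i += n
    return list(zip(sentence, labels))


# Hypothetical alias set derived from the outgoing links of one article.
ALIASES = {"周杰伦": "PER", "中国": "LOC", "台湾": "LOC", "台湾省": "LOC"}
tagged = label_bieso("周杰伦出生于中国台湾省。", ALIASES)
```

For large alias sets, an Aho-Corasick automaton finds all matches in a single pass over the sentence, which is presumably why the embodiment relies on a trie-based library rather than a scan like this one.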
Embodiment 5, an entity dictionary is extended by applying the first sentence tagging method.
In this embodiment, the entity dictionary is extended by applying the first sentence tagging method, and the rules for extending the entity dictionary are defined as follows:
in the first sentence of the Wikipedia article, the direct object of the copular verb ('is' or 'be') is taken as the entity concept. For example, in the sentence "Guangzhou, abbreviated as Sui or Guang and also known as the Provincial City, the City of Rams, etc., is the provincial capital of Guangdong Province.", 'provincial capital' is the direct object of 'is', so 'provincial capital' is chosen as the entity concept;
in the first sentence of the Wikipedia article, when the direct object of the copular verb is a special word such as 'a kind of' or 'one of', the first noun before the special word (in the Chinese word order) is taken as the entity concept. For example, in "Peony is a kind of flower.", 'flower' is selected as the entity concept;
this type-labeling approach is more practical and more extensible.
In the Wikipedia article, nouns in apposition in the first sentence can also be selected as entity concepts; for example, in 'Beijing, the capital of China', 'capital' is in apposition to 'Beijing' and is selected as the entity concept.
During the extraction of entity concept words, if several concepts are extracted, one article is allowed to correspond to several concept words. For an article with a redirect page, the entity concept expressed by the redirect target article is taken as the concept of the article. After applying the above rules, the dictionary constructed by the first sentence tagging method contains 1655402 entries in total, for example:
Bao'an Sports Center Station, station
Baozi Park Station, station
Sunjiang, Shucheng people
Grandchild officer
DMG, Inc
...
Embodiment 6, an entity dictionary is extended by applying the category label method.
Since the category information of a Wikipedia article largely corresponds to the category of the entity expressed by the article, this embodiment selects the last noun phrase of the category description text of a Chinese Wikipedia article as the category of the article. For example, the last noun phrase of the category text 'Chinese province' is 'province', so all Wikipedia articles carrying the category 'Chinese province' can be added to the dictionary in the form 'article title, province'. For instance, because the Chinese Wikipedia article 'Guangdong Province' carries the category label 'Chinese province', the entry 'Guangdong Province, province' can be added to the entity dictionary. The dictionary constructed by the category label method contains 300176 entries in total, for example:
milli-state city, famous city
Haozhou city, grade of land city
Lung-qi deficiency county and famous city
Lung-qi deficiency in the county and prefecture level city
...
Embodiment 7, apply DBpedia knowledge base ontology to construct dictionary.
In this embodiment, based on the labeling result obtained in embodiment 3, an entity dictionary is constructed by using the labeled entity information, and a total of 1204401 entity concepts are obtained, where the examples are as follows:
chuan bin, PER
Kunming machine tool shares of Shenji group, ORG
Vilamani, LOC
European Barbera Art center, LOC
...
Embodiment 8, verifying the optimization effect of the named entity recognition model introducing dictionary features.
After the corpus for named entity recognition has been obtained, this embodiment verifies that when the entity dictionaries obtained by the entity dictionary extension methods (Embodiments 5 and 6) are introduced as features of the feature vector, the Chinese named entity recognition model trained on this basis outperforms the baseline Chinese named entity recognition model (i.e., a model trained only on the named-entity-labeled training corpus, ignoring the entity dictionary).
Introducing an external entity dictionary into the construction of the named entity recognition model essentially means adding entity dictionary information to the training samples. This is done as follows: find the strings in each sentence that appear in the first column of the dictionary, together with their corresponding information in the dictionary; then assign entity dictionary features to each unit of each training sentence (with character-based features a unit is a single character; with word-based features a unit is a word).
This embodiment uses a character-feature-based SVM named entity recognition model to evaluate the effect of introducing an external entity dictionary into the construction of the named entity recognition model. The feature set of this model comprises: the character itself, the one character before and after it, the two characters before and after it, and the output labels of the two preceding characters. The SVM named entity recognition model using this feature set is called the 'baseline' model for short. The baseline feature set is combined with the features of different dictionaries to construct different 'baseline + dictionary' models, which are compared with the baseline model to verify that introducing an external entity dictionary can improve the performance of the named entity recognition model.
The corpus used to train the models consists of 50000 Chinese Wikipedia sentences obtained by applying the corpus construction method provided by the invention, of which 90% is used as the training set and the remaining 10% as the test set; 10-fold cross-validation is carried out on the training set. Among the named entity recognition models to be constructed, the model that adds the dictionary of Embodiment 5 to the baseline is abbreviated 'baseline + FSC', the model that adds the dictionary of Embodiment 6 is abbreviated 'baseline + CC', and the model that adds the dictionary of Embodiment 7 is abbreviated 'baseline + DBO'. The results of the 4 named entity recognition models on the test set are shown in Table 8, where 'NUM' denotes number, 'UNT' denotes unit of measurement, 'TIME' denotes time, 'Total(all)' is the average over all entity types, and 'Total(lop)' is the average over 'LOC', 'ORG', and 'PER'.
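The character-level feature set, optionally extended with a dictionary feature, can be sketched as follows. The feature names, the padding symbol, and the shape of `dict_tags` (one dictionary tag per character) are assumptions for illustration; the source does not describe how the features are encoded for the SVM.

```python
PAD = "<PAD>"  # assumed placeholder for positions outside the sentence


def char_features(chars, labels_so_far, i, dict_tags=None):
    """Baseline features for position i; adds a dictionary feature if dict_tags is given."""
    def char_at(k):
        return chars[k] if 0 <= k < len(chars) else PAD

    def label_at(k):
        return labels_so_far[k] if 0 <= k < len(labels_so_far) else PAD

    features = {
        "c0": char_at(i),                                # the character itself
        "c-1": char_at(i - 1), "c+1": char_at(i + 1),    # one character before/after
        "c-2": char_at(i - 2), "c+2": char_at(i + 2),    # two characters before/after
        "y-1": label_at(i - 1), "y-2": label_at(i - 2),  # labels of the two preceding characters
    }
    if dict_tags is not None:
        features["dict"] = dict_tags[i]                  # 'baseline + dictionary' variant
    return features


baseline = char_features(list("广东省"), ["B_LOC", "I_LOC"], 2)
with_dict = char_features(list("广东省"), ["B_LOC", "I_LOC"], 2, dict_tags=["province"] * 3)
```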
Figure BDA0002640396730000131
Figure BDA0002640396730000141
Figure BDA0002640396730000151
TABLE 8 comparison List of different named entity recognition model Performance
As is evident from the results in Table 8, the named entity recognition models that introduce an external entity dictionary perform better than the baseline model. The results clearly show that introducing an external entity dictionary can indeed improve the performance of the named entity recognition model. It should be particularly noted that the entity dictionaries introduced in this embodiment are generated automatically in the process of constructing the named-entity-labeled corpus and are a 'by-product' of the corpus construction method provided by the invention; no large labor cost needs to be spent on constructing an entity dictionary from scratch.
Embodiment 9, named entity recognition model evaluation.
In this embodiment, a Chinese named entity recognition model is trained on a combination of the following three named-entity-labeled corpora, with external entity dictionary information incorporated into the combined corpus, and its recognition performance is compared on a common test set with that of the existing named entity recognition tools Stanford NER (https://nlp.stanford.edu/ner/) and HIT LTP (http://www.ltp-cloud.com/). The corpora used in this embodiment are:
1. The People's Daily part-of-speech tagged corpus: this corpus consists of news text from the People's Daily in 1998, with 22722 sentences tagged with parts of speech in total. The part-of-speech tags for persons, places, organizations, and time are 'nr', 'ns', 'nt', and 't', respectively. In the person tags, the family name and the given name are tagged separately; time expressions are tagged separately by year, month, and day; and entities expressed by some compound noun phrases are tagged in segments.
2. The SIGHAN Bakeoff 2006 NER Task corpus: this corpus is derived from part of the data used in the NER task of the 3rd International Chinese Language Processing Bakeoff and contains 46364 sentences in total. Persons, places, and organizations are tagged with 'nr', 'ns', and 'nt', respectively.
3. The Chinese Wikipedia corpus labeled by the method for constructing a named-entity-labeled corpus provided by the invention, from which 50000 sentences are selected.
The first two public corpora require some text preprocessing and are converted into NER corpora labeled with the 'BIESO' coding scheme. The three corpora are merged into one corpus, in which the training set consists of the first 90% of each of the three corpora and the test set consists of the last 10% of each. A Chinese named entity recognition model (abbreviated 'aawantNER') is trained on the constructed training set, and its performance on the test set is shown in Table 9. The supported entity categories comprise six basic entity types: persons (PER), places (LOC), organizations (ORG), time (TIME), numbers (NUM), and units of measurement (UNT). 'Total(all)' is the average over the six entity types, and 'Total(lop)' is the average over 'LOC + ORG + PER'.
Entity classes Precision(%) Recall(%) F1(%)
LOC 90.24 91.51 90.87
NUM 85.42 83.05 84.22
ORG 81.83 73.50 77.44
PER 96.59 94.77 95.67
UNT 79.00 78.24 78.62
TIME 93.33 90.90 92.10
Total(all) 87.74 85.33 86.49
Total(lop) 89.55 86.59 87.99
TABLE 9 recognition Effect of training NER model on the three corpora after merging
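The 'Total' rows of Table 9 are consistent with unweighted (macro) averages of the per-class rows, which can be verified with a few lines of Python. The helper name `macro_average` is illustrative; the per-class numbers are copied from Table 9.

```python
# (Precision, Recall, F1) in percent per entity class, copied from Table 9.
TABLE_9 = {
    "LOC": (90.24, 91.51, 90.87),
    "NUM": (85.42, 83.05, 84.22),
    "ORG": (81.83, 73.50, 77.44),
    "PER": (96.59, 94.77, 95.67),
    "UNT": (79.00, 78.24, 78.62),
    "TIME": (93.33, 90.90, 92.10),
}


def macro_average(rows, classes):
    """Unweighted mean of each metric column over the selected classes."""
    selected = [rows[c] for c in classes]
    return tuple(sum(column) / len(selected) for column in zip(*selected))


total_all = macro_average(TABLE_9, list(TABLE_9))           # six entity classes
total_lop = macro_average(TABLE_9, ["LOC", "ORG", "PER"])   # LOC + ORG + PER
```

Rounded to two decimals, both tuples reproduce the 'Total(all)' and 'Total(lop)' rows of the table.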
To further evaluate the 'aawantNER' model in this embodiment, news items of three categories (football, current affairs, and other) were arbitrarily selected from the news domain, 204 in total and 68 per category. The title and the first sentence of each news item were taken to form a test set, and the entities they contain were labeled manually as 'ground truth' tags, in order to compare 'aawantNER' with two other mainstream NER tools from China and abroad, namely 'HIT LTP' and 'Stanford NER'. The test results, measured by the 'F1' metric (the larger the value, the better the performance), are shown in Tables 10, 11, and 12, respectively, where 'Total(all)' is the average over the six entity classes, 'Total(lopt)' is the average over 'LOC + ORG + PER + TIME', and 'Total(lop)' is the average over 'LOC + ORG + PER'. Of the three NER models compared, the highest score is indicated in bold.
Current affairs news StanfordNER HIT LTP aawantNER
Total(all) 51.36% 53.19% 62.61%
Total(lopt) 51.89% 53.88% 69.44%
Total(lop) 48.87% 49.97% 70.22%
TABLE 10 NER model Performance comparison List A
Football news StanfordNER HIT LTP aawantNER
Total(all) 42.03% 44.97% 52.47%
Total(lopt) 44.96% 47.88% 62.05%
Total(lop) 45.28% 49.59% 63.69%
TABLE 11 NER model Performance comparison List B
Other news StanfordNER HIT LTP aawantNER
Total(all) 57.80% 51.31% 60.95%
Total(lopt) 56.91% 45.50% 64.47%
Total(lop) 55.42% 40.32% 64.75%
TABLE 12 NER model Performance comparison List C
From the results in Tables 10 to 12, it can be seen that the aawantNER model clearly outperforms the other two mainstream NER tools. This indirectly confirms that the labeled corpus obtained by the construction method provided by the invention is of reliable quality and sufficient to support the Chinese named entity recognition task. More importantly, the corpus for Chinese named entity recognition provided by the invention is constructed automatically, making effective use of public external knowledge sources and dispensing entirely with time-consuming and laborious manual labeling.
The corpus source selected by the invention represents the largest and most professional online encyclopedia knowledge base at present, which means that the entity types covered by the corpus source are richer and more diversified; the whole construction and labeling process is realized automatically, and expensive manual labeling cost is not required to be consumed; the quality of the obtained corpus labeled by the named entities is reliable and robust;
the 'by-product' -entity dictionary generated in the labeling process can be used as a feature to be introduced into the construction of the named entity recognition model based on machine learning so as to improve the performance of the Chinese named entity recognition model.
Beneficial effects: the method for constructing a corpus labeled with Chinese named entities can construct such a corpus from external knowledge sources such as the Chinese Wikipedia knowledge base and the DBpedia knowledge base, without manually constructing and labeling text data, so it takes little time and has a low cost. Because the Chinese Wikipedia knowledge base covers a wide range of topics and domains, a named entity recognition model trained on the corpus depends less on domain-specific data, which reduces the difficulty of modifying and porting the model.
Based on the same inventive concept, the present invention further provides a device for constructing a corpus labeled with Chinese named entities. Referring to fig. 3, which is a structural diagram of an embodiment of the device, the device is described in detail below.
The device for constructing the corpus labeled by the Chinese named entities comprises a processor and a memory, wherein the processor is coupled with the memory;
the memory stores program data that is executed by the processor to implement the method for constructing a corpus of chinese named entity annotations as described in the above embodiments.
In this embodiment, the apparatus for constructing the corpus of the chinese named entity annotations may be a computer, a server, a cloud server, a tablet computer, a mobile phone, and other physical intelligent devices or virtual terminals.
Beneficial effects: the device for constructing a corpus labeled with Chinese named entities can construct such a corpus from external knowledge sources such as the Chinese Wikipedia and DBpedia knowledge bases, without manually constructing and labeling text data, so it takes little time and has a low cost. Because the Chinese Wikipedia covers a wide range of topics and domains, a named entity recognition model trained on the corpus depends less on domain-specific data, which reduces the difficulty of modifying and porting the model.
The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.

Claims (10)

1. A method for constructing a corpus annotated with Chinese named entities, characterized by comprising the following steps:
S101: obtaining articles from Chinese Wikipedia and classifying the articles into different entity types, wherein the classification method combines classification based on heuristic rules, classification based on machine learning, and classification based on the DBpedia knowledge base ontology;
S102: splitting each article into sentences to form a sentence set, and labeling each sentence with named entities according to the link targets of the article;
S103: screening the sentences to form a corpus.
2. The method for constructing a corpus annotated with Chinese named entities according to claim 1, wherein the heuristic-rule-based classification determines the category of an article from its Wikipedia categories, infobox templates, cross-language links, and article title.
3. The method of claim 1, wherein the machine-learning-based classification uses the first sentence, structured information, category labels, and title of an article as a basic feature set, and classifies the article according to that basic feature set.
4. The method for constructing a corpus annotated with Chinese named entities according to claim 1, wherein the classification based on the DBpedia ontology disambiguates the entities described in the articles using DBpedia knowledge base data.
5. The method for constructing a corpus annotated with Chinese named entities according to claim 1, wherein the step of labeling each sentence with named entities according to the link targets of the article specifically comprises:
acquiring all external links of the article, and constructing a candidate alias set according to the external links;
constructing a trie from the candidate alias set, searching the sentences in the sentence set for matching marks according to the trie, and labeling the named entities of each sentence according to the marks.
6. The method of claim 1, wherein the step of screening the sentences to form a corpus specifically comprises:
screening the sentences to form the corpus according to an input instruction and/or a named entity recognition tool.
7. The method of claim 6, wherein the step of screening the sentences to form the corpus according to the named entity recognition tool specifically comprises:
identifying the labels of a sentence with two different named entity recognition tools, and judging, according to the recognition results, whether to modify the labels of the sentence and whether to place the sentence into the corpus.
8. The method of claim 1, wherein the step of labeling each sentence with named entities according to the link targets of the article further comprises:
expanding an entity dictionary based on the articles.
9. The method of claim 8, wherein the step of expanding the entity dictionary based on the articles specifically comprises:
retrieving articles corresponding to each seed entity based on predefined categories of the entity dictionary;
marking the seed entity according to the fine-grained category of the article;
and identifying potential named entities according to links contained in the articles corresponding to the seed entities, judging the categories of the potential named entities through a matching function, and determining whether the potential named entities can be used as entries of the entity dictionary according to a judgment result.
10. A device for constructing a corpus annotated with Chinese named entities, characterized by comprising a processor and a memory, wherein the processor is coupled to the memory;
the memory stores program data that is executed by the processor to implement the method for constructing a corpus annotated with Chinese named entities according to any one of claims 1-9.
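
The trie-based labeling step recited in claim 5 can be illustrated with a minimal sketch. It is not the applicant's implementation: the character-level BIO tag scheme, the longest-match rule, the names `Trie` and `label_sentence`, and the sample alias set are all assumptions made for illustration, since the claims do not specify them.

```python
from typing import Dict, List, Optional, Tuple


class Trie:
    """Character trie over candidate aliases (hypothetical helper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.label: Optional[str] = None  # entity type if an alias ends here

    def insert(self, alias: str, label: str) -> None:
        node = self
        for ch in alias:
            node = node.children.setdefault(ch, Trie())
        node.label = label

    def longest_match(self, text: str, start: int) -> Tuple[int, Optional[str]]:
        """Return (end index, label) of the longest alias starting at `start`."""
        node: Optional[Trie] = self
        best_end, best_label = start, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.label is not None:
                best_end, best_label = i + 1, node.label
        return best_end, best_label


def label_sentence(sentence: str, trie: Trie) -> List[Tuple[str, str]]:
    """Tag each character with a BIO label (assumed scheme) via longest match."""
    tags = ["O"] * len(sentence)
    i = 0
    while i < len(sentence):
        end, label = trie.longest_match(sentence, i)
        if label is None:
            i += 1
            continue
        tags[i] = "B-" + label
        for j in range(i + 1, end):
            tags[j] = "I-" + label
        i = end
    return list(zip(sentence, tags))


# Hypothetical candidate alias set; in the method it would come from article links.
aliases = {"北京": "LOC", "北京大学": "ORG", "马云": "PER"}
trie = Trie()
for alias, label in aliases.items():
    trie.insert(alias, label)

tagged = label_sentence("马云访问北京大学", trie)
for char, tag in tagged:
    print(char, tag)
```

Longest match is chosen here so that a nested alias such as 北京 does not pre-empt the longer alias 北京大学; a real system might resolve such overlaps differently.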

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010838001.9A CN112182204A (en) 2020-08-19 2020-08-19 Method and device for constructing corpus labeled by Chinese named entities

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010838001.9A CN112182204A (en) 2020-08-19 2020-08-19 Method and device for constructing corpus labeled by Chinese named entities

Publications (1)

Publication Number Publication Date
CN112182204A true CN112182204A (en) 2021-01-05

Family

ID=73919465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010838001.9A Pending CN112182204A (en) 2020-08-19 2020-08-19 Method and device for constructing corpus labeled by Chinese named entities

Country Status (1)

Country Link
CN (1) CN112182204A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101724398B1 (en) * 2016-01-07 2017-04-18 서강대학교산학협력단 A generation system and method of a corpus for named-entity recognition using knowledge bases
CN108520065A (en) * 2018-04-12 2018-09-11 苏州大学 Name construction method, system, equipment and the storage medium of Entity recognition corpus
CN108984661A (en) * 2018-06-28 2018-12-11 上海海乂知信息科技有限公司 Entity alignment schemes and device in a kind of knowledge mapping
CN109271530A (en) * 2018-10-17 2019-01-25 长沙瀚云信息科技有限公司 A kind of disease knowledge map construction method and plateform system, equipment, storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐志浩: ""基于维基百科的中文命名实体语料库构建研究"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》, no. 02, pages 138 - 4480 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599903A (en) * 2021-07-07 2023-01-13 腾讯科技(深圳)有限公司(Cn) Object tag obtaining method and device, electronic equipment and storage medium
CN115599903B (en) * 2021-07-07 2024-06-04 腾讯科技(深圳)有限公司 Object tag acquisition method and device, electronic equipment and storage medium
CN116205235A (en) * 2023-05-05 2023-06-02 北京脉络洞察科技有限公司 Data set dividing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210826

Address after: No. 112, 207 Dunhe Road, Haizhu District, Guangzhou, Guangdong 510000

Applicant after: Hu Lin

Address before: Room 2, 112, 207 Dunhe Road, Haizhu District, Guangzhou, Guangdong 510000

Applicant before: Guangdong Huiyin Trading Co.,Ltd.