CN110555206A - named entity identification method, device, equipment and storage medium - Google Patents


Info

Publication number
CN110555206A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810556103.4A
Other languages
Chinese (zh)
Inventor
温海娇
陈虹
牛国扬
董修岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201810556103.4A priority Critical patent/CN110555206A/en
Priority to PCT/CN2019/089325 priority patent/WO2019228466A1/en
Publication of CN110555206A publication Critical patent/CN110555206A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a named entity recognition method, apparatus, device, and storage medium, relating to natural language processing, semantic analysis and understanding, artificial intelligence, and related fields. The method comprises: performing entity recognition on new-field text data to obtain new-field seed entity words; labeling the new-field text data according to the new-field seed entity words to obtain labeled new-field text data; training a named entity recognition model with the labeled new-field text data to obtain a named entity recognition model suitable for the new field; and recognizing entity words in other text data of the new field with that model. Embodiments of the invention can reduce the data-labeling workload, lower the threshold for model migration training, and improve the cross-field generality of the algorithm.

Description

Named entity identification method, device, equipment and storage medium
Technical Field
The present invention relates to natural language processing, semantic analysis and understanding, artificial intelligence, and related fields, and in particular to a named entity recognition (NER) method, apparatus, device, and storage medium.
Background
Named entity recognition is a basic task of natural language processing (NLP) and one of the key technologies in information extraction; its targets are the proper nouns of a field. In the general field these are mainly person names, place names, organization names, and the like. In a specific field they are mainly the field's own terms, such as "credit card" and "debit card" in the banking field.
The prior art falls into three categories. First, dictionary- and rule-based methods, which depend on constructing dictionaries and rules and are severely limited when handling new words and new fields. Second, statistical methods, which depend on manual feature selection and require a large amount of labor and time. Third, deep-learning methods, which reduce the workload of manual feature selection. Statistical and deep-learning methods perform well on the named entity recognition task, but they have two disadvantages in practice: 1) they need a large amount of labeled data, so the manual workload is heavy; 2) the models transfer poorly between fields, so a large-scale data set must be re-labeled whenever the field changes.
Disclosure of Invention
Embodiments of the invention provide a named entity recognition method, apparatus, device, and storage medium that address the heavy data-labeling workload and the difficulty of field migration.
A named entity recognition method provided by an embodiment of the invention comprises the following steps:
performing entity recognition on new-field text data with a seed-entity-word mining algorithm to obtain new-field seed entity words;
labeling the new-field text data according to the new-field seed entity words to obtain labeled new-field text data;
training a named entity recognition model with the labeled new-field text data to obtain a named entity recognition model suitable for the new field; and
recognizing entity words in other text data of the new field with the named entity recognition model suitable for the new field.
A named entity recognition apparatus provided by an embodiment of the invention comprises:
an entity recognition module, configured to perform entity recognition on new-field text data with a seed-entity-word mining algorithm to obtain new-field seed entity words;
a text labeling module, configured to label the new-field text data according to the new-field seed entity words to obtain labeled new-field text data;
a model training module, configured to train a named entity recognition model with the labeled new-field text data to obtain a named entity recognition model suitable for the new field; and
a model application module, configured to recognize entity words in other text data of the new field with the named entity recognition model suitable for the new field.
According to an embodiment of the present invention, a named entity recognition device is provided, which includes: a processor, and a memory coupled to the processor; the memory stores a named entity recognition program executable on the processor, and the program, when executed by the processor, implements the steps of the named entity recognition method.
According to an embodiment of the present invention, a computer storage medium is provided, on which a named entity recognition program is stored, and the named entity recognition program implements the steps of the named entity recognition method when executed by a processor.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
According to the embodiment of the invention, by mining the seed entity words in the new field and labeling the text data in the new field, the data labeling workload is reduced, the model migration training threshold is reduced, and the field universality of the algorithm is improved.
Drawings
Fig. 1 is a flowchart of a named entity recognition method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a named entity recognition apparatus according to an embodiment of the present invention;
Fig. 3 is a block diagram of a named entity recognition device according to an embodiment of the present invention;
Fig. 4 is an architecture diagram of an entity recognition system according to an embodiment of the present invention;
Fig. 5 is an architecture diagram of an automated entity recognition system according to an embodiment of the present invention;
Fig. 6 is a flowchart of mining seed entity words by new word discovery according to an embodiment of the present invention;
Fig. 7 is a flowchart of a new word discovery algorithm according to an embodiment of the present invention;
Fig. 8 is a flowchart of mining seed entity words by sentence patterns according to an embodiment of the present invention;
Fig. 9 is a flowchart of a sentence pattern mining algorithm according to an embodiment of the present invention;
Fig. 10 is a diagram of the field concept graph structure according to an embodiment of the present invention;
Fig. 11 is a flowchart of automatic corpus labeling according to an embodiment of the present invention;
Fig. 12 is a flowchart of semi-automated entity recognition for a new field only, according to Example 1 of the present invention;
Fig. 13 is a flowchart of semi-automated entity recognition combining an existing field and a new field, according to Example 2 of the present invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the preferred embodiments described below serve only to illustrate and explain the present invention and are not to be construed as limiting it.
Fig. 1 is a flowchart of a named entity recognition method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step S101: perform entity recognition on the new-field text data to obtain new-field seed entity words.
The new field is a field in which entity words have not been mined, or have not been mined sufficiently; that is, a field whose entity words are unlabeled or scarce.
As a first way of implementing step S101, the new-field text data may be split into new-field clauses; new words in each clause are then determined according to the allowable length of a new-field entity word, and the new words are filtered according to their relevance to the new field to obtain the new-field seed entity words. In other words, entity recognition is performed on the new-field text data with a seed-entity-word mining algorithm, specifically one that determines new-field seed entity words from discovered new words (for example, the Nagao algorithm).
As an embodiment, determining new words in each new-field clause according to the allowable length of a new-field entity word includes: for each new-field single sentence, counting the phrases in the sentence that satisfy the allowable length, and filtering the phrases according to their features to obtain new words.
The allowable length of a new-field entity word may be any length not exceeding the preset maximum length of a new-field entity.
The features of a phrase may be its word frequency, part of speech, and so on.
Filtering the phrases means filtering with the features and empirical thresholds. For example, a phrase whose word frequency exceeds a known empirical word frequency is taken as a new word; alternatively, a weighted average feature value, computed from the value and weight of each feature, is compared with a known empirical value to decide whether the phrase is a new word.
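The counting and word-frequency filtering described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `candidate_phrases`, the minimum phrase length of two characters, and the sample clauses and threshold are all assumptions made for the example.

```python
from collections import Counter

def candidate_phrases(clauses, max_len, freq_threshold):
    """Count every phrase of 2..max_len characters and keep the frequent ones."""
    counts = Counter()
    for clause in clauses:
        for start in range(len(clause)):
            # Phrase length is capped by the longest allowed entity length.
            for length in range(2, max_len + 1):
                if start + length > len(clause):
                    break
                counts[clause[start:start + length]] += 1
    # Keep phrases whose word frequency exceeds the empirical threshold.
    return {p: n for p, n in counts.items() if n > freq_threshold}

# Hypothetical new-field clauses.
clauses = ["信用卡怎么办理", "信用卡如何还款", "借记卡怎么办理"]
new_words = candidate_phrases(clauses, max_len=3, freq_threshold=1)
```

Frequency alone still lets through fragments such as "用卡", which is why the text also lists other features for filtering.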
As an embodiment, filtering the new words according to their relevance to the new field includes: determining a relevance score for each new word with respect to the new field, using a field concept graph that represents the relationships between fields together with the term frequency-inverse document frequency (TF-IDF) algorithm, and filtering the new words against an empirical threshold so that the new words whose relevance score exceeds the threshold become the new-field seed entity words. The relevance score is determined as follows: the relevance weights between the new field and other fields are obtained from the field concept graph; a probability score representing the importance of the new word to the new field is determined with TF-IDF; and the relevance score of the new word with respect to the new field is then determined from the relevance weights and the probability score.
The field concept graph organizes and represents the relationships between fields, such as hypernym-hyponym (upper-lower) relationships; it is a graphical representation of those relationships.
TF-IDF is an existing algorithm. It is used here to filter out words common to all fields and to retain the words important to the new field as its seed entity words. That is, a new word may also appear in the text data of other fields; within a text data set composed of the new field's text data and other fields' text data, the algorithm determines how important the new word is to the new field's text data, which makes it possible to filter the new words.
The other fields may be any fields outside the new field's upper-level field; for example, if the upper-level field is finance, the other fields may be telecom operators, science and technology, and so on.
This approach is suitable for scenarios in which only new-field text data is available.
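The source does not state how the relevance weights and the TF-IDF probability score are combined, so the sketch below shows only one possible combination, chosen for illustration: each field's corpus is treated as one document, and an occurrence of the word in a closely related field is penalised less than an occurrence in an unrelated field. The function name, the `relatedness` weights, the corpora, and the threshold are all hypothetical.

```python
import math

def relevance_score(word, new_field, corpora, relatedness):
    """corpora: field -> list of clauses; relatedness: other field -> weight in [0, 1]."""
    clauses = corpora[new_field]
    # Term frequency of the word inside the new field's text.
    tf = sum(c.count(word) for c in clauses) / max(1, sum(len(c) for c in clauses))
    # Weighted document frequency (an assumption): a field that also contains
    # the word contributes (1 - relatedness), so unrelated fields penalise more.
    penalty = sum(
        1.0 - relatedness.get(field, 0.0)
        for field, other in corpora.items()
        if field != new_field and any(word in c for c in other)
    )
    idf = math.log((1 + len(corpora)) / (1 + penalty))
    return tf * idf

corpora = {
    "bank_a": ["信用卡怎么办理", "信用卡如何还款"],   # the new field
    "bank_b": ["信用卡年费多少"],                     # same upper-level field
    "telecom": ["宽带怎么办理"],                      # different upper-level field
}
relatedness = {"bank_b": 0.9, "telecom": 0.1}         # hypothetical graph weights
candidates = ["信用卡", "办理"]
scores = {w: relevance_score(w, "bank_a", corpora, relatedness) for w in candidates}
seeds = [w for w in candidates if scores[w] > 0.1]    # empirical threshold
```

Under these assumed weights, "信用卡" is kept because it is shared only with a closely related field, while "办理" is dropped because it also appears in an unrelated field.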
As a second way of implementing step S101, the new-field text data and existing-field text data may be split into new-field single sentences and existing-field single sentences respectively; sentence pattern templates are then generated from the existing-field single sentences, and the new-field seed entity words in the new-field single sentences are determined by matching those sentences against the templates. In other words, entity recognition is performed on the new-field text data with a seed-entity-word mining algorithm, specifically one that mines new-field seed entity words with sentence pattern templates.
As an embodiment, the sentence pattern templates include a first sentence pattern template and a second sentence pattern template, and generating the templates from the existing-field single sentences and existing-field entity words includes: replacing the existing-field entity words in each existing-field single sentence with a preset entity-word mining symbol to obtain the first sentence pattern template, and replacing words or phrases in the first template with synonyms or synonymous phrases to obtain the second sentence pattern template. The first and second templates are seed sentence pattern templates. As another embodiment, the templates may further include a third sentence pattern template derived from the seed templates; the derivation may be implemented with, for example, a self-expansion technique (the Bootstrapping algorithm).
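A minimal sketch of generating the first and second (seed) templates is given below. The function name `seed_templates`, the synonym table, and the sample sentences are assumptions for illustration; the `[E]` symbol follows the mining symbol used later in the description. The Bootstrapping-derived third template is not covered here.

```python
def seed_templates(sentences, entity_words, synonyms, symbol="[E]"):
    """Build first templates ([E] substitution) and second templates (synonym variants)."""
    templates = set()
    # Replace longer entity words first so a shorter word cannot split a longer one.
    ordered = sorted(entity_words, key=len, reverse=True)
    for sentence in sentences:
        first = sentence
        for word in ordered:
            first = first.replace(word, symbol)
        if symbol not in first:
            continue  # the sentence contains no known entity word
        templates.add(first)  # first sentence pattern template
        for word, alternatives in synonyms.items():
            if word in first:
                for alt in alternatives:
                    # second sentence pattern template: same frame, synonym swapped in
                    templates.add(first.replace(word, alt))
    return templates

# Hypothetical existing-field data.
sentences = ["信用卡怎么办理", "今天天气不错"]
templates = seed_templates(sentences, ["信用卡"], {"怎么": ["如何", "怎样"]})
```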
When the existing field and the new field are similar, for example when they share the same upper-level field, templates better suited to the new field can be generated. Therefore, when preparing the existing-field corpus, a field similar to the new field can be identified through the field concept graph and that field's corpus prepared. For example, when generating templates for mining the seed entity words of China Construction Bank, text data from other banks is preferred.
This approach is suitable for scenarios that lack sufficient new-field text data or labeled text from other fields.
A third way of implementing step S101 combines the first and second ways and specifically includes: splitting the new-field text data and the existing-field text data into new-field single sentences and existing-field single sentences respectively; determining new words in each new-field single sentence according to the allowable length of a new-field entity word, and filtering the new words according to their relevance to other fields to obtain filtered new-field seed entity words; generating sentence pattern templates from the existing-field single sentences, and matching the new-field single sentences against the templates to obtain matched new-field seed entity words; and finally merging the filtered and the matched new-field seed entity words to obtain the new-field seed entity words. In other words, entity recognition is performed on the new-field text data with seed-entity-word mining algorithms, specifically one that determines seed entity words from discovered new words and one that mines them with sentence pattern templates.
The steps of determining and filtering new words may be implemented as described for the first way, and the step of generating sentence pattern templates as described for the second way; they are not described again here.
The new-field seed entity words of this embodiment are typical entity words of the new field and serve as the starting point for finding its other entity words; that is, the seed entity words are used to expand the set of new-field entity words.
Step S102: label the new-field text data according to the new-field seed entity words to obtain labeled new-field text data.
As a way of implementing step S102, each new-field single sentence is segmented character by character to obtain the characters that make it up; each character is then labeled according to its position within the new-field seed entity words contained in the sentence. Once every new-field single sentence has been labeled, the labeled new-field text data is obtained.
The unit of segmentation is a Chinese character when recognizing Chinese named entities; for other languages it is the smallest unit making up a sentence in that language, such as a word in English.
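Character-level labeling by position can be sketched as below. The source does not name a tag scheme, so the B/I/O tags (begin, inside, outside) are an assumption, as are the function name `label_sentence` and the sample data.

```python
def label_sentence(sentence, seed_words):
    """Tag each character B (entity start), I (inside entity) or O (outside)."""
    tags = ["O"] * len(sentence)
    # Longer seed words are placed first so they are not broken up by shorter ones.
    for word in sorted(seed_words, key=len, reverse=True):
        start = sentence.find(word)
        while start != -1:
            end = start + len(word)
            # Only label spans that no earlier (longer) seed word has claimed.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = sentence.find(word, start + 1)
    return list(zip(sentence, tags))

labeled = label_sentence("信用卡怎么办理", ["信用卡"])
```

For a language such as English, the same idea would apply to a list of word tokens rather than to the characters of a string.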
Step S103: train a named entity recognition model with the labeled new-field text data to obtain a named entity recognition model suitable for the new field.
Step S104: recognize entity words in other text data of the new field with the named entity recognition model suitable for the new field.
As an example, the named entity recognition model may be the NER model commonly used in the industry, built on the deep-learning architecture of a bidirectional LSTM (Long Short-Term Memory) with a CRF (Conditional Random Field).
As another implementation, after step S101 the seed entity words may also be sent to a user interface so that a user can check them manually.
Those skilled in the art will understand that all or some of the steps of the method in the above embodiments may be carried out by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium. Accordingly, the invention may also provide a computer storage medium storing a named entity recognition program that, when executed by a processor, implements the steps of the named entity recognition method. The storage medium may include ROM/RAM, a magnetic disk, an optical disk, or a USB flash drive.
Fig. 2 is a block diagram of a named entity recognition apparatus according to an embodiment of the present invention, and as shown in fig. 2, the apparatus includes:
An entity recognition module, configured to perform entity recognition on the new-field text data with a seed-entity-word mining algorithm to obtain the new-field seed entity words; it can implement step S101 in Fig. 1.
A text labeling module, configured to label the new-field text data according to the new-field seed entity words to obtain labeled new-field text data; it can implement step S102 in Fig. 1.
A model training module, configured to train a named entity recognition model with the labeled new-field text data to obtain a named entity recognition model suitable for the new field; it can implement step S103 in Fig. 1.
A model application module, configured to recognize entity words in other text data of the new field with the named entity recognition model suitable for the new field; it can implement step S104 in Fig. 1.
Fig. 3 is a block diagram of a named entity recognition device according to an embodiment of the present invention, and as shown in fig. 3, the device includes: a processor, and a memory coupled to the processor; the memory has stored thereon a named entity recognition program executable on the processor, the named entity recognition program, when executed by the processor, implementing the steps of the named entity recognition method.
The invention mainly comprises four modules: seed entity word mining (corresponding to the entity recognition module of Fig. 2), automatic corpus labeling (corresponding to the text labeling module of Fig. 2), offline training of the NER model (corresponding to the model training module of Fig. 2), and online use of the NER model (corresponding to the model application module of Fig. 2). Specifically:
1. Mining seed entity words
This module effectively addresses field migration and the lack of labeled corpora in entity recognition. It is a core module and comprises two sub-modules: seed entity word mining by new word discovery and seed entity word mining by sentence patterns.
The new word discovery method is suitable for entity recognition scenarios in which only a new-field corpus is available, for example performing entity recognition in the telecommunications field when no labeled corpus exists for telecommunications or any other field. The sentence pattern mining method is suitable for entity recognition scenarios that combine an existing field with a new field, for example when an entity lexicon of some size already exists for China Construction Bank and entity words for Bank of China or for telecommunications can be mined quickly with sentence patterns.
2. Automatic corpus labeling
After the seed entity words have been mined, the system labels the new-field corpus automatically, which avoids tedious manual labeling and provides data for NER model training. This is the other core module of the system.
3. Offline training of the NER model
This module trains an NER model with the bidirectional LSTM (Long Short-Term Memory) + CRF (Conditional Random Field) architecture commonly used in the industry, in order to improve the generalization ability of the entity recognition system. It is a necessary module of the system.
4. Online use of the NER model
This module is necessary but not core; it follows the same usage flow as NER models elsewhere in the industry.
The invention is further described below with reference to Figs. 4 to 13 to aid understanding by those skilled in the art; the description is not intended to limit the scope of the invention.
Fig. 4 is an architecture diagram of an entity recognition system according to an embodiment of the present invention. As shown in Fig. 4, the system implements a Chinese entity recognition algorithm that is general across fields and addresses the heavy data-labeling workload and the difficulty of field migration. It mainly comprises four modules: seed entity word mining, automatic corpus labeling, offline training of the NER model, and online use of the NER model.
Online use of the NER model ensures the completeness of the system; it is not a core module and follows the same usage flow as NER models in the industry. Offline training of the NER model relies on the bidirectional LSTM + CRF deep-learning architecture commonly used in the industry and serves to improve the generalization ability of the entity recognition system. Seed entity word mining and automatic corpus labeling effectively address the field-migration problem and are the core modules. To further improve accuracy, manual verification may be introduced before automatic corpus labeling, as shown in the semi-automatic entity recognition system architecture diagram of Fig. 5. The system can be applied in various products, such as intelligent call centers, smart set-top boxes, and intelligent knowledge bases, to improve their accuracy and reduce manual work.
Fig. 6 is a flowchart of mining seed entity words by new word discovery according to an embodiment of the present invention. As shown in Fig. 6, the flow is suitable for entity recognition scenarios in which only a new-field corpus is available, for example performing entity recognition in the telecommunications field when no labeled corpus exists for telecommunications or any other field. The method mainly uses a new word discovery algorithm to mine the field's seed entity words, which can be applied quickly in products or in the subsequent training of the entity recognition model. A new field only needs to provide a corresponding corpus, with no re-labeling required, so this is one of the core modules of the invention.
Corpus preparation: the new-field text data may be FAQ (Frequently Asked Questions) question-answer pairs, or passages and other text corpora.
Interface input parameters: the prepared text data, the field information, and the maximum length of a field entity are taken as input parameters; an input-parameter message containing them is generated and passed in through the interface. For example, the input-parameter message may be a JSON (JavaScript Object Notation) message that mainly includes the field of the corpus, the text data, and the maximum length of a field entity word. The specific format is as follows:
The steps of Fig. 6 are described in detail below.
Step 301: acquire the new-field corpus.
The new-field corpus (that is, the new-field text data) is extracted from the input-parameter message (for example, the JSON message) and stored in a cache.
Step 302: split into clauses.
The text data is split into single sentences at punctuation marks, stop words, and stop phrases.
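Splitting at punctuation marks and stop words can be sketched with a single regular expression. The punctuation set, the stop word list, and the function name `split_clauses` are assumptions for illustration; the source does not list the actual delimiters.

```python
import re

def split_clauses(text, stop_words=()):
    """Split text at punctuation marks and at stop words or stop phrases."""
    # Longer stop phrases come first so they win over their own prefixes.
    parts = [re.escape(w) for w in sorted(stop_words, key=len, reverse=True)]
    parts.append(r"[，。！？；：、,.!?;:\s]")  # illustrative punctuation set
    pattern = "|".join(parts)
    # Drop the empty strings produced by adjacent delimiters.
    return [clause for clause in re.split(pattern, text) if clause]

clauses = split_clauses("信用卡怎么办理？请问借记卡如何挂失。", stop_words=["请问"])
```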
Step 303: mine new words.
The implementation flow is shown in Fig. 7 and includes: step 401, obtaining the sentence-broken texts (that is, the clauses or single sentences); step 402, counting all phrases in each sentence-broken text whose length is within the limit (that is, the maximum length of a field entity); step 403, computing the features of each phrase; step 404, applying threshold filtering according to the features of each phrase; and step 405, obtaining the final new words. That is, the phrase combinations satisfying the length limit are counted first; features of each phrase, such as mutual information, left and right information entropy, word frequency, and part of speech, are then computed to obtain candidate new words; finally, the candidates are filtered against empirical thresholds to obtain the final new words.
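Two of the features named above, mutual information and left and right information entropy, can be computed as in the sketch below. The exact formulas used by the patent are not given, so this uses a common formulation (pointwise mutual information of the phrase against its independent characters, and Shannon entropy of the neighbouring characters); the function names and sample clauses are hypothetical.

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy (natural log) of a neighbour-character distribution."""
    total = sum(counter.values())
    return -sum(n / total * math.log(n / total) for n in counter.values()) if total else 0.0

def phrase_features(phrase, clauses):
    text_len = max(1, sum(len(c) for c in clauses))
    char_counts = Counter(ch for c in clauses for ch in c)
    freq, left, right = 0, Counter(), Counter()
    for clause in clauses:
        start = clause.find(phrase)
        while start != -1:
            freq += 1
            if start > 0:
                left[clause[start - 1]] += 1   # character just before the phrase
            end = start + len(phrase)
            if end < len(clause):
                right[clause[end]] += 1        # character just after the phrase
            start = clause.find(phrase, start + 1)
    # High PMI: the characters co-occur far more often than chance would predict.
    p_indep = math.prod(char_counts[ch] / text_len for ch in phrase)
    pmi = math.log((freq / text_len) / p_indep) if freq else float("-inf")
    return {"freq": freq, "pmi": pmi,
            "left_entropy": entropy(left), "right_entropy": entropy(right)}

features = phrase_features("信用卡", ["办理信用卡", "信用卡还款", "申请信用卡吧"])
```

A phrase with high internal cohesion (PMI) and varied neighbours on both sides (entropy) is more likely to be a complete word than a fragment of a longer one.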
Step 304: filter the new words.
Filtering mainly judges whether the new words are related to other industries. The relevance between new words and other industries is determined with the field concept graph and a keyword extraction algorithm. The corpora of the various industries are obtained with web crawlers. The hypernym-hyponym relationships of the concept graph are mainly used to determine the industry to which the current field belongs; the storage structure of the graph is shown in Fig. 10. The TF-IDF (Term Frequency-Inverse Document Frequency) keyword extraction algorithm judges the importance of each new word in the current field, and the TF-IDF value is taken as the probability score of the entity word. The final entity words are obtained by applying an empirical threshold.
Step 305: output the seed entity words.
A JSON message of the field entity words is generated in this core module, which makes log capture and information retrieval convenient. The message format is as follows:
Here "zxner_domain" is the field of the entities and "zxner_result" is the entity recognition result, an array containing each entity word and its corresponding score.
Fig. 8 is a flowchart of mining seed entity words by sentence patterns according to an embodiment of the present invention. As shown in Fig. 8, the flow is suitable for entity recognition scenarios that combine an existing field with a new field, for example when an entity lexicon of some size already exists for China Construction Bank and entity words for Bank of China or for telecommunications can be mined quickly with sentence patterns. The method mainly uses sentence structure to mine the field's entity words, which can be applied quickly in products or in the subsequent training of the entity recognition model. Extending to a new scenario only requires a corresponding corpus, with no re-labeling, so this is one of the core modules of the invention.
Corpus preparation: the new-field text data may be FAQ question-answer pairs, or passages and other text corpora. The entity lexicon and text data of the existing field are used for sentence pattern mining.
Interface input parameters: the prepared new-field text data, new-field information, maximum length of a new-field entity, existing-field text data, existing-field information, and existing-field entity words are taken as input parameters; an input-parameter message containing them is generated and passed in through the interface. For example, the input-parameter message may be a JSON message that mainly includes the field of the corpus, the text data, and the maximum length of a field entity word. The specific format is as follows:
The respective steps of fig. 8 are described in detail below.
Step 501: Acquire the existing-field corpus, including entity words.
Extract the existing-field text data and entity words from the input parameter message.
Step 502: Split into clauses.
Split the existing-field text data into single sentences according to punctuation marks, stop words, and stop phrases.
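The clause splitting of step 502 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_clauses`, the punctuation set, and the choice to treat stop phrases as additional split points are all assumptions, since the text only states that splitting follows punctuation marks and stop words.

```python
import re

# Sentence-final punctuation, ASCII and full-width. This delimiter set is an
# assumption; the text only says "punctuation marks".
SENTENCE_PUNCTUATION = r"[。！？；!?;.\n]"


def split_clauses(text, stop_phrases=()):
    """Split text into single sentences at punctuation and at stop phrases."""
    # Longer stop phrases go first so they win over their own prefixes.
    alternatives = [
        re.escape(phrase)
        for phrase in sorted(stop_phrases, key=len, reverse=True)
        if phrase
    ]
    alternatives.append(SENTENCE_PUNCTUATION)
    splitter = re.compile("|".join(alternatives))
    # Drop the empty pieces left by consecutive delimiters.
    return [piece.strip() for piece in splitter.split(text) if piece.strip()]


if __name__ == "__main__":
    print(split_clauses("信用卡怎么办理？借记卡的办理方式是什么。"))
```

The same helper can serve steps 506 and 802, which split the new-field text in the same way.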
Step 503: Mine sentence patterns.
The implementation flow is shown in Fig. 9. Step 601: obtain the clause text (i.e., the clauses or single sentences); step 602: replace the entity words in the clause text with [E] (i.e., a preset entity word mining symbol); step 603: replace the other words or phrases in the clause text with synonyms or synonymous phrases; step 604: obtain the seed sentence pattern templates after replacement; step 605: apply the Bootstrapping algorithm; step 606: obtain more sentence pattern templates. That is, the entity words in the single sentences from step 502 are first replaced with [E]; the other words in the resulting seed templates are then replaced with synonyms or synonymous phrases; finally, the Bootstrapping algorithm is used to mine more sentence pattern templates.
Step 504: Store the sentence pattern templates.
The sentence pattern templates formed in step 503 are stored with the following structure:
Template (pattern)    Field (domain)
[E] How to do         Construction Bank
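Steps 602, 604, and 504 can be illustrated with a short sketch that replaces known entity words with the [E] symbol and stores each resulting template together with its field. The function name `make_seed_templates` and the dictionary keys are hypothetical; synonym substitution (step 603) and Bootstrapping (steps 605 and 606) are deliberately left out.

```python
def make_seed_templates(clauses, entity_words, domain, slot="[E]"):
    """Replace entity words in each clause with the slot symbol."""
    templates = []
    # Replace longer entity words first so that a short word does not
    # break up a longer one that contains it.
    ordered = sorted((w for w in entity_words if w), key=len, reverse=True)
    for clause in clauses:
        pattern = clause
        for word in ordered:
            pattern = pattern.replace(word, slot)
        # Keep only clauses that contained an entity plus some context.
        if slot in pattern and pattern != slot:
            record = {"pattern": pattern, "domain": domain}
            if record not in templates:
                templates.append(record)
    return templates


if __name__ == "__main__":
    existing_clauses = [
        "How to handle the Tian Yi navigation A8 package",
        "What is the handling method of the business navigation package",
    ]
    existing_entities = [
        "Tian Yi navigation A8 package",
        "business navigation package",
        "campus package",
    ]
    for template in make_seed_templates(existing_clauses, existing_entities, "telecommunication"):
        print(template)
```

Storing the field next to each pattern mirrors the two-column structure above and lets later steps rank templates by field relevance.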
Step 505: Acquire the new-field corpus.
Extract the new-field text data from the input parameter message.
Step 506: Split into clauses.
Split the new-field text data into single sentences according to punctuation marks, stop words, and stop phrases.
Step 507: Match sentence patterns.
Sort the sentence pattern templates by the relevance between the new field and the existing fields, match the templates against the new-field single sentences, and extract candidate entity words.
The relevance between fields is mainly obtained from a concept knowledge graph, whose storage structure is shown in Fig. 10. Field relevance depends on two parts of the graph: first, the hypernym-hyponym relations between industries, where fields sharing the same hypernym are the most relevant, for example different banks; second, sentence structure similarity between industries, for example finance and telecom operators.
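The matching in step 507 can be sketched with regular expressions, assuming each template contains one [E] slot and that the field relevance weights have already been read from the concept graph. The function name `match_entities`, the `domain_relevance` dictionary, and the use of the maximum entity length to bound the slot are illustrative assumptions.

```python
import re


def match_entities(clauses, templates, domain_relevance, max_len, slot="[E]"):
    """Return (candidate entity word, pattern) pairs found in the clauses."""
    # Try templates from the most relevant existing field first.
    ranked = sorted(
        templates,
        key=lambda t: domain_relevance.get(t["domain"], 0.0),
        reverse=True,
    )
    found = []
    for template in ranked:
        pattern = template["pattern"]
        if slot not in pattern:
            continue
        # Only the first slot is handled in this sketch.
        head, _, tail = pattern.partition(slot)
        regex = re.compile(
            re.escape(head) + "(.{1,%d})" % max_len + re.escape(tail)
        )
        for clause in clauses:
            match = regex.fullmatch(clause)
            if match:
                candidate = match.group(1).strip()
                if candidate and (candidate, pattern) not in found:
                    found.append((candidate, pattern))
    return found


if __name__ == "__main__":
    templates = [
        {"pattern": "What is the handling method of the [E]", "domain": "technology"},
        {"pattern": "How to handle the [E]", "domain": "telecommunication"},
    ]
    new_clauses = [
        "How to handle the credit card",
        "What is the handling method of the debit card",
        "Credit card password modification",
    ]
    relevance = {"telecommunication": 0.8, "technology": 0.2}
    print(match_entities(new_clauses, templates, relevance, max_len=12))
```

Bounding the slot by the maximum entity length keeps a loose template from swallowing a whole sentence as a single candidate.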
Step 508: Output the seed entity words.
Generate a JSON message of the field entity words. Because this is a core module, the JSON message makes it convenient to capture logs and retrieve information. The message format is as follows:
Here "zxner_domain" is the field of the entity and "zxner_result" is the entity recognition result, an array containing each entity word and the template corresponding to that word.
Fig. 11 is a flowchart of automatic corpus labeling according to an embodiment of the present invention. As shown in Fig. 11, this module automatically labels the new-field corpus (i.e., the new-field text data) according to the seed entity words, reducing manual labeling work.
Besides serving this system, the automatic corpus labeling flow can be applied to other sequence labeling systems, such as a word segmentation system. When applied to other systems, the interface input parameters (i.e., parameters passed in through the interface) are the field of the corpus, the entity words, and the corresponding corpus; an input parameter message containing them is generated and passed in through the interface. For example, the input parameter message may be a JSON message that mainly includes the field of the corpus, the entity words, and the corresponding corpus. The specific format is as follows:
The respective steps of fig. 11 are described in detail below.
Step 801: Acquire the new-field corpus.
Extract the new-field corpus from the input parameter message; it comprises the field text data and the seed entity words.
Step 802: Split into clauses.
Split the new-field text data into single sentences according to punctuation marks, stop words, and stop phrases.
Step 803: Segment by character.
All clauses are segmented character by character, which reduces the influence of word segmentation errors on the system's results.
Step 804: Determine the position of each character within a seed entity word: the start position is labeled B, a middle position I, the end position E, and a character outside any entity word O.
The output is a JSON message of the labeled corpus, which makes it convenient to capture logs and retrieve information. The message format is as follows:
Here "zxner_domain" is the field of the entity and "zxner_result" is the corpus labeling result, an array containing each single-sentence corpus and the label of each character.
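Steps 803 and 804 can be sketched as a character-level tagger. The function name `tag_clause` is hypothetical, and two behaviors are assumptions the text does not specify: overlapping entity words are resolved in favor of the longest one, and a single-character entity receives only the B label.

```python
def tag_clause(clause, entity_words):
    """Label every character of a clause with B, I, E, or O."""
    chars = list(clause)
    tags = ["O"] * len(chars)
    # Longest entity words first, so overlaps favor the longer word.
    for word in sorted((w for w in entity_words if w), key=len, reverse=True):
        start = clause.find(word)
        while start != -1:
            end = start + len(word)
            # Skip occurrences that overlap an already labeled entity.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                for index in range(start + 1, end - 1):
                    tags[index] = "I"
                if len(word) > 1:
                    tags[end - 1] = "E"
            start = clause.find(word, start + 1)
    return list(zip(chars, tags))


if __name__ == "__main__":
    for char, tag in tag_clause("信用卡怎么办理", ["信用卡"]):
        print(char, tag)
```

Working on characters rather than segmented words means the labels never depend on a word segmenter, which is the motivation given in step 803.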
Fig. 12 is a flowchart of semi-automatic entity recognition for the new field only, according to Example 1 of the present invention. As shown in Fig. 12, this embodiment illustrates the following application scenario: only new-field entities are recognized.
Step 901: Acquire the new-field corpus.
Receive the input parameter message and extract the new-field corpus from it. The new-field corpus includes the new-field text data "Qianyuan Overflow is a financial product. What is the yield of Qianyuan Overflow? Qianyuan Overflow is a stable financial product." and the new-field information "Construction Bank".
Step 902: Discover seed entity words through new word discovery.
1) Clause splitting result:
Qianyuan Overflow is a financial product
What is the yield of Qianyuan Overflow
Qianyuan Overflow is a stable financial product
2) New word mining:
Qianyuan, Qianyuan Overflow, financial product, yield
3) New word filtering:
The field concept graph determines that the current data belongs to the financial field, and TF-IDF is used to calculate the relevance of the new words to other industries such as telecom operators and technology. The larger the score, the higher the relevance to the Construction Bank field and the lower the relevance to other fields.
Qianyuan: 2.34
Qianyuan Overflow: 2.12
Financial product: 2.08
Yield: 1.83
4) Mine seed entity words; the output result is:
Step 903: Manually check and confirm the entity words.
Product for promoting the circulation of qi and blood
Step 904: automatic marking of corpus: the corpus is labeled by classical BIES in the industry.
Step 905: and (5) training the model.
In the same way as in the industry, it is not described in detail here.
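The new word filtering in step 902 can be sketched as follows. The text states only that a field concept graph and TF-IDF are combined and that words scoring above an empirical threshold are kept; the exact formula is not given. This sketch therefore assumes one possible combination: a smoothed IDF in which each other field containing the word counts for (1 - relevance weight). The names `relevance_score`, `filter_new_words`, and `domain_weights` are hypothetical, and the resulting numbers are not meant to reproduce the scores listed in the example.

```python
import math


def relevance_score(word, new_clauses, other_corpora, domain_weights):
    """Score how specific a word is to the new field (higher is more specific)."""
    # Term frequency: average number of occurrences per new-field clause.
    tf = sum(clause.count(word) for clause in new_clauses) / max(len(new_clauses), 1)
    # Weighted document frequency: each other field that contains the word
    # counts for (1 - weight), so closely related fields penalize it less.
    weighted_df = sum(
        1.0 - domain_weights.get(domain, 0.0)
        for domain, clauses in other_corpora.items()
        if any(word in clause for clause in clauses)
    )
    idf = math.log((1 + len(other_corpora)) / (1 + weighted_df)) + 1
    return tf * idf


def filter_new_words(candidates, new_clauses, other_corpora, domain_weights, threshold):
    """Keep candidates whose score exceeds the threshold, best first."""
    scored = [
        (word, relevance_score(word, new_clauses, other_corpora, domain_weights))
        for word in candidates
    ]
    kept = [(word, score) for word, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    new_clauses = ["Qianyuan is a product", "what is the yield of Qianyuan"]
    other_corpora = {
        "telecommunication": ["which product is cheapest"],
        "technology": ["new product launch"],
    }
    weights = {"telecommunication": 0.5, "technology": 0.1}
    print(filter_new_words(["Qianyuan", "product"], new_clauses, other_corpora, weights, 1.0))
```

A word that also appears in other fields, such as "product" here, receives a lower score and falls below the threshold, which matches the stated intent of keeping words specific to the new field.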
Fig. 13 is a flowchart of semi-automatic entity recognition combining the existing field and the new field, according to Example 2 of the present invention. As shown in Fig. 13, this embodiment illustrates the following application scenario: entity recognition combining an existing field and a new field.
Step 1001: Acquire the new-field corpus and the existing-field corpus.
Receive the input parameter message and extract the corpora from it: the new-field text data "Credit card password modification? How is a credit card handled? What is the handling method of a debit card?", the new-field information "Construction Bank", the existing-field information "telecommunication", the existing-field entity words "Tian Yi navigation A8 package, business navigation package, campus package", and the existing-field text data "How to handle the Tian Yi navigation A8 package? What is the handling method of the business navigation package?".
Step 1002: New word discovery plus sentence pattern mining of seed entity words.
1) Clause splitting results:
New-field corpus clause splitting result:
Credit card password modification
What is the handling method of the debit card
How is the credit card handled
Existing-field corpus clause splitting result:
What is the handling method of the business navigation package
How to handle the Tian Yi navigation A8 package
2) New word mining and sentence pattern mining
New word mining: credit card
Sentence pattern mining: What is the handling method of [E]
What is the activation method of [E]
3) New word filtering: the field concept graph determines that the current data belongs to the financial field, and TF-IDF is used to calculate the relevance of the new words to other industries such as telecom operators and technology. The larger the score, the higher the relevance to the Construction Bank field and the lower the relevance to other fields.
Credit card: 2.39
4) Sentence pattern matching: according to the field concept graph, the data belongs to the financial field, which is highly relevant to the telecommunication field, so the sentence pattern templates can be matched to obtain results.
What is the handling method of [E]
How to handle [E]: credit card
5) Mine seed entity words; the output result is:
Step 1003: Manually check and confirm the entity words.
credit card, debit card
Step 1004: automatic marking of corpus: the corpus is labeled by classical BIES in the industry.
5) the 1005 module trains the model: in the same way as in the industry, it is not described in detail here.
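Step 1002 combines two sources of seed entity words: new words discovered in the new-field text and words extracted by sentence pattern matching. The sketch below illustrates that combination under stated assumptions. Candidate new words are approximated by repeated character n-grams no longer than the maximum entity length, which is a simple stand-in and not the new word discovery algorithm of the invention; `candidate_new_words`, `merge_seed_words`, `min_len`, and `min_count` are hypothetical names and parameters.

```python
from collections import Counter


def candidate_new_words(clauses, max_len, min_len=2, min_count=2):
    """Collect character n-grams that repeat across the clauses."""
    counts = Counter()
    for clause in clauses:
        for size in range(min_len, max_len + 1):
            for start in range(len(clause) - size + 1):
                counts[clause[start:start + size]] += 1
    return [word for word, count in counts.items() if count >= min_count]


def merge_seed_words(filtered_words, matched_words):
    """Union of both sources, without duplicates, keeping first-seen order."""
    merged = []
    for word in list(filtered_words) + list(matched_words):
        if word not in merged:
            merged.append(word)
    return merged


if __name__ == "__main__":
    print(candidate_new_words(["信用卡密码修改", "信用卡怎么办理"], max_len=3))
    print(merge_seed_words(["credit card"], ["credit card", "debit card"]))
```

In practice the raw candidates would still pass through relevance filtering and the manual check of step 1003 before being used for labeling.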
In summary, the embodiments of the present invention have the following technical effects:
Compared with other traditional algorithms, the system provided by the embodiments of the present invention adds an entity seed set mining module (implementing the function of the entity recognition module in Fig. 2). It uses techniques such as new word discovery, keyword extraction, sentence pattern mining, and the field concept graph to mine the seed entity set of the new field automatically, labels the corpus automatically, and then trains a model with the deep learning bidirectional LSTM + CRF algorithm. This reduces the data labeling workload, lowers the threshold for model transfer training, and improves the field generality of the algorithm. The system suits a variety of scenarios and can be embedded in various artificial intelligence (AI) devices such as voice assistants, intelligent customer service, and intelligent knowledge bases.
Although the present invention has been described in detail hereinabove, the present invention is not limited thereto, and various modifications can be made by those skilled in the art in light of the principle of the present invention. Thus, modifications made in accordance with the principles of the present invention should be understood to fall within the scope of the present invention.

Claims (10)

1. A named entity recognition method, comprising:
Entity recognition is carried out on the new field text data to obtain new field seed entity words;
according to the new field seed entity words, marking the new field text data to obtain marked new field text data;
Training a named entity recognition model by using the labeled new field text data to obtain a named entity recognition model suitable for the new field;
and identifying entity words in other text data of the new field by using the named entity identification model applicable to the new field.
2. The method of claim 1, wherein the performing entity recognition on the new domain text data to obtain new domain seed entity words comprises:
Splitting the new field text data into new field single sentences;
Determining a new word in each new field single sentence according to the allowable length of the new field entity word;
And filtering the new words according to the correlation between the new words and the new fields to obtain the seed entity words of the new fields.
3. The method of claim 1, wherein the performing entity recognition on the new domain text data to obtain new domain seed entity words comprises:
splitting the new field text data and the existing field text data into a new field single sentence and an existing field single sentence respectively;
Generating a sentence pattern template by using the existing field single sentence;
and determining the new field seed entity words in the new field single sentence by matching the new field single sentence with the sentence pattern template.
4. The method of claim 1, wherein the performing entity recognition on the new domain text data to obtain new domain seed entity words comprises:
Splitting the new field text data and the existing field text data into a new field single sentence and an existing field single sentence respectively;
Determining a new word in each new field single sentence according to the allowable length of the new field entity word, and filtering the new word according to the correlation between the new word and the new field to obtain a filtered new field seed entity word;
Generating a sentence pattern template by utilizing the existing field single sentence, and obtaining a matched new field seed entity word by matching the new field single sentence with the sentence pattern template;
And merging the filtered new field seed entity words and the matched new field seed entity words to obtain the new field seed entity words.
5. The method according to claim 2 or 4, wherein the filtering the new word according to the relevance of the new word to the new domain comprises:
Determining a correlation score representing the correlation between the new word and the new field by using a field concept map representing the correlation between fields and a term frequency-inverse document frequency (TF-IDF) algorithm;
Filtering the new words according to the correlation scores and an empirical threshold, and taking the new words whose correlation scores are higher than the empirical threshold as the new field seed entity words.
6. The method of claim 5, wherein determining the correlation score representing the correlation between the new word and the new field by using the field concept map representing the correlation between fields and the term frequency-inverse document frequency algorithm comprises:
Acquiring the correlation weight between the new field and other fields according to the field concept map;
Determining a probability score representing the importance of the new word to the new field by using the term frequency-inverse document frequency algorithm;
Determining the correlation score between the new word and the new field by using the correlation weight and the probability score.
7. The method according to any one of claims 2 to 4, wherein the labeling the new domain text data according to the new domain seed entity word to obtain labeled new domain text data comprises:
For each new field single sentence, performing character-based segmentation on the new field single sentence to obtain the characters forming the new field single sentence;
Marking each character of the new field single sentence according to the position of each character in the new field seed entity words contained in the new field single sentence;
And marking all the new field single sentences to obtain marked new field text data.
8. A named entity recognition device, wherein the device comprises:
The entity recognition module is used for carrying out entity recognition on the new field text data to obtain new field seed entity words;
the text marking module is used for marking the new field text data according to the new field seed entity words to obtain marked new field text data;
The model training module is used for training a named entity recognition model by utilizing the labeled new field text data to obtain the named entity recognition model suitable for the new field;
and the model application module is used for identifying entity words in other text data of the new field by utilizing the named entity identification model suitable for the new field.
9. A named entity recognition device, wherein the device comprises: a processor, and a memory coupled to the processor; the memory has stored thereon a named entity recognition program executable on the processor, the named entity recognition program, when executed by the processor, implementing the steps of the named entity recognition method according to any one of claims 1 to 7.
10. A computer storage medium, having stored thereon a named entity recognition program which, when executed by a processor, carries out the steps of the named entity recognition method according to any one of claims 1 to 7.
CN201810556103.4A 2018-06-01 2018-06-01 named entity identification method, device, equipment and storage medium Pending CN110555206A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810556103.4A CN110555206A (en) 2018-06-01 2018-06-01 named entity identification method, device, equipment and storage medium
PCT/CN2019/089325 WO2019228466A1 (en) 2018-06-01 2019-05-30 Named entity recognition method, device and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810556103.4A CN110555206A (en) 2018-06-01 2018-06-01 named entity identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110555206A true CN110555206A (en) 2019-12-10

Family

ID=68698713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810556103.4A Pending CN110555206A (en) 2018-06-01 2018-06-01 named entity identification method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110555206A (en)
WO (1) WO2019228466A1 (en)

Also Published As

Publication number Publication date
WO2019228466A1 (en) 2019-12-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination