CN111178080A - Named entity identification method and system based on structured information

Info

Publication number: CN111178080A
Application number: CN202010002138.0A
Authority: CN
Inventors: 周彬; 牛迪; 任天成
Original assignee: Hangzhou Tuya Information Technology Co Ltd
Current assignee: Hangzhou Tuya Information Technology Co Ltd
Priority date: 2020-01-02
Filing date: 2020-01-02
Publication date: 2020-05-19
Anticipated expiration: 2040-01-02
Also published as: CN111178080B

Abstract

The application discloses a named entity identification method based on structured information, which comprises the following steps: performing structured processing on a sentence and obtaining a processing result; obtaining structural features according to the processing result; and preprocessing the sentence or performing sequence labeling on it according to the distribution of the structural features. Compared with the prior art, the method has the following beneficial effects: a Chinese named entity recognition method based on the structural information of characters and words is provided; a domain dictionary is constructed for word segmentation so as to ensure the accuracy of entity boundaries; and the semantic information contained in each character and word is analyzed from the structural features and then used as a basis for judging entities.

Description

Named entity identification method and system based on structured information

Technical Field

The application relates to the field of named entity identification, in particular to a named entity identification method based on structured information.

Background

Named Entity Recognition (NER), also known as proper name recognition, is a task in information extraction and has a wide application range. Named entities generally refer to entities in text that have a particular meaning or strong reference, and typically include names of people, places, organizations, dates and times, and the like. The NER system extracts the entities from the unstructured input text and can identify more classes of entities according to business requirements.

Currently, algorithms for named entity recognition mainly use machine learning and deep learning models to perform sequence labeling on a single sentence. Sequence labeling refers to marking each character or word in a sentence; for example, in "I love China", "China" is labeled as a place-name entity. However, in practical applications, a word may point to multiple entities: "Red" in "play the TV series Red" is a TV-series-name entity, whereas "red" in "set the light to red" is a light color and a common word. This phenomenon is also called word ambiguity. When processing such ambiguous words, current algorithm models use only the representation information of characters or words and do not use deeper structural features, so it is difficult for them to predict entities accurately. In the above example, an existing algorithm would, with high probability, judge "red" in every sentence to be a TV series name.

On the other hand, current named entity recognition algorithms suffer from incorrect boundary division. Chinese named entity recognition algorithms can be classified into word-based methods and character-based methods. A word-based method first performs word segmentation and then performs entity judgment on the words. However, word segmentation errors cause entity boundary errors, and the problem is serious in the open domain because cross-domain word segmentation is still an unsolved problem. For example, "Nanjing Yangtze River Bridge" may be segmented into "Nanjing/City Yangtze/river bridge", and the current algorithm may then judge "river bridge" to be a person's name. Character-based named entity recognition does not need word segmentation in advance; whether each character is part of an entity is judged directly on the character. Although this overcomes some defects of word segmentation, it cannot use explicit word and word-sequence information.

Disclosure of Invention

The main objective of the present application is to provide a named entity identification method based on structured information, which includes:

performing structured processing on a sentence and obtaining a processing result;

obtaining structural features according to the processing result; and

preprocessing the sentence or performing sequence labeling on it according to the distribution of the structural features.

Optionally, performing the structured processing and obtaining the processing result comprises: using databases to perform word segmentation on the sentence, performing matching retrieval on each word in the sentence, and obtaining one of three processing results for each word: 1) the word does not exist in any of the databases; 2) the word exists in only one database; 3) the word exists in two or more databases.

Optionally, obtaining the structured feature according to the processing result includes:

according to the processing result, obtaining the structural feature of each word in the sentence, and obtaining one of three types of distribution result for the sentence according to the structural features of the words: 1) each word in the sentence exists in only one database; 2) more than one word in the sentence does not exist in any database; 3) more than one word in the sentence exists in two or more databases.

Optionally, according to the distribution of the structural features, the sentence preprocessing or sequence labeling includes:

when the distribution result of the structural features belongs to the category 1), giving a corresponding label to each word by the rule layer according to the type of the corresponding database; and when the distribution result of the structural features belongs to the category 2) or 3), preprocessing the sentence, and labeling each word in the sentence through a sequence labeling algorithm model so as to judge the entity type of the sentence.

Optionally, the sequence labeling model adopts a deep learning method, including:

constructing the characteristics of words, and performing vectorization processing on each word in the sentence to obtain a word vector;

constructing character features, and inputting the character vectors into a convolutional neural network to obtain the features of the characters;

matching and searching each word in the sentence through a database to obtain the structural feature of each word, and vectorizing the structural feature to obtain a structural feature vector;

splicing the word vector, the character features and the structural feature vector to obtain a splicing result, and inputting the splicing result to a bidirectional long-short term memory model to obtain a hidden layer vector;

inputting the hidden layer vector to a fully connected network to obtain logits, that is, a probability distribution which has not been subjected to normalization processing in deep learning;

and inputting the logits to a conditional random field to obtain an entity label.
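As a non-limiting illustration, the following Python sketch shows one way the input to the bidirectional long short-term memory model might be assembled from a word vector, character features and a structural feature vector. The toy dictionaries, the multi-hot encoding of the structural feature, the filter weights and the helper names `structural_vector`, `char_cnn` and `splice` are assumptions made for illustration only; the application does not fix these details. The convolution is simplified (no bias, no nonlinearity), and the BiLSTM, fully connected and conditional random field layers are omitted.

```python
from typing import Dict, List, Set

# Hypothetical toy dictionaries (illustrative entries only).
DICTIONARIES: Dict[str, Set[str]] = {
    "place": {"中国"},
    "tv_series": {"红色"},
    "color": {"红色"},
    "common": {"我", "爱", "播放"},
}
DICTIONARY_ORDER = ["place", "tv_series", "color", "common"]


def structural_vector(word: str, dictionaries: Dict[str, Set[str]],
                      order: List[str]) -> List[float]:
    """Multi-hot vector: one slot per dictionary, 1.0 if the word is found there."""
    return [1.0 if word in dictionaries[name] else 0.0 for name in order]


def char_cnn(char_vectors: List[List[float]], filters: List[List[float]],
             width: int = 2) -> List[float]:
    """Simplified 1-D convolution over character vectors with max-over-time pooling."""
    dim = len(char_vectors[0])
    # Zero-pad short words so that at least one window exists.
    padded = char_vectors + [[0.0] * dim] * max(0, width - len(char_vectors))
    # Each window is the concatenation of `width` consecutive character vectors.
    windows = [sum(padded[i:i + width], []) for i in range(len(padded) - width + 1)]
    # One output value per filter: the maximum response over all windows.
    return [max(sum(w * x for w, x in zip(f, win)) for win in windows) for f in filters]


def splice(word_vec: List[float], char_feat: List[float],
           struct_vec: List[float]) -> List[float]:
    """Concatenate the three feature groups into one input vector for the BiLSTM."""
    return word_vec + char_feat + struct_vec


# Toy values: a 2-dimensional word vector, two 2-dimensional character vectors,
# and two hand-written filters of size width * dim = 4.
word_vec = [0.1, 0.2]
char_vecs = [[1.0, 0.0], [0.0, 1.0]]
filters = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, -1.0]]
model_input = splice(word_vec,
                     char_cnn(char_vecs, filters),
                     structural_vector("红色", DICTIONARIES, DICTIONARY_ORDER))
```

In this sketch the ambiguous word carries two active slots in its structural feature vector, which is the kind of signal the downstream model could use to distinguish it from a word found in a single dictionary.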

Optionally, the pre-processing comprises: stop word recognition, numerical value conversion and wrongly written character correction.

According to an aspect of the present application, there is provided a named entity recognition system based on structured information, including: a preprocessing module, a structural processing module, a rule layer module and a sequence labeling model module,

the structured processing module is used for performing structured processing on sentences, obtaining processing results and obtaining structural features according to the processing results; the preprocessing module is used for preprocessing sentences according to the distribution of the structural features; and the rule layer module, together with the sequence labeling model module, is used for performing sequence labeling on sentences according to the distribution of the structural features.

The application also discloses a computer device, which comprises a memory, a processor and a computer program stored in the memory and capable of being executed by the processor, wherein the processor realizes the method of any one of the above items when executing the computer program.

The application also discloses a computer-readable storage medium, a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of any of the above.

The present application also discloses a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method of any of the above.

Compared with the prior art, the method has the following beneficial effects:

named entity recognition is carried out on the basis of the structural information of characters and words: structural features are obtained by analyzing the characters and words in a sentence, and the semantic information contained in the structural features is used as a basis for judging entity types;

in the vectorized input of the sequence labeling model, the word vector, the structural features obtained through dictionary matching retrieval and the character features obtained through CNN encoding are fused;

the named entity recognition system uses the structural features to divide sentences into three types, and entity categories can be judged through a rule layer or a sequence labeling model accordingly.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:

FIGS. 1-3 are schematic flow diagrams of a named entity identification method based on structured information according to an embodiment of the present application;

FIGS. 4-5 are schematic diagrams of sequence labeling processes according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a computer device according to one embodiment of the present application; and

FIG. 7 is a schematic diagram of a computer-readable storage medium according to one embodiment of the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Referring to fig. 1 to fig. 3, an embodiment of the present application provides a named entity identification method based on structured information, including:

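performing structured processing on a sentence and obtaining a processing result;

obtaining structural features according to the processing result; and

preprocessing the sentence or performing sequence labeling on it according to the distribution of the structural features.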

In an embodiment of the present application, performing the structured processing and obtaining the processing result includes: using databases to perform word segmentation on the sentence, performing matching retrieval on each word in the sentence, and obtaining one of three processing results for each word: 1) the word does not exist in any of the databases; 2) the word exists in only one database; 3) the word exists in two or more databases.

In an embodiment of the present application, obtaining the structural feature according to the processing result includes:

In an embodiment of the present application, performing sentence preprocessing or sequence labeling according to the distribution of the structural features includes:

Referring to fig. 4-5, in an embodiment of the present application, the sequence annotation model is used for a deep learning method, including:

In an embodiment of the present application, the preprocessing includes: stop word recognition, numerical value conversion and wrongly written character correction.

The application also provides a named entity recognition system based on structured information, which comprises: preprocessing, structural processing, a rule layer and a sequence labeling model.

Step 1: Structured processing.

In this step, the dictionaries are used to perform word segmentation on the sentence, then each word in the sentence is subjected to matching retrieval, and one of three results is obtained for each word:

① the word does not exist in any dictionary;

② the word exists in only one dictionary;

③ the word exists in two or more dictionaries.
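As a non-limiting illustration of step 1, the following Python sketch segments a sentence with the dictionaries and then classifies each word into result ①, ② or ③. Forward maximum matching is used here only as an assumed segmentation strategy, since the application does not name one; the toy dictionaries and the function names `segment`, `lookup` and `match_result` are likewise hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy dictionaries (illustrative entries only).
DICTIONARIES: Dict[str, Set[str]] = {
    "place": {"中国", "南京"},
    "tv_series": {"红色"},
    "color": {"红色"},
    "common": {"我", "爱", "播放"},
}


def segment(sentence: str, dictionaries: Dict[str, Set[str]],
            max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest dictionary word at each position.

    Characters that start no dictionary word are emitted as single-character words.
    """
    vocab = set().union(*dictionaries.values())
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def lookup(word: str, dictionaries: Dict[str, Set[str]]) -> List[str]:
    """Names of all dictionaries that contain the word."""
    return sorted(name for name, entries in dictionaries.items() if word in entries)


def match_result(word: str, dictionaries: Dict[str, Set[str]]) -> int:
    """Return 1, 2 or 3 for results ①, ② and ③ respectively."""
    hits = len(lookup(word, dictionaries))
    if hits == 0:
        return 1  # ① not in any dictionary
    if hits == 1:
        return 2  # ② in exactly one dictionary
    return 3      # ③ in two or more dictionaries


words = segment("我爱中国", DICTIONARIES)
results = [match_result(w, DICTIONARIES) for w in words]
```

With these toy dictionaries every word of the example sentence falls under result ②, while an ambiguous word such as "红色" would fall under result ③.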

Step 2: Structural feature analysis.

In this step, the structural feature of each word in the sentence is obtained according to the processing result of step 1, and the sentence is then assigned to one of three categories according to the structural feature distribution of its words:

① each word in the sentence exists in only one dictionary;

② more than one word in the sentence does not exist in any dictionary;

③ more than one word in the sentence exists in two or more dictionaries.

If the structural feature distribution result of the sentence belongs to category ①, step 4 is entered directly; if it belongs to category ② or ③, step 3 is entered.
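As a non-limiting illustration of step 2, the following Python sketch derives the sentence category from the number of dictionaries each word appears in and chooses the next step. It assumes that "more than one word" in categories ② and ③ can be read as "at least one word", so that every non-empty sentence falls into some category, and it allows a sentence to belong to both ② and ③; both readings are assumptions, as are the names `hit_counts`, `sentence_categories` and `next_step`.

```python
from typing import Dict, List, Set

# Hypothetical toy dictionaries (illustrative entries only).
DICTIONARIES: Dict[str, Set[str]] = {
    "place": {"中国"},
    "tv_series": {"红色"},
    "color": {"红色"},
    "common": {"我", "爱", "播放"},
}


def hit_counts(words: List[str], dictionaries: Dict[str, Set[str]]) -> List[int]:
    """For each word, count how many dictionaries contain it."""
    return [sum(word in entries for entries in dictionaries.values()) for word in words]


def sentence_categories(counts: List[int]) -> Set[int]:
    """Return the set of categories (1, 2, 3 for ①, ②, ③) the sentence belongs to."""
    categories = set()
    if counts and all(c == 1 for c in counts):
        categories.add(1)  # ① every word is in exactly one dictionary
    if any(c == 0 for c in counts):
        categories.add(2)  # ② some word is in no dictionary (assumed reading)
    if any(c >= 2 for c in counts):
        categories.add(3)  # ③ some word is in several dictionaries (assumed reading)
    return categories


def next_step(counts: List[int]) -> int:
    """Category ① goes straight to step 4; categories ② and ③ go to step 3 first."""
    return 4 if sentence_categories(counts) == {1} else 3
```

For example, a sentence whose words are all unambiguous dictionary entries is routed to step 4, whereas a sentence containing "红色" is routed to step 3 because that word appears in two of the toy dictionaries.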

Step 3: Sentence preprocessing.

The sentence is processed by the preprocessing module, including but not limited to stop word recognition, numerical value conversion and wrongly written character correction.
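As a non-limiting illustration of step 3, the following Python sketch applies the three named preprocessing operations to a list of words. The stop word list, the correction table, the digit-by-digit numeral conversion and the decision to drop recognized stop words are all assumptions made for illustration; the application only names the operations and does not specify how they are carried out.

```python
from typing import Dict, List, Set

# Hypothetical resources; a real system would load much larger tables.
STOP_WORDS: Set[str] = {"的", "了", "吧"}
CORRECTIONS: Dict[str, str] = {"红涩": "红色"}  # wrongly written -> corrected
DIGITS: Dict[str, str] = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
}


def convert_number(word: str) -> str:
    """Convert a word made only of Chinese digit characters to Arabic digits.

    Only digit-by-digit strings are handled; units such as 十 or 百 are left
    untouched in this sketch.
    """
    if word and all(ch in DIGITS for ch in word):
        return "".join(DIGITS[ch] for ch in word)
    return word


def preprocess(words: List[str]) -> List[str]:
    """Drop recognized stop words, correct known typos, then convert numerals."""
    cleaned = []
    for word in words:
        if word in STOP_WORDS:
            continue  # stop word recognized and removed (assumed handling)
        word = CORRECTIONS.get(word, word)
        cleaned.append(convert_number(word))
    return cleaned
```

After such preprocessing, a wrongly written word can again be matched against the dictionaries, which is one reason the sentence is cleaned before it reaches the sequence labeling model.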

Step 4: Sequence labeling.

If the previous step was step 2, that is, each word in the sentence exists in only one dictionary, the sentence enters the rule layer of the system. Since each word belongs to only one dictionary, the rule layer gives each word a corresponding label according to the type of that dictionary. For example, if "China" belongs only to the place-name dictionary, it is given the place-name entity label; if "I" belongs only to the common-word dictionary, it is given the common-word label.
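As a non-limiting illustration of the rule layer, the following Python sketch assigns each word the label of the single dictionary that contains it. The toy dictionaries, the label names and the function name `rule_layer` are hypothetical; raising an error for a word that is not in exactly one dictionary is an assumed safeguard, since such sentences are routed to step 3 instead.

```python
from typing import Dict, List, Set

# Hypothetical toy dictionaries and the label each dictionary type maps to.
DICTIONARIES: Dict[str, Set[str]] = {
    "place": {"中国"},
    "tv_series": {"红色"},
    "color": {"红色"},
    "common": {"我", "爱"},
}
LABEL_BY_DICTIONARY: Dict[str, str] = {
    "place": "LOC",
    "tv_series": "TV",
    "color": "COLOR",
    "common": "O",
}


def rule_layer(words: List[str], dictionaries: Dict[str, Set[str]],
               label_by_dictionary: Dict[str, str]) -> List[str]:
    """Label each word by the type of the one dictionary it belongs to."""
    labels = []
    for word in words:
        owners = [name for name, entries in dictionaries.items() if word in entries]
        if len(owners) != 1:
            # The rule layer only applies to category ① sentences.
            raise ValueError(f"{word!r} is in {len(owners)} dictionaries, expected 1")
        labels.append(label_by_dictionary[owners[0]])
    return labels


labels = rule_layer(["我", "爱", "中国"], DICTIONARIES, LABEL_BY_DICTIONARY)
```

Because no model is involved on this path, the labels for unambiguous sentences follow directly from the dictionaries.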

If the previous step was step 3, each word in the preprocessed sentence is labeled by the sequence labeling algorithm model, thereby determining its entity type. The sequence labeling model adopts a deep learning method; the specific model structure is shown in FIGS. 4-5 and can be divided into the following sub-steps:

Step 4.1:

Word features are constructed: each word in the sentence is vectorized to obtain a word vector.

Step 4.2:

Character features are constructed: a word may consist of one or more characters; each character is vectorized, and the character vectors are input to a CNN (Convolutional Neural Network) to obtain the character features.

Step 4.3:

Each word in the sentence is matched against the dictionaries to obtain its structural feature, and this feature is vectorized to obtain a structural feature vector.

Step 4.4:

The word vector, character features and structural feature vector are spliced to form the input of a BiLSTM (Bidirectional Long Short-Term Memory) model, which outputs the hidden-layer vector h.

Step 4.5:

The vector h is input to an FC (Fully Connected) layer, which outputs logits (in deep learning, a probability distribution that has not been normalized).

Step 4.6:

The logits are input to a CRF (Conditional Random Field) to generate the final entity labels (person name, place name, common word, etc.).
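As a non-limiting illustration of step 4.6, the following Python sketch decodes a label sequence from per-word logits with the Viterbi algorithm, which is a common way to obtain the best path from a linear-chain conditional random field; the application does not state which decoding procedure is used. The label set, the logits and the transition scores are hand-written toy values (in a trained model they would be learned), and the function name `viterbi` is hypothetical.

```python
from typing import List

LABELS = ["O", "LOC", "TV"]  # hypothetical label set


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label index sequence.

    emissions[t][j]   : logit of label j for word t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    back = []  # back[t][j] = best previous label for label j at word t + 1
    for emit in emissions[1:]:
        pointers, new_score = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        back.append(pointers)
        score = new_score
    # Follow the back-pointers from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    return path[::-1]


# Toy logits for a two-word sentence such as "播放 / 红色".
logits = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.9]]
no_transitions = [[0.0] * 3 for _ in range(3)]
# A transition bonus for O -> TV, standing in for a learned preference.
with_bonus = [[0.0, 0.0, 0.5], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

plain = [LABELS[i] for i in viterbi(logits, no_transitions)]
decoded = [LABELS[i] for i in viterbi(logits, with_bonus)]
```

In this toy case the per-word logits alone favor the common-word label for the second word, while the transition score shifts the decision to the TV-series label, which illustrates why the logits are passed through the conditional random field rather than being taken word by word.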

After the sentence passes through the step 4, the entity label corresponding to each word is judged through a rule layer or an algorithm model. At this point, the named entity identification process based on the structured information ends.

Compared with the prior art, the method has the following beneficial effects:

A Chinese named entity recognition method based on the structural information of characters and words is provided: a domain dictionary is constructed for word segmentation so as to ensure the accuracy of entity boundaries, and the semantic information contained in each character and word is analyzed from the structural features and then used as a basis for judging entities.

Referring to fig. 6, the present application further provides a computer device including a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of any one of the above methods when executing the computer program.

Referring to FIG. 7, the present application further provides a computer-readable storage medium, which is a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements any of the methods described above.

The present application further provides a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform any of the methods described above.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

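1. A named entity identification method based on structured information, characterized by comprising the following steps:

performing structured processing on a sentence and obtaining a processing result;

obtaining structural features according to the processing result; and

preprocessing the sentence or performing sequence labeling on it according to the distribution of the structural features.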

2. The method for named entity recognition based on structured information as claimed in claim 1, wherein the structured processing and obtaining the processing result comprises: utilizing a database to perform word segmentation processing on the sentence, performing matching retrieval on each word in the sentence, and obtaining three processing results aiming at each word: 1) the word is not present in any one of the databases; 2) the word exists in only one database; 3) the word is stored in two or more databases.

3. The named entity recognition method based on structured information as claimed in claim 2, wherein obtaining the structured features according to the processing result comprises:

4. The method for named entity recognition based on structured information as claimed in claim 3, wherein the sentence preprocessing or sequence labeling according to the distribution of the structured features comprises:

5. The named entity recognition method based on structured information as claimed in claim 4, wherein the sequence annotation model is applied to a deep learning method, comprising:

6. The method for named entity recognition based on structured information as claimed in claim 5, wherein the preprocessing comprises: stop word recognition, numerical value conversion and wrongly written character correction.

7. A named entity recognition system based on structured information, comprising: a preprocessing module, a structural processing module, a rule layer module and a sequence labeling model module,

8. A computer device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of any one of claims 1-6 when executing the computer program.

9. A computer-readable storage medium, a non-transitory readable storage medium, having stored therein a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-6.

10. A computer program product comprising computer readable code that, when executed by a computer device, causes the computer device to perform the method of any of claims 1-6.