CN112487816A - Named entity identification method based on network classification - Google Patents


Info

Publication number
CN112487816A
CN112487816A (application CN202011472395.7A; granted publication CN112487816B)
Authority
CN
China
Prior art keywords
named entity
individual
sample
classification
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011472395.7A
Other languages
Chinese (zh)
Other versions
CN112487816B (en)
Inventor
苏延森
张宽宏
程凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202011472395.7A priority Critical patent/CN112487816B/en
Publication of CN112487816A publication Critical patent/CN112487816A/en
Application granted granted Critical
Publication of CN112487816B publication Critical patent/CN112487816B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a named entity identification method based on network classification. Training the named entity classification model comprises the following steps: step 1: inputting named entity training sample text data and converting it into vector data; step 2: preprocessing the named entity training sample data; step 3: constructing a classification network and training the named entity recognition model by iteratively selecting partial samples. Named entity recognition comprises: step 4: inputting the sample data of the named entity to be identified; step 5: preprocessing the sample data of the named entity to be identified; step 6: identifying the sample data of the named entity to be identified through the named entity classification model, and judging the category of the named entity to which it belongs. The method can quickly and effectively extract the key attributes of named entities from massive texts and identify the category of each entity, improves the efficiency of named entity identification, and provides a basis for information extraction, question answering systems, syntactic analysis, machine translation and the like.

Description

Named entity identification method based on network classification
Technical Field
The invention relates to the field of natural language processing technology and named entity identification, in particular to a named entity identification method based on network classification.
Background
Named Entity Recognition (NER), also called "proper name recognition", refers to recognizing entities with specific meaning in text, mainly including names of people, places, organizations, proper nouns and the like. It generally comprises two parts: (1) identifying entity boundaries; (2) determining entity categories (person name, place name, organization name, or others). NER is a fundamental key task in NLP. In the natural language processing pipeline, NER can be regarded as part of unknown-word recognition in lexical analysis; among unknown words it is the category with the largest number, the greatest recognition difficulty and the greatest influence on the word segmentation effect. Meanwhile, NER is also the basis of many NLP tasks such as relation extraction, event extraction, knowledge graphs, machine translation and question answering systems.
Named entity recognition is urgently needed by information extraction tasks in actual production, but named entities are unbounded in number, flexible in word formation and fuzzy in category, which makes them difficult to recognize. Traditional classification algorithms only take into account physical characteristics of the data (such as similarity, distance and distribution) and do not take into account semantic characteristics (such as the contextual semantic information that may be present in text).
Traditional classification learning methods, such as SVM and some other network-based classification algorithms, require all of the training data in practical implementations, and the noise present in such a large amount of data reduces the efficiency of named entity recognition.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a named entity identification method based on network classification, so that a classification network can be constructed from a selected part of the named entity recognition samples and used to identify the named entity samples to be detected, thereby improving the recognition efficiency of named entities and providing technical support for information extraction, question answering systems, syntactic analysis, machine translation and the like.
In order to achieve the purpose, the invention adopts the technical scheme that:
the invention relates to a named entity recognition method based on network classification, which is characterized by comprising the following steps:
the method comprises the following steps: training a named entity classification model:
step 1.1: obtaining the text data of T named entity samples, and converting the text data into vector data Ψ = ((x_1, y_1), (x_2, y_2), …, (x_t, y_t), …, (x_T, y_T)) by using the Word2Vec natural language processing tool, where (x_t, y_t) denotes the vector data of the t-th named entity sample, x_t = (x_t^1, x_t^2, …, x_t^d, …, x_t^D) denotes the attribute features of the t-th named entity sample, x_t^d denotes the d-th attribute feature of the t-th named entity sample, and y_t denotes the label of the t-th named entity sample, t = 1, 2, …, T;
step 1.2: standardizing the attribute features x_t of the t-th named entity sample to obtain the feature vector x̄_t = (x̄_t^1, x̄_t^2, …, x̄_t^d, …, x̄_t^D) of the t-th named entity sample, where x̄_t^d denotes the d-th standardized feature of the t-th named entity sample;
step 1.3: respectively constructing two objective functions f_1 and f_2 by using equation (1) and equation (2):
min f_1 = Rr(V_s)   (1)
min f_2 = −Acc(G(V_s))   (2)
In equation (1), V_s is the vector data selected from the T vector data Ψ, and Rr(V_s) denotes the ratio of the selected vector data V_s to the T vector data Ψ;
In equation (2), G(V_s) denotes the classification network constructed by using the selected vector data V_s, and Acc(G(V_s)) denotes the classification accuracy of the classification network G(V_s);
step 1.4: taking a set of S candidate vector data subsets of the named entity samples as the initial population P = {p_1, …, p_S}, where p_S denotes the S-th candidate vector data subset of named entity samples, each subset being regarded as an individual; encoding the initial population P with binary codes of length T, so that if the i-th bit of the binary code of an individual p_S is 1, the attribute features x_i of the i-th named entity sample are selected and used to construct the classification network G(p_S);
Step 1.5: defining the current iteration times as N and the maximum iteration times as N; and initializing n-1; taking the initial population P as the parent population P of the nth iterationn
Step 1.6: parent population P iterated from nth through binary championshipsnIn which two individuals p are randomly selectedxAnd pyAnd respectively construct a classification network
Figure BDA0002834423140000028
And
Figure BDA0002834423140000029
if classifying the network
Figure BDA00028344231400000210
Higher accuracy than classification networks
Figure BDA00028344231400000211
The parent population P from the nth iterationnAcquiring higher than classified networks
Figure BDA00028344231400000212
All individuals of precision and randomly selecting an individual p from themz(ii) a For individual pyAnd pzPerforming cross mutation to obtain mutated individual p'yAnd p'z(ii) a From an individual py、p′yAnd'zThe individual with the highest classification network precision is selected to replace the individual py(ii) a Finally by the replaced individual pyWith the individual pxPerforming cross mutation to generate the offspring P of the nth iteration′n
Step 1.7: the parent population P of the nth iterationnAnd the child P of the nth iteration′nMerging to obtain a merged population of the nth iteration, and obtaining any individual p in the merged population of the nth iteration by using a formula (3)nImportance of (i) IMP (p)n):
IMP(pn)=α×Acc(pn)+(1-α)×(-Red(pn)) (3)
In the formula (3), alpha is a compromise factor Acc (p)n) Is an individual pnPrecision of (1), Red (p)n) Is an individual pnAnd has:
Red(pn)=(a1×b1+a2×b2+...+ai×bi+...+am×bm)/m (4)
in the formula (4), m is the nth timeIterative merging of populations except individual pnNumber of individuals other than; a isiIs an individual pnWith the n-th iteration dividing individual p in the combined populationnRedundancy of the i-th individual out of the others in source space, and by the individual pnThe number of samples of the same named entity as the ith individual chosen is divided by T, i ∈ { 1., m }; biIs an individual pnThe redundancy in the precision target space with the ith individual is obtained by equation (5):
Figure BDA0002834423140000031
in the formula (5), Acc (i) represents the accuracy of the classification network constructed by the ith individual, Acc (p)n) Representing an individual pnThe accuracy of the constructed classification network;
step 1.8: obtaining the importance of every individual p_n in the merged population of the n-th iteration according to equation (3), and selecting the top S individuals as the parent population P_{n+1} of the (n+1)-th iteration;
Step 1.9: assigning N +1 to N, judging whether N is greater than N, if so, selecting vector data of a named entity sample corresponding to an individual with the highest classification network precision in the parent population of the Nth iteration and using the vector data to construct an optimal network classifier, and executing the step two, otherwise, returning to the step 1.6 to execute;
step two: named entity recognition:
step 2.1: inputting text data of a named entity sample to be identified, processing according to the step 1.1 and the step 1.2, and obtaining a feature vector of the sample to be detected;
step 2.3: classifying the feature vectors of the samples to be detected by using the optimal network classifier, wherein the obtained labels represent the named entities corresponding to the samples to be detected.
The named entity recognition method based on network classification is characterized in that the classification network G(V_s) is constructed as a k-associative optimal graph based on the Euclidean distance of equation (6), as follows:
for the feature vectors, obtaining the Euclidean distance d_ti between the feature vector of the t-th named entity sample and the feature vector of the i-th named entity sample by using equation (6), and selecting the k nearest named entities of the same category to establish network connections, thereby forming the classification network:
d_ti = sqrt( Σ_{d=1}^{D} (x̄_t^d − x̄_i^d)² )   (6)
In equation (6), x̄_t^d denotes the d-th feature of the feature vector of the t-th named entity sample.
Compared with the prior art, the invention has the beneficial effects that:
1. Different from traditional classification methods, the invention provides a named entity identification method based on network classification, which comprehensively considers the physical and semantic characteristics of the named entity sample data, constructs a classification network by screening the training sample data of the named entities, and eliminates noise points, so that named entities can be identified more efficiently.
2. The invention defines a two-objective optimization problem over the number of samples in the selected named entity recognition sample set and the classification accuracy of the network constructed from that set; by optimizing these two objectives, high-quality named entity sample data are selected and a classification network with a better classification effect is constructed, thereby improving the performance and accuracy of named entity recognition.
3. In the iteration process, a solution generation strategy based on accuracy preference is adopted: low-accuracy named entity recognition sample sets are guided by accuracy to obtain better offspring, which effectively improves the quality of the classification network to be constructed, so that the classifier finally used for named entity recognition has a better classification effect and higher recognition accuracy.
4. In the process of selecting the next-generation named entity recognition sample sets, an importance-based solution selection strategy is adopted: all named entity recognition sample sets are ranked by importance and the better ones are selected to enter the next generation, which ensures continuous optimization during the iteration, so that the classifier finally used for named entity recognition has a better classification effect and better performance.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In this embodiment, a method for identifying a named entity based on network classification includes a step of training a named entity classification model and a step of identifying the named entity, and specifically, as shown in fig. 1, the method includes the following steps:
the method comprises the following steps: training a named entity classification model:
step 1.1: taking named entity recognition of person names as an example, obtaining the text data of T named entity samples, and converting the text data into vector data Ψ = ((x_1, y_1), (x_2, y_2), …, (x_t, y_t), …, (x_T, y_T)) by using the Word2Vec natural language processing tool, where (x_t, y_t) denotes the vector data of the t-th named entity sample, x_t = (x_t^1, x_t^2, …, x_t^d, …, x_t^D) denotes the attribute features of the t-th named entity sample, and x_t^d denotes the d-th attribute feature of the t-th named entity sample, i.e. an attribute describing the t-th person name, common attributes being date of birth, native place, height, weight, nickname, main contributions and the like; y_t denotes the label of the t-th named entity sample, i.e. the mark that the named entity belongs to a certain category, here a person name; the named entity recognition problem is thus converted into a multi-class classification problem, in which the label y_t describes the person name represented in the t-th named entity sample, t = 1, 2, …, T;
step 1.2: standardizing the attribute features x_t of the t-th named entity sample to obtain the feature vector x̄_t = (x̄_t^1, x̄_t^2, …, x̄_t^d, …, x̄_t^D) of the t-th named entity sample, where x̄_t^d denotes the d-th standardized feature of the t-th named entity sample;
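As an illustration of steps 1.1 and 1.2 (not part of the original disclosure), a minimal sketch of converting sample text into vectors with Word2Vec and then standardizing them is given below; the gensim API, the mean pooling of token vectors and the z-score standardization are assumptions, since the patent does not fix a concrete implementation.

```python
# Sketch of steps 1.1-1.2 (assumed implementation): text -> Word2Vec vectors -> standardized features.
import numpy as np
from gensim.models import Word2Vec

def build_samples(texts, labels, dim=100):
    """texts: list of token lists, one per named entity sample; labels: category of each sample."""
    w2v = Word2Vec(sentences=texts, vector_size=dim, min_count=1, workers=1)
    # x_t: mean of the token vectors of the t-th sample (one plausible pooling choice)
    X = np.array([np.mean([w2v.wv[tok] for tok in toks], axis=0) for toks in texts])
    # Step 1.2: standardize every attribute feature (z-score is an assumed scheme)
    X_bar = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    return X_bar, np.asarray(labels)
```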
step 1.3: respectively constructing two objective functions f_1 and f_2 by using equation (1) and equation (2), both of which are to be minimized:
min f_1 = Rr(V_s)   (1)
min f_2 = −Acc(G(V_s))   (2)
In equation (1), V_s is the vector data selected from the T vector data Ψ, and Rr(V_s) denotes the ratio of the selected vector data V_s to the T vector data Ψ;
In equation (2), G(V_s) denotes the classification network constructed by using the selected vector data V_s, and Acc(G(V_s)) denotes the classification accuracy of the classification network G(V_s);
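A hedged sketch of the two objectives of step 1.3 follows: f_1 is the fraction of the T training samples that an individual selects, and f_2 is the negative accuracy of the classification network built from that selection. The helper names build_network and evaluate are placeholders for the k-associative graph classifier described further below, and evaluating on the full data is an assumption.

```python
import numpy as np

def f1_ratio(mask):
    """f_1 = Rr(V_s): proportion of the T samples selected by the binary code."""
    return mask.sum() / mask.size

def f2_neg_accuracy(mask, X, y, build_network, evaluate):
    """f_2 = -Acc(G(V_s)): negative accuracy of the network built on the selection.
    build_network / evaluate are placeholders for the k-associative graph classifier."""
    sel = mask.astype(bool)
    net = build_network(X[sel], y[sel])
    return -evaluate(net, X, y)
```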
step 1.4: taking a set of S candidate vector data subsets of the named entity samples as the initial population P = {p_1, …, p_S}, where p_S denotes the S-th candidate vector data subset of named entity samples, each subset being regarded as an individual; encoding the initial population P with binary codes of length T, so that if the i-th bit of the binary code of an individual p_S is 1, the attribute features x_i of the i-th named entity sample are selected and used to construct the classification network G(p_S). For example, assume that there are 10 named entity samples in total and that bits 3, 5, 8 and 9 of p_S are 1; then the named entity recognition sample set selected by p_S is (x_3, x_5, x_8, x_9);
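The binary encoding of step 1.4 can be sketched as follows (illustrative only): each individual is a 0/1 vector of length T whose set bits name the selected training samples; the worked example with bits 3, 5, 8 and 9 reproduces the selection (x_3, x_5, x_8, x_9).

```python
import numpy as np

rng = np.random.default_rng(0)

def init_population(S, T):
    """Initial population P = {p_1, ..., p_S}: S random binary codes of length T."""
    return rng.integers(0, 2, size=(S, T))

def selected_samples(individual, X):
    """Decode an individual: the i-th sample is selected when its i-th bit is 1."""
    return X[individual.astype(bool)]

# Worked example from the description: T = 10, bits 3, 5, 8 and 9 set (1-based)
p = np.zeros(10, dtype=int)
p[[2, 4, 7, 8]] = 1           # 0-based positions of samples x_3, x_5, x_8, x_9
print(np.flatnonzero(p) + 1)  # -> [3 5 8 9]
```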
Step 1.5: defining the current iteration times as N and the maximum iteration times as N; and initializing n-1; taking the initial population P as the parent population P of the nth iterationn
Step 1.6: parent population P iterated from nth through binary championshipsnIn which two individuals p are randomly selectedxAnd pyAnd respectively construct a classification network
Figure BDA0002834423140000058
And
Figure BDA0002834423140000059
classifying by using the constructed network; if classifying the network
Figure BDA00028344231400000510
Higher accuracy than classification networks
Figure BDA00028344231400000511
The parent population P from the nth iterationnAcquiring higher than classified networks
Figure BDA00028344231400000512
All individuals of precision and randomly selecting an individual p from themz(ii) a For individual pyAnd pzPerforming cross mutation to obtain mutated individual p'yAnd p'z(ii) a From an individual py、p′yAnd p'zThe individual with the highest classification network precision is selected to replace the individual pyThus, poor ones of the two are guided and excellent guided individuals are obtained; finally by the replaced individual pyWith the individual pxPerforming cross mutation to generate the offspring P of the nth iteration′n
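One possible reading of the accuracy-preference offspring generation of step 1.6 is sketched below; the concrete operators (uniform crossover, bit-flip mutation) and the function acc, which returns the accuracy of the network built from an individual, are assumptions, since the patent only speaks of "cross mutation".

```python
import numpy as np

rng = np.random.default_rng(1)

def crossover_mutate(a, b, pm=0.01):
    """Assumed operators: uniform crossover followed by bit-flip mutation."""
    mask = rng.integers(0, 2, size=a.size).astype(bool)
    child = np.where(mask, a, b)
    flip = rng.random(a.size) < pm
    return np.where(flip, 1 - child, child)

def make_offspring(parents, acc):
    """acc(individual) -> accuracy of the classification network built from it."""
    offspring = []
    for _ in range(len(parents)):
        px, py = parents[rng.choice(len(parents), size=2, replace=False)]
        if acc(px) > acc(py):                       # guide the weaker of the two
            better = [p for p in parents if acc(p) > acc(py)]
            pz = better[rng.integers(len(better))]
            py_m, pz_m = crossover_mutate(py, pz), crossover_mutate(pz, py)
            py = max([py, py_m, pz_m], key=acc)     # keep the best variant of p_y
        offspring.append(crossover_mutate(px, py))
    return np.array(offspring)
```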
Step 1.7: the parent population P of the nth iterationnAnd the child P of the nth iteration′nMerging to obtain a merged population of the nth iteration, and obtaining any individual p in the merged population of the nth iteration by using a formula (3)nImportance of (i) IMP (p)n):
IMP(pn)=α×Acc(pn)+(1-α)×(-Red(pn)) (3)
In formula (3), α is a compromise factor, usually 0.8, Acc (p)n) Is an individual pnPrecision of (1), Red (p)n) Is an individual pnThe importance obtained by integrating the accuracy and the redundancy has a more balanced evaluation on the individuals, and the method comprises the following steps:
Red(pn)=(a1×b1+a2×b2+...+ai×bi+...+am×bm)/m (4)
in the formula (4), m is the dividing individual p in the combined population of the nth iterationnNumber of individuals other than; a isiIs an individual pnWith the n-th iteration dividing individual p in the combined populationnRedundancy of the i-th individual out of the others in source space, and by the individual pnThe number of samples of the same named entity as the ith individual chosen, i ∈ { 1., m }, a, is divided by T to yieldiThe larger the indication of an individual pnThe higher the redundancy in source space with the individual i; biIs an individual pnThe redundancy of the ith individual in the precision target space is combined with the redundancy of the source space and the precision target space, the redundancy analysis of each individual is clear and reasonable, and the judgment effect on the subsequent importance is larger, so thatFormula (5) is obtained:
Figure BDA0002834423140000061
in the formula (5), Acc (i) represents the accuracy of the classification network constructed by the ith individual, Acc (p)n) Representing an individual pnAccuracy of the constructed classification network, biThe larger the indication of an individual pnThe higher the spatial redundancy with the individual i at the precision target;
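The importance measure of equations (3) and (4) can be sketched as follows. The source-space redundancy a_i (shared selected samples divided by T) and the weight α = 0.8 follow the description; the accuracy-space redundancy b_i of equation (5) is rendered only as an assumed closeness term 1 − |Acc(i) − Acc(p_n)|, because the exact formula is given as an image in the original.

```python
import numpy as np

def importance(pn, pn_acc, others, others_acc, T, alpha=0.8):
    """IMP(p_n) = alpha*Acc(p_n) + (1 - alpha)*(-Red(p_n)), equations (3)-(4)."""
    red_terms = []
    for pi, acc_i in zip(others, others_acc):
        a_i = np.logical_and(pn, pi).sum() / T    # shared selected samples / T (source space)
        b_i = 1.0 - abs(acc_i - pn_acc)           # ASSUMED stand-in for equation (5)
        red_terms.append(a_i * b_i)
    red = float(np.mean(red_terms))               # equation (4)
    return alpha * pn_acc + (1 - alpha) * (-red)  # equation (3)
```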
step 1.8: obtaining the importance of every individual p_n in the merged population of the n-th iteration according to equation (3), and selecting the top S individuals as the parent population P_{n+1} of the (n+1)-th iteration;
Step 1.9: assigning N +1 to N, judging whether N is greater than N, if so, selecting vector data of a named entity sample corresponding to an individual with the highest classification network precision in the parent population of the Nth iteration and using the vector data to construct an optimal network classifier, and executing the step two, otherwise, returning to the step 1.6 to execute;
step two: named entity recognition, namely classifying the samples to be detected by using the optimal network classifier obtained in step one:
step 2.1: inputting the text data of the named entity samples to be identified, processing it according to step 1.1 and step 1.2, and obtaining the feature vectors of the samples to be detected, where common features include date of birth, native place, height, weight, nickname, main contributions and the like;
step 2.3: classifying the feature vectors of the samples to be detected by using the optimal network classifier, wherein the obtained labels represent the named entities corresponding to the samples to be detected.
2. The named entity recognition method based on network classification according to claim 1, characterized in that the classification network G(V_s) is constructed as a k-associative optimal graph based on the Euclidean distance of equation (6), as follows:
for the feature vectors, obtaining the Euclidean distance d_ti between the feature vector of the t-th named entity sample and the feature vector of the i-th named entity sample by using equation (6), and selecting the k nearest named entities of the same category to establish network connections, thereby forming the classification network:
d_ti = sqrt( Σ_{d=1}^{D} (x̄_t^d − x̄_i^d)² )   (6)
In equation (6), x̄_t^d denotes the d-th feature of the feature vector of the t-th named entity sample.
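A sketch of the k-associative optimal graph of equation (6) follows: each training vector is connected to its k nearest neighbours of the same category under the Euclidean distance d_ti. Representing the network as an adjacency list and the default k = 3 are implementation choices, not something fixed by the patent.

```python
import numpy as np

def build_k_associative_graph(X_bar, y, k=3):
    """Connect each sample to its k nearest same-category samples (equation (6))."""
    n = len(X_bar)
    adjacency = {t: [] for t in range(n)}
    for t in range(n):
        same = [i for i in range(n) if i != t and y[i] == y[t]]
        dists = [np.sqrt(np.sum((X_bar[t] - X_bar[i]) ** 2)) for i in same]  # d_ti
        for i in np.array(same)[np.argsort(dists)[:k]]:
            adjacency[t].append(int(i))
    return adjacency
```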
The method is tested and verified by objectively collected data.
1) Acquiring the text data of named entity samples related to person names, i.e. sentences or paragraphs concerning person names in documents, converting the real-world text data into vector data that a computer can process by using the Word2Vec tool, dividing the processed data set into training samples and test samples, selecting the optimal training samples through ten-fold cross validation to construct the classification network, and carrying out named entity recognition on the test samples.
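The ten-fold cross validation used here to choose the training samples could be organized as below; scikit-learn's KFold is an assumed convenience, and fit and score stand for building the classifier from a training fold and measuring its accuracy on the held-out fold.

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_accuracy(X, y, fit, score):
    """Average accuracy over ten folds; fit builds the classifier, score evaluates it."""
    accs = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        clf = fit(X[train_idx], y[train_idx])
        accs.append(score(clf, X[test_idx], y[test_idx]))
    return float(np.mean(accs))
```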
2) Evaluation index;
The classification accuracy is used as the evaluation index of this example to evaluate the performance of named entity recognition: the higher the accuracy, the better the classification effect and the higher the recognition accuracy.
3) Experiments on the data set;
The effectiveness of the invention is verified by the experimental results on the data set. In today's environment of highly diversified information, accurately and efficiently identifying named entities from text and analyzing them is particularly important. Experiments show that the method can quickly and effectively extract the key attributes of named entities from massive texts and identify the categories of the entities, improves the efficiency of named entity recognition, and provides a basis for information extraction, question answering systems, syntactic analysis, machine translation and the like.

Claims (2)

1. A named entity recognition method based on network classification is characterized by comprising the following steps:
the method comprises the following steps: training a named entity classification model:
step 1.1: obtaining the text data of T named entity samples, and converting the text data into vector data Ψ = ((x_1, y_1), (x_2, y_2), …, (x_t, y_t), …, (x_T, y_T)) by using the Word2Vec natural language processing tool, where (x_t, y_t) denotes the vector data of the t-th named entity sample, x_t = (x_t^1, x_t^2, …, x_t^d, …, x_t^D) denotes the attribute features of the t-th named entity sample, x_t^d denotes the d-th attribute feature of the t-th named entity sample, and y_t denotes the label of the t-th named entity sample, t = 1, 2, …, T;
step 1.2: standardizing the attribute features x_t of the t-th named entity sample to obtain the feature vector x̄_t = (x̄_t^1, x̄_t^2, …, x̄_t^d, …, x̄_t^D) of the t-th named entity sample, where x̄_t^d denotes the d-th standardized feature of the t-th named entity sample;
step 1.3: respectively constructing two objective functions f_1 and f_2 by using equation (1) and equation (2):
min f_1 = Rr(V_s)   (1)
min f_2 = −Acc(G(V_s))   (2)
In equation (1), V_s is the vector data selected from the T vector data Ψ, and Rr(V_s) denotes the ratio of the selected vector data V_s to the T vector data Ψ;
In equation (2), G(V_s) denotes the classification network constructed by using the selected vector data V_s, and Acc(G(V_s)) denotes the classification accuracy of the classification network G(V_s);
step 1.4: taking a set of S candidate vector data subsets of the named entity samples as the initial population P = {p_1, …, p_S}, where p_S denotes the S-th candidate vector data subset of named entity samples, each subset being regarded as an individual; encoding the initial population P with binary codes of length T, so that if the i-th bit of the binary code of an individual p_S is 1, the attribute features x_i of the i-th named entity sample are selected and used to construct the classification network G(p_S);
Step 1.5: defining the current iteration times as N and the maximum iteration times as N; and initializing n-1; taking the initial population P as the parent population P of the nth iterationn
Step 1.6: parent population P iterated from nth through binary championshipsnIn which two individuals p are randomly selectedxAnd pyAnd are constructed separatelyBuilding classification networks
Figure FDA00028344231300000110
And
Figure FDA00028344231300000111
if classifying the network
Figure FDA00028344231300000112
Higher accuracy than classification networks
Figure FDA00028344231300000113
The parent population P from the nth iterationnAcquiring higher than classified networks
Figure FDA0002834423130000021
All individuals of precision and randomly selecting an individual p from themz(ii) a For individual pyAnd pzPerforming cross mutation to obtain mutated individual p'yAnd p'z(ii) a From an individual py、p′yAnd p'zThe individual with the highest classification network precision is selected to replace the individual py(ii) a Finally by the replaced individual pyWith the individual pxPerforming cross-mutation to generate offspring P 'of n iteration'n
Step 1.7: the parent population P of the nth iterationnAnd child P 'of nth iteration'nMerging to obtain a merged population of the nth iteration, and obtaining any individual p in the merged population of the nth iteration by using a formula (3)nImportance of (i) IMP (p)n):
IMP(pn)=α×Acc(pn)+(1-α)×(-Red(pn)) (3)
In the formula (3), alpha is a compromise factor Acc (p)n) Is an individual pnPrecision of (1), Red (p)n) Is an individual pnAnd has:
Red(pn)=(a1×b1+a2×b2+...+ai×bi+...+am×bm)/m (4)
in the formula (4), m is the dividing individual p in the combined population of the nth iterationnNumber of individuals other than; a isiIs an individual pnWith the n-th iteration dividing individual p in the combined populationnRedundancy of the i-th individual out of the others in source space, and by the individual pnThe number of samples of the same named entity as the ith individual chosen is divided by T, i ∈ { 1., m }; biIs an individual pnThe redundancy in the precision target space with the ith individual is obtained by equation (5):
Figure FDA0002834423130000022
in the formula (5), Acc (i) represents the accuracy of the classification network constructed by the ith individual, Acc (p)n) Representing an individual pnThe accuracy of the constructed classification network;
step 1.8: obtaining the importance of every individual p_n in the merged population of the n-th iteration according to equation (3), and selecting the top S individuals as the parent population P_{n+1} of the (n+1)-th iteration;
Step 1.9: assigning N +1 to N, judging whether N is greater than N, if so, selecting vector data of a named entity sample corresponding to an individual with the highest classification network precision in the parent population of the Nth iteration and using the vector data to construct an optimal network classifier, and executing the step two, otherwise, returning to the step 1.6 to execute;
step two: named entity recognition:
step 2.1: inputting text data of a named entity sample to be identified, processing according to the step 1.1 and the step 1.2, and obtaining a feature vector of the sample to be detected;
step 2.3: classifying the feature vectors of the samples to be detected by using the optimal network classifier, wherein the obtained labels represent the named entities corresponding to the samples to be detected.
2. The named entity recognition method based on network classification as claimed in claim 1, characterized in that the classification network G(V_s) is constructed as a k-associative optimal graph based on the Euclidean distance of equation (6), as follows:
for the feature vectors, obtaining the Euclidean distance d_ti between the feature vector of the t-th named entity sample and the feature vector of the i-th named entity sample by using equation (6), and selecting the k nearest named entities of the same category to establish network connections, thereby forming the classification network:
d_ti = sqrt( Σ_{d=1}^{D} (x̄_t^d − x̄_i^d)² )   (6)
In equation (6), x̄_t^d denotes the d-th feature of the feature vector of the t-th named entity sample.
CN202011472395.7A 2020-12-14 2020-12-14 Named entity identification method based on network classification Active CN112487816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011472395.7A CN112487816B (en) 2020-12-14 2020-12-14 Named entity identification method based on network classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011472395.7A CN112487816B (en) 2020-12-14 2020-12-14 Named entity identification method based on network classification

Publications (2)

Publication Number Publication Date
CN112487816A true CN112487816A (en) 2021-03-12
CN112487816B CN112487816B (en) 2024-02-13

Family

ID=74916987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011472395.7A Active CN112487816B (en) 2020-12-14 2020-12-14 Named entity identification method based on network classification

Country Status (1)

Country Link
CN (1) CN112487816B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007137487A1 (en) * 2006-05-15 2007-12-06 Panasonic Corporation Method and apparatus for named entity recognition in natural language
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation
WO2018072351A1 (en) * 2016-10-20 2018-04-26 北京工业大学 Method for optimizing support vector machine on basis of particle swarm optimization algorithm
CN109581339A (en) * 2018-11-16 2019-04-05 西安理工大学 A kind of sonar recognition methods based on brainstorming adjust automatically autoencoder network
CN110162795A (en) * 2019-05-30 2019-08-23 重庆大学 A kind of adaptive cross-cutting name entity recognition method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯艳红; 于红; 孙庚; 孙娟娟: "Named entity recognition method based on BLSTM" (基于BLSTM的命名实体识别方法), Computer Science (计算机科学), No. 02, 16 May 2017 (2017-05-16) *

Also Published As

Publication number Publication date
CN112487816B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN108399158B (en) Attribute emotion classification method based on dependency tree and attention mechanism
Qu et al. Question answering over freebase via attentive RNN with similarity matrix based CNN
CN109635108B (en) Man-machine interaction based remote supervision entity relationship extraction method
CN107491531A (en) Chinese network comment sensibility classification method based on integrated study framework
CN112883732A (en) Method and device for identifying Chinese fine-grained named entities based on associative memory network
CN111462752B (en) Attention mechanism, feature embedding and BI-LSTM (business-to-business) based customer intention recognition method
CN112784013B (en) Multi-granularity text recommendation method based on context semantics
CN110909116B (en) Entity set expansion method and system for social media
CN108563638A (en) A kind of microblog emotional analysis method based on topic identification and integrated study
CN110046356B (en) Label-embedded microblog text emotion multi-label classification method
CN110910175B (en) Image generation method for travel ticket product
CN112417132B (en) New meaning identification method for screening negative samples by using guest information
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN110992988A (en) Speech emotion recognition method and device based on domain confrontation
CN114611491A (en) Intelligent government affair public opinion analysis research method based on text mining technology
CN111159405B (en) Irony detection method based on background knowledge
CN115935998A (en) Multi-feature financial field named entity identification method
CN113222059B (en) Multi-label emotion classification method using cooperative neural network chain
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
CN117171413B (en) Data processing system and method for digital collection management
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism
CN110245234A (en) A kind of multi-source data sample correlating method based on ontology and semantic similarity
CN112397201B (en) Intelligent inquiry system-oriented repeated sentence generation optimization method
Wu et al. Inferring users' emotions for human-mobile voice dialogue applications
CN112487816B (en) Named entity identification method based on network classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant