CN116029299B

CN116029299B - Named entity recognition method, system and storage medium based on polysemous words

Info

Publication number: CN116029299B
Application number: CN202310323101.1A
Authority: CN
Inventors: 张广志; 成立立; 于笑博; 刘增礼
Original assignee: Beijing Rongxin Datalnfo Science and Technology Ltd
Current assignee: Beijing Rongxin Datalnfo Science and Technology Ltd
Priority date: 2023-03-30
Filing date: 2023-03-30
Publication date: 2023-06-30
Anticipated expiration: 2043-03-30
Also published as: CN116029299A

Abstract

The invention provides a named entity recognition method, system and storage medium based on polysemous words. The method comprises the following steps: preparing a sample data set for training; training an entity tag prediction model with the sample data set; acquiring the source field, generation time and context information of all sample data corresponding to a polysemous word and, for each sample data item, packaging them into a judgment influence factor of the corresponding semantics; analyzing the correspondence between the judgment influence factors and the polysemous word, based on the judgment influence factor of each semantics and the sample entity tag of the corresponding sample data; performing deep learning on the entity tag prediction model based on that correspondence and optimizing its parameters; and acquiring the source field, generation time and context information of the data to be processed and outputting the corresponding entity tag through the optimized entity tag prediction model. The invention enables accurate named entity recognition on text containing polysemous words.

Description

Named entity recognition method, system and storage medium based on polysemous words

Technical Field

The invention relates to the technical field of named entity recognition, in particular to a method, a system and a storage medium for recognizing a named entity based on polysemous words.

Background

Named entity recognition (NER) is a task in the field of natural language processing (NLP) that aims to identify entities in text and classify them into predefined entity types, such as person names, place names and institution names. Named entity recognition can be used on its own as an information extraction tool, and it also plays an important role in other natural language processing tasks and applications, such as information retrieval, automatic text summarization, question answering, machine translation and knowledge base construction.

The current mainstream method for named entity recognition is Bi-LSTM+CRF. The Bi-LSTM (bidirectional long short-term memory network) is a deep neural network widely used in deep learning; in named entity recognition it learns the contextual relationships between features in a long sequence. The CRF (conditional random field) is a traditional machine learning method; in named entity recognition it learns the contextual relationships between labels.

However, some words have different semantics in different fields, and traditional named entity recognition methods have difficulty recognizing such polysemous words accurately.

Disclosure of Invention

In order to solve at least one of the above technical problems, the invention provides a named entity recognition method, system and storage medium based on polysemous words, which can realize accurate named entity recognition of polysemous words.

The first aspect of the invention provides a named entity recognition method based on polysemous words, which comprises the following steps:

preparing a sample data set for training, counting all ambiguities, and modifying the sample data set;

constructing an entity tag prediction model based on polysemous words;

analyzing a loss function of an entity tag prediction model based on the polysemous word;

training the entity tag prediction model through sample data in a sample data set, and obtaining an optimized entity tag prediction model after training is finished;

in the training process, presetting that a given polysemous word has multiple semantics, acquiring the source field, generation time and context information of all sample data corresponding to the polysemous word, and, for each sample data item, packaging them into a judgment influence factor of the corresponding semantics;

based on the judgment influence factors of each semantic and the sample entity labels of the corresponding sample data, analyzing the corresponding relation between the judgment influence factors and the polysemous words;

deep learning is carried out on the entity tag prediction model based on the corresponding relation, and parameters of the entity tag prediction model are optimized;

when the entity label is predicted for the data to be processed, the source field, the generation time and the context information of the data to be processed are obtained, and the corresponding entity label is output through the optimized entity label prediction model.

In this scheme, outputting the corresponding entity tag through the optimized entity tag prediction model specifically includes:

setting an entity tag prediction model comprising a plurality of sub-models, and equally dividing a sample data set into a plurality of groups of sample data;

in the training process, training each sub-model of the entity tag prediction model based on each group of sample data to obtain a plurality of optimized sub-models;

when predicting entity labels for data to be processed, acquiring source field, generation time and context information of the data to be processed, and respectively outputting corresponding entity label predicted values through a plurality of sub-models;

performing difference calculation on each entity tag predicted value and other entity tag predicted values one by one based on each entity tag predicted value to obtain a plurality of second difference values;

judging whether the second difference value is larger than a second preset threshold value, if so, marking the predicted value of the entity tag as abnormal once;

after all the predicted values of the entity tags are compared, counting the total number of times that each predicted value of the entity tag is marked as abnormal;

judging whether the total number of times that each entity tag predicted value is marked as abnormal is greater than a third preset threshold value, if so, rejecting the corresponding entity tag predicted value, and marking the corresponding sub-model as an abnormal sub-model;

performing cluster analysis on all the reserved entity tag predicted values through a density clustering algorithm to obtain a cluster center;

and taking the predicted value of the entity label closest to the clustering center as the finally predicted entity label.
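The rejection and selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each sub-model is assumed to output a scalar predicted value, the function name `select_entity_tag_value` and the parameter `eps` are hypothetical, and the density clustering algorithm (which the text does not name) is replaced by a simplified density step that takes the densest neighbourhood as the cluster.

```python
from typing import List, Tuple


def select_entity_tag_value(
    predictions: List[float],
    diff_threshold: float,   # plays the role of the "second preset threshold"
    count_threshold: int,    # plays the role of the "third preset threshold"
    eps: float,              # hypothetical neighbourhood radius for the density step
) -> Tuple[float, List[int]]:
    """Return (final predicted value, indices of sub-models marked abnormal)."""
    n = len(predictions)

    # Compare every predicted value with every other one and count
    # how often the difference exceeds the threshold.
    abnormal_counts = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and abs(predictions[i] - predictions[j]) > diff_threshold:
                abnormal_counts[i] += 1

    # Reject values marked abnormal too often; remember their sub-models.
    abnormal = [i for i, c in enumerate(abnormal_counts) if c > count_threshold]
    kept = [p for i, p in enumerate(predictions) if i not in abnormal]
    if not kept:
        raise ValueError("all sub-model predictions were rejected")

    # Simplified density step (assumption): the cluster is the neighbourhood
    # of the value that has the most neighbours within eps.
    def neighbours(v: float) -> List[float]:
        return [u for u in kept if abs(u - v) <= eps]

    densest = max(kept, key=lambda v: len(neighbours(v)))
    cluster = neighbours(densest)
    centre = sum(cluster) / len(cluster)

    # The retained value closest to the cluster centre is the final output.
    final = min(kept, key=lambda v: abs(v - centre))
    return final, abnormal
```

With five sub-models predicting `[0.90, 0.92, 0.91, 0.30, 0.94]`, a difference threshold of 0.2 and a count threshold of 2, the fourth value is rejected and its sub-model is marked abnormal, while the final value is chosen among the remaining four.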

In this solution, after outputting the corresponding entity tag through the optimized entity tag prediction model, the method further includes:

evaluating the output entity label through an accuracy evaluation model to obtain an evaluation result;

and determining whether to trigger continuous training optimization on the entity tag prediction model according to the evaluation result, if the evaluation result is accurate, the continuous training on the entity tag prediction model is not required to be triggered, and if the evaluation result is inaccurate, the continuous training on the entity tag prediction model is triggered.

In the scheme, the output entity tag is evaluated through the accuracy evaluation model to obtain an evaluation result, and the method specifically comprises the following steps:

acquiring influence weights of source field, generation time and context information on ambiguous words;

presetting n semantics and drawing n semantic unit vectors from the same datum point, such that the extension lines of the n semantic unit vectors divide any circle centered on the datum point into n equal parts;

acquiring the pointing semantics A of the source field, the pointing semantics B of the generation time and the pointing semantics C of the context information of the data to be processed;

vectorizing the pointing semantics A, the pointing semantics B and the pointing semantics C according to the n semantic unit vectors to obtain the pointing semantic unit vectors $\vec{a}$, $\vec{b}$ and $\vec{c}$, respectively;

multiplying the pointing semantic unit vectors $\vec{a}$, $\vec{b}$ and $\vec{c}$ by their corresponding influence weights, and summing the product vectors to obtain a comprehensive pointing vector $\vec{d}$;

determining the corresponding semantics, and its semantic unit vector $\vec{e}$, based on the predicted entity tag;

multiplying the comprehensive pointing vector $\vec{d}$ by the semantic unit vector $\vec{e}$ and judging whether the product is positive; if it is positive, the evaluation result is accurate, and if it is negative, the evaluation result is inaccurate.
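The evaluation steps above can be sketched in a few lines. The sketch assumes that the "product" of the two vectors is the dot product and that the semantics are indexed 0 to n-1 around the circle; the function names and the dictionary keys are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, float]


def semantic_unit_vectors(n: int) -> List[Vector]:
    """n unit vectors from a common origin that split a circle into n equal parts."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


def is_prediction_accurate(
    n: int,
    pointed: Dict[str, int],     # semantics index pointed to by each factor
    weights: Dict[str, float],   # influence weight of each factor
    predicted: int,              # semantics index implied by the predicted tag
) -> bool:
    units = semantic_unit_vectors(n)
    # Weighted vector sum of the three pointing unit vectors.
    dx = sum(weights[name] * units[idx][0] for name, idx in pointed.items())
    dy = sum(weights[name] * units[idx][1] for name, idx in pointed.items())
    # Dot product with the unit vector of the predicted semantics (assumption).
    ex, ey = units[predicted]
    return dx * ex + dy * ey > 0
```

For example, with n = 4, a source field and generation time that both point to semantics 0 and a context that points to semantics 1, a predicted tag that implies semantics 0 is judged accurate, whereas one that implies the opposite semantics 2 is judged inaccurate.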

In the scheme, the method for obtaining the influence weights of the source field, the generation time and the context information on the polysemous words specifically comprises the following steps:

acquiring current big data environment information;

constructing a weight prediction model, and training the weight prediction model through samples of different big data environment information;

and predicting the influence weight of the source field, the generation time and the context information on the polysemous word by a weight prediction model based on the current big data environment information.

In this solution, after predicting, by the weight prediction model, the influence weights of the source field, the generation time and the context information on the polysemous word, the method further includes:

acquiring historical data for a plurality of historical times, wherein each historical data item comprises at least the big data environment information at the historical time and the historical actual influence weights of the source field, the generation time and the context information on the polysemous word at that historical time;

performing feature calculation based on the current big data environment information to obtain a first feature value;

performing feature calculation based on big data environment information of each historical data to obtain a second feature value;

respectively comparing and calculating the difference degree between the second characteristic value and the current first characteristic value of each historical data;

adding historical data with the difference degree smaller than a first preset threshold value into a selected database;

predicting, through the weight prediction model and based on the big data environment information of each historical data item in the selected database, the historical predicted influence weights of the source field, the generation time and the context information on the polysemous word at the corresponding historical time;

performing difference calculation on the historical actual influence weight and the historical predicted influence weight based on each piece of historical data in the selected database to obtain a first difference value;

averaging the first differences based on the total number of the historical data in the selected database to obtain an average difference;

and adding the average difference value on the basis of the predicted influence weight to obtain the corrected influence weight.
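The correction procedure above can be sketched as follows. It is a simplified illustration: the feature calculation is assumed to yield a single number per environment, the degree of difference is taken as the absolute difference of those numbers, and `predict` stands in for the trained weight prediction model. All names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# (source field, generation time, context information)
Weights = Tuple[float, float, float]


def correct_weights(
    current_feature: float,
    predicted: Weights,
    history: Sequence[Tuple[float, Weights]],  # (feature value, actual weights)
    predict: Callable[[float], Weights],       # stand-in for the weight model
    first_threshold: float,
) -> Weights:
    # Keep only historical data whose environment resembles the current one.
    selected = [(f, actual) for f, actual in history
                if abs(f - current_feature) < first_threshold]
    if not selected:
        return predicted  # nothing comparable: leave the prediction unchanged

    # First difference per item: actual weight minus model-predicted weight.
    diffs = [tuple(a - p for a, p in zip(actual, predict(f)))
             for f, actual in selected]

    # Average the first differences over the selected items, per dimension.
    avg = tuple(sum(d[k] for d in diffs) / len(diffs) for k in range(3))

    # Corrected weight = predicted weight + average difference.
    return tuple(p + d for p, d in zip(predicted, avg))
```

Because both the actual and the predicted weights sum to 1, the three average differences sum to 0, so the corrected weights still sum to 1.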

The second aspect of the present invention also provides a named entity recognition system based on polysemous words, comprising a memory and a processor, wherein the memory stores a program of the named entity recognition method based on polysemous words, and the program, when executed by the processor, implements the following steps:

constructing an entity tag prediction model based on polysemous words;

In this solution, after outputting the corresponding entity tag through the optimized entity tag prediction model, the program of the named entity recognition method based on polysemous words, when executed by the processor, further implements the following steps:

The third aspect of the present invention also provides a computer readable storage medium storing a program of the named entity recognition method based on polysemous words, which, when executed by a processor, implements the steps of the named entity recognition method based on polysemous words described above.

The method, system and storage medium for named entity recognition based on polysemous words can realize accurate named entity recognition on text containing polysemous words.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

FIG. 1 illustrates a flow chart of the named entity recognition method based on polysemous words of the present invention;

FIG. 2 illustrates a block diagram of the named entity recognition system based on polysemous words of the present invention.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

FIG. 1 illustrates a flow chart of the named entity recognition method based on polysemous words of the present invention.

As shown in FIG. 1, the first aspect of the present invention provides a named entity recognition method based on polysemous words, the method comprising:

S102, preparing a sample data set for training, counting all polysemous words, and modifying the sample data set;

S104, constructing an entity tag prediction model based on polysemous words;

S106, analyzing a loss function of the entity tag prediction model based on polysemous words;

S108, training the entity tag prediction model with the sample data in the sample data set, and obtaining an optimized entity tag prediction model after the training is finished;

S110, in the training process, presetting that a given polysemous word has multiple semantics, acquiring the source field, generation time and context information of all sample data corresponding to the polysemous word, and, for each sample data item, packaging them into a judgment influence factor of the corresponding semantics;

S112, analyzing the correspondence between the judgment influence factors and the polysemous word, based on the judgment influence factor of each semantics and the sample entity tag of the corresponding sample data;

S114, performing deep learning on the entity tag prediction model based on the correspondence, and optimizing parameters of the entity tag prediction model;

S116, when predicting the entity tag of the data to be processed, acquiring the source field, generation time and context information of the data to be processed, and outputting the corresponding entity tag through the optimized entity tag prediction model.

According to the invention, the entity tag prediction model is trained through the sample data so as to optimize model parameters, thereby improving the accuracy of model prediction and realizing accurate named entity recognition of polysemous words.

It should be noted that the source field, the generation time and the context information have a certain influence on the entity tag of the polysemous word respectively.
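As a rough illustration of how the three influences might be packaged and related to sample entity tags, the sketch below groups the source field, generation time and context information into one record and counts which tags co-occur with each record. The class name `InfluenceFactor`, the string encodings and the example tag names are assumptions made for illustration; the text does not specify a data structure.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class InfluenceFactor:
    """Judgment influence factor packaged from one sample (hypothetical layout)."""
    source_field: str
    generation_time: str   # e.g. a year or period bucket (assumed encoding)
    context: str


def build_correspondence(
    samples: Iterable[Tuple[InfluenceFactor, str]],
) -> Dict[InfluenceFactor, Counter]:
    """Count, for each influence factor, how often each sample entity tag occurs."""
    table: Dict[InfluenceFactor, Counter] = defaultdict(Counter)
    for factor, entity_tag in samples:
        table[factor][entity_tag] += 1
    return dict(table)
```

A table of this kind is only one simple way to expose the correspondence; the tag that occurs most often for a given factor can then be read with `Counter.most_common(1)`.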

According to a specific embodiment of the invention, modifying the sample dataset specifically comprises:

adding a flag after each word in each sample data item to indicate whether the word is a polysemous word, the flag taking the value 1 or 2, where 1 means no and 2 means yes.
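A minimal sketch of this annotation step, assuming that each sample is already tokenized and that the flag 2 marks a polysemous word while 1 marks any other word. The function name is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def add_polysemy_flags(tokens: Iterable[str],
                       polysemous_words: Set[str]) -> List[Tuple[str, int]]:
    """Attach a flag to every token: 2 if it is a polysemous word, otherwise 1."""
    return [(token, 2 if token in polysemous_words else 1) for token in tokens]
```

For the tokens `["apple", "released", "a", "phone"]` and the polysemous word set `{"apple"}`, only the first token receives the flag 2.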

According to an embodiment of the present invention, the calculation formula of the loss function L is:

wherein G represents the total number of tag categories, i represents the serial number of a tag category, j represents the serial number of a sample data item under that tag category, and q represents the total number of sample data items under that tag category; the two remaining quantities in the formula represent the real entity tag and the predicted probability that the entity tag is true.
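The quantities listed here (a sum over tag categories and over samples, a real tag and a predicted probability) are consistent with a cross-entropy loss, although the text as extracted does not show the formula itself. The sketch below therefore assumes a plain cross-entropy form; the function name, the argument layout and the clipping constant `eps` are illustrative choices.

```python
import math
from typing import Sequence


def cross_entropy_loss(y_true: Sequence[Sequence[float]],
                       p_pred: Sequence[Sequence[float]],
                       eps: float = 1e-12) -> float:
    """Assumed form: L = -sum_i sum_j y[i][j] * log(p[i][j]).

    y_true[i][j] is the real tag indicator (0 or 1) of sample j under
    tag category i; p_pred[i][j] is the predicted probability that it is true.
    """
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        for y, p in zip(y_row, p_row):
            # Clip the probability to avoid log(0).
            total -= y * math.log(max(p, eps))
    return total
```

Only the terms whose real tag is 1 contribute, so the loss shrinks as the model assigns higher probability to the true tags.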

According to a specific embodiment of the present invention, outputting a corresponding entity tag through an optimized entity tag prediction model specifically includes:

In the invention, a plurality of sub-models are introduced for difference comparison, and the predicted values of the entity labels predicted by the plurality of sub-models are averaged, so that the entity labels which are more fit with reality are output. In addition, in order to reduce the influence of larger prediction errors of individual submodels on the overall predicted value, the method eliminates the entity tag predicted value with larger errors, and performs abnormal marking on the corresponding submodels so as to avoid the decrease of prediction accuracy caused by the subsequent introduction of abnormal submodels for prediction.

According to an embodiment of the present invention, after outputting the corresponding entity tag through the optimized entity tag prediction model, the method further includes:

According to the embodiment of the invention, the output entity tag is evaluated through the accuracy evaluation model to obtain an evaluation result, and the evaluation result specifically comprises:

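obtaining the pointing semantic unit vectors $\vec{a}$, $\vec{b}$ and $\vec{c}$ of the pointing semantics A, B and C, respectively;

multiplying the pointing semantic unit vectors $\vec{a}$, $\vec{b}$ and $\vec{c}$ by their corresponding influence weights, and summing the product vectors to obtain a comprehensive pointing vector $\vec{d}$;

determining the corresponding semantics, and its semantic unit vector $\vec{e}$, based on the predicted entity tag;

multiplying the comprehensive pointing vector $\vec{d}$ by the semantic unit vector $\vec{e}$.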

If the vector product is positive, the meaning of the predicted entity tag is in the same direction as the meaning pointed by the source field, the generation time and the context information, so that the predicted value can be verified to be relatively accurate.

It can be understood that the n semantic unit vectors are arranged in gradient order according to the similarity of their semantics.

It can be understood that the source field, the generation time and the context information of the current data to be processed may each point to a particular semantics of the polysemous word: for example, the source field may point to semantics A, the generation time to semantics B, and the context information to semantics C.

According to the embodiment of the invention, the method for acquiring the influence weights of the source field, the generation time and the context information on the polysemous words comprises the following specific steps:

acquiring current big data environment information;

It should be noted that under different big data environments, the influence weights of source domain, generation time and context information on the ambiguities will change.

According to an embodiment of the present invention, after predicting, by the weight prediction model, the impact weight of the source domain, the generation time, the context information on the ambiguities, the method further comprises:

It can be understood that the first difference and the average difference each comprise three components, one for each of the source field, the generation time and the context information. Since the influence weights of these three dimensions sum to 1, the differences of the three dimensions differ in sign and sum to 0.

It should be noted that the historical actual influence weights are obtained by manual evaluation of how well the big data environment information at the historical time matches the corresponding prediction result.

According to the method, the difference value between the actual weight and the model predicted weight is calculated through the historical data, and the influence weight of the current prediction is corrected based on the difference value, so that the more accurate influence weight is obtained.

According to a specific embodiment of the invention, the method further comprises:

inputting sample data in a sample data set into a feature extraction network, and calculating to obtain similar neighbor distances, similar neighbor distances and feature vectors of each sample data;

calculating the sum of similar neighbor distances and similar neighbor distances of each sample data, carrying out normalization processing on the sum of the distances of each sample data, and calculating the weight of each sample data;

calculating according to the weight and the feature vector of each sample data to obtain a weighted prototype; the algorithm of the weighted prototype is specifically as follows:

wherein the quantities in the weighted prototype formula are: the weighted prototype; the sample data set; the identification number of the input sample data; the input sample data and its corresponding entity tag; the feature vector of the sample data; and the weight of each sample data item in the sample data set.

The calculation formula of the weight uses a Euclidean distance function, a parameter factor and a normalization function, together with two distance sums: the sum of the distances, measured between feature vectors, from a sample data item to the other sample data in the sample data set; and the sum of the distances from that sample data item to all samples in the sample data set of class c.

acquiring the data to be processed, processing it based on the weighted prototype to obtain weighted data to be processed, inputting the weighted data into the optimized entity tag prediction model, and predicting the entity tag.
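The weighted prototype idea can be sketched as below. The exact weight formula is not recoverable from the text, so the sketch makes its own assumption: each sample's distance sum is turned into a weight by a softmax over the negative distance sum scaled by a parameter `lam`, so that samples far from the rest contribute less. The function names and this normalization choice are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_prototype(features: List[Sequence[float]],
                       lam: float = 1.0) -> List[float]:
    """Weighted mean of feature vectors; outlying samples get smaller weights."""
    # Sum of distances from each sample to all samples in the set.
    dist_sums = [sum(euclidean(f, g) for g in features) for f in features]

    # Assumed normalization: softmax over -lam * distance sum
    # (shifted by the minimum for numerical stability).
    smallest = min(dist_sums)
    scores = [math.exp(-lam * (d - smallest)) for d in dist_sums]
    total = sum(scores)
    weights = [s / total for s in scores]

    # Prototype = sum of weight * feature vector, per dimension.
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
```

With this choice, a prototype built from two tightly grouped samples and one distant outlier stays close to the group instead of being pulled toward the plain mean.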

As shown in FIG. 2, the second aspect of the present invention further provides a named entity recognition system 2 based on polysemous words, comprising a memory 21 and a processor 22, wherein the memory stores a program of the named entity recognition method based on polysemous words, and the program, when executed by the processor, implements the following steps:

constructing an entity tag prediction model based on polysemous words;

and acquiring data to be processed, inputting the data to be processed into an optimized entity tag prediction model, and predicting to obtain an entity tag.

According to the embodiment of the invention, the data to be processed is obtained and input into an optimized entity tag prediction model, and the entity tag is obtained by prediction, specifically comprising:

According to an embodiment of the present invention, after outputting the corresponding entity tag through the optimized entity tag prediction model, the ambiguous term-based named entity recognition method program further implements the following steps when executed by the processor:

The named entity recognition method, system and storage medium based on polysemous words can realize accurate named entity recognition of polysemous words.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a division by logical function, and other divisions are possible in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be made through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or of another form.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.

Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A named entity recognition method based on polysemous words, the method comprising:

constructing an entity tag prediction model based on polysemous words;

when predicting entity labels for the data to be processed, acquiring source field, generation time and context information of the data to be processed, and outputting corresponding entity labels through an optimized entity label prediction model;

outputting the corresponding entity label through the optimized entity label prediction model, which specifically comprises the following steps:

2. The named entity recognition method based on polysemous words of claim 1, wherein after outputting the corresponding entity tag through the optimized entity tag prediction model, the method further comprises:

3. The named entity recognition method based on polysemous words of claim 2, wherein evaluating the output entity tag through the accuracy evaluation model to obtain the evaluation result specifically comprises:

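obtaining the pointing semantic unit vectors $\vec{a}$, $\vec{b}$ and $\vec{c}$ of the pointing semantics A, B and C, respectively;

multiplying the pointing semantic unit vectors $\vec{a}$, $\vec{b}$ and $\vec{c}$ by their corresponding influence weights, and summing the product vectors to obtain a comprehensive pointing vector $\vec{d}$;

determining the corresponding semantics, and its semantic unit vector $\vec{e}$, based on the predicted entity tag;

multiplying the comprehensive pointing vector $\vec{d}$ by the semantic unit vector $\vec{e}$.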

4. The named entity recognition method based on polysemous words of claim 3, wherein acquiring the influence weights of the source field, the generation time and the context information on the polysemous word specifically comprises:

acquiring current big data environment information;

5. The named entity recognition method based on polysemous words of claim 4, wherein after predicting, by the weight prediction model, the influence weights of the source field, the generation time and the context information on the polysemous word, the method further comprises:

6. A named entity recognition system based on polysemous words, characterized by comprising a memory and a processor, wherein the memory stores a program of the named entity recognition method based on polysemous words, and the program, when executed by the processor, implements the following steps:

constructing an entity tag prediction model based on polysemous words;

7. The named entity recognition system based on polysemous words of claim 6, wherein the program of the named entity recognition method based on polysemous words, when executed by the processor, further implements the following steps:

8. A computer readable storage medium, characterized in that it stores a program of the named entity recognition method based on polysemous words, which, when executed by a processor, implements the steps of the named entity recognition method based on polysemous words according to any one of claims 1 to 5.