CN117634489A - Sample selection method, sample selection system, identification method and identification system - Google Patents
- Publication number: CN117634489A (application CN202311609958.6A)
- Authority: CN (China)
- Prior art keywords: entity, sample, labeling, probability distribution, model
- Legal status: Granted
Classifications
- G06F40/295: Natural language analysis; recognition of textual entities; named entity recognition
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06F40/216: Natural language analysis; parsing using statistical methods
- G06N3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/084: Learning methods; backpropagation, e.g. using gradient descent
Abstract
The embodiment of the invention provides a sample selection method, a sample selection system, an identification method and an identification system, and relates to the field of entity identification. The method comprises the following steps: inputting unlabeled samples into a preliminary entity recognition model to perform reasoning recognition on the entities in the unlabeled samples, and outputting a reasoning result corresponding to each unlabeled sample; determining, according to the reasoning result corresponding to each unlabeled sample, probability distribution difference values of the various types of entities in the reasoning results based on a sampling model of probability distribution differences; and taking unlabeled samples whose probability distribution difference value is not lower than a threshold as supplementary samples. The sampling model based on probability distribution differences compares the entity-category probability distribution of unlabeled samples with that of the labeled samples; unlabeled samples with a large probability distribution difference value are used as supplementary samples to train the entity recognition model again, and the final entity recognition model can then recognize entities of the same entity types as those in the supplementary samples, thereby improving the recall rate.
Description
Technical Field
The invention relates to the field of entity identification, in particular to a sample selection method, a sample selection system, an identification method and an identification system.
Background
Named entity recognition is a challenging task that demands both high accuracy and high recall. Owing to their data-driven context-encoding capability, deep learning models have become the dominant approach to named entity recognition. Achieving high accuracy and recall for named entity recognition in professional fields usually requires collecting a large amount of domain-specific labeled data to train the model sufficiently. However, in many professional fields such as finance, medicine and law, labeling data demands a degree of domain expertise, so acquiring and labeling data is considerably more expensive than in general domains. Existing sampling strategies are generally based on features such as the maximum length of a sample or its recognition probability, and include random sampling, entropy-based sampling, least-confidence sampling and margin sampling; they mainly target cases where the accuracy of recognized entities is low and cannot solve the problem of low recall.
In carrying out the present invention, the applicant found that the prior art has at least the following problem: when additional domain-specific samples cannot be obtained, the recall of a trained deep learning model on domain entities is low and its recognition performance is poor.
Disclosure of Invention
The embodiment of the invention provides a sample selection method, a sample selection system, an identification method and an identification system, which address the technical problem in the prior art that, when additional domain-specific samples cannot be obtained, a trained deep learning model has low recall on domain entities and poor recognition performance.
To achieve the above object, in a first aspect, an embodiment of the present invention provides a sample selection method, including:
inference entity: inputting the unlabeled sample into a preliminary entity recognition model to perform reasoning recognition on the entity in the unlabeled sample, and outputting a reasoning result corresponding to each unlabeled sample, wherein the reasoning result comprises: entity, entity type, number of entities of each type;
calculating probability distribution differences: determining probability distribution difference values of various types of entities in the reasoning results based on a sampling model of probability distribution differences according to the reasoning results corresponding to each unlabeled sample;
Updating the labeling sample set: taking unlabeled samples with probability distribution difference values not lower than a threshold value as supplementary samples; the supplementary sample is used for supplementing the labeling sample set to form an updated labeling sample set, and the updated labeling sample set is used for training the preliminary entity recognition model again to obtain a final entity recognition model.
In a second aspect, an embodiment of the present invention provides a method for identifying a named entity, including the foregoing sample selection method;
the named entity identification method further comprises the following steps:
labeling a sample: supplementing the manually marked supplementary sample to the marked sample set to form an updated marked sample set;
updating the entity identification model: training a preliminary entity recognition model by adopting all the labeling samples in the updated labeling sample set to obtain a final entity recognition model;
entity identification: and identifying the named entity in the data to be identified through the final entity identification model.
In a third aspect, an embodiment of the present invention provides a sample selection system, including:
the entity reasoning unit is used for inputting the unlabeled sample into the preliminary entity recognition model to perform reasoning recognition on the entity in the unlabeled sample, and outputting a reasoning result corresponding to each unlabeled sample, wherein the reasoning result comprises: entity, entity type, number of entities of each type;
The probability distribution difference calculation unit is used for determining probability distribution difference values of various types of entities in the reasoning results based on a sampling model of the probability distribution difference according to the reasoning results corresponding to each unlabeled sample;
the labeling sample set updating unit is used for taking unlabeled samples with a probability distribution difference value not lower than a threshold value as supplementary samples; the supplementary sample is used for supplementing the labeling sample set to form an updated labeling sample set, and the updated labeling sample set is used for training the preliminary entity recognition model again to obtain a final entity recognition model.
In a fourth aspect, an embodiment of the present invention provides a named entity recognition system, including the foregoing sample selection system;
the named entity recognition system further comprises:
the labeling unit is used for supplementing the manually labeled supplementary sample to the labeling sample set to form an updated labeling sample set;
the entity recognition model training unit is used for training a preliminary entity recognition model by adopting all the updated labeling samples in the labeling sample set to obtain a final entity recognition model;
and the entity recognition unit is used for recognizing the named entity in the data to be recognized through the final entity recognition model.
In a fifth aspect, embodiments of the present invention provide a computer-readable storage medium storing one or more programs, which when executed by a computer device, cause the computer device to perform the aforementioned sample selection method.
In a sixth aspect, an embodiment of the present invention provides a computer apparatus, including:
a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the sample selection method described previously.
The technical scheme has the following beneficial effects: a sampling model based on probability distribution differences compares the entity-category probability distribution of unlabeled samples with that of the labeled samples to quantify the labeling value of each unlabeled sample; unlabeled samples with a large probability distribution difference are selected as supplementary samples worth labeling, and the labeling sample set is updated. The updated labeling sample set is used to retrain the entity recognition model, and the final entity recognition model can recognize entities of the same entity types as those in the supplementary samples. A better entity recognition effect is thus obtained with fewer samples, the recall rate of recognition is improved, and the recall effect is good.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments or the description of the prior art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the invention; a person skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of a sample selection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of identifying named entities according to an embodiment of the invention;
FIG. 3 is a logical block diagram of a sample selection system according to an embodiment of the present invention;
FIG. 4 is a logical block diagram of a named entity recognition system according to an embodiment of the present invention;
FIG. 5 is a logical block diagram of a computer device in accordance with an embodiment of the present invention;
FIG. 6 is an example of a sample sampling function according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Some technical terms related to the embodiment of the invention are defined as follows:
the BERT-CRF is a sequence labeling model that combines two techniques of pre-trained bi-directional encoder representation (Bidirectional Encoder Representations from Transformers, BERT) and conditional random field (Conditional Random Field, CRF).
BERT is a pre-trained language model based on a Transformer architecture that learns rich language representations by pre-training on large-scale unlabeled text. The BERT can capture the context information and the semantic relation and has strong characterization capability. In the sequence labeling task, the BERT may act as an encoder to convert an input sequence into a contextually relevant representation.
CRF is a statistical model commonly used for sequence labeling tasks. The method can perform consistency modeling and optimization on the tag sequence by modeling global features of the tag sequence and transition probabilities between tags. The CRF can solve the problem of the dependency relationship among the labels, so that the model can more accurately predict the labels at each position in the sequence.
The BERT-CRF combines the advantages of BERT and CRF, and can not only learn language representation related to context by using the BERT model, but also model the dependency relationship between labels by using the CRF model. This allows BERT-CRF to exhibit better performance in sequence Tagging tasks, such as named entity recognition (Named Entity Recognition, NER), part-of-Speech Tagging (Part-of-Speech Tagging), and the like. By jointly training the BERT and the CRF, the BERT-CRF can achieve more accurate and consistent label prediction in the sequence labeling task.
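As an illustration of this pairing, the following is a minimal, hedged sketch of a BERT-CRF model in Python; it assumes the Hugging Face `transformers` package and the third-party `pytorch-crf` package (`torchcrf`), neither of which is named by the patent.

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pytorch-crf package; an assumed dependency


# Minimal BERT-CRF sketch: BERT encodes the tokens, a linear layer produces
# per-tag emission scores, and the CRF layer models tag-transition
# dependencies over the whole sequence.
class BertCrf(nn.Module):
    def __init__(self, num_tags: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.emit = nn.Linear(self.bert.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: the CRF returns the log-likelihood; negate it as a loss.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi decoding of the best tag path per sequence.
        return self.crf.decode(emissions, mask=mask)
```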
BiLSTM-CRF is a sequence labeling model that combines two techniques: a Bidirectional Long Short-Term Memory (BiLSTM) network and a Conditional Random Field (CRF).
BiLSTM is a variant of the Recurrent Neural Network (RNN) that can effectively capture contextual information in sequence data. By combining a forward and a backward LSTM layer, BiLSTM considers the left and right context of the current position simultaneously, and thus better captures the dependency relationships and semantic information in the sequence.
As shown in fig. 1, in combination with an embodiment of the present invention, there is provided a sample selection method including:
S101, reasoning entity: inputting the unlabeled sample into a preliminary entity recognition model to perform reasoning recognition on the entity in the unlabeled sample, and outputting a reasoning result corresponding to each unlabeled sample, wherein the reasoning result comprises: entity, entity type, number of entities of each type;
S102, calculating probability distribution differences: determining probability distribution difference values of various types of entities in the reasoning results based on a sampling model of probability distribution differences according to the reasoning results corresponding to each unlabeled sample;
S103, updating the labeling sample set: taking unlabeled samples with probability distribution difference values not lower than a threshold value as supplementary samples; the supplementary sample is used for supplementing the labeling sample set to form an updated labeling sample set, and the updated labeling sample set is used for training the preliminary entity recognition model again to obtain a final entity recognition model.
A sampling model based on probability distribution differences compares the entity-category probability distribution of unlabeled samples with that of the labeled samples to quantify the labeling value of each unlabeled sample; unlabeled samples with a large probability distribution difference are selected as supplementary samples worth labeling, and the labeling sample set is updated. The updated labeling sample set is used to retrain the entity recognition model, and the final entity recognition model can recognize entities of the same entity types as those in the supplementary samples. A better entity recognition effect is thus obtained with fewer samples, the recall rate of recognition is improved, and the recall effect is good.
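Read procedurally, the three steps amount to a filter over the unlabeled pool. The sketch below shows this shape in Python; `model.predict`, `sampling_model.score` and `result.type_counts` are hypothetical interfaces, not names from the patent.

```python
# Hedged sketch of the three-step selection procedure (S101-S103).
def select_supplementary_samples(model, sampling_model, unlabeled, threshold):
    supplementary = []
    for sample in unlabeled:
        result = model.predict(sample)                   # S101: entities, types, per-type counts
        diff = sampling_model.score(result.type_counts)  # S102: distribution difference value
        if diff >= threshold:                            # S103: "not lower than a threshold"
            supplementary.append(sample)
    return supplementary  # to be manually labeled and added to the labeling sample set
```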
Preferably, the sample selection method further comprises: constructing a sampling model based on probability distribution differences, and constructing the sampling model based on probability distribution differences specifically comprises the following steps:
obtaining a labeling result of each labeling sample in the labeling sample set after manual labeling, wherein the labeling result comprises: entity, entity type, number of entities of each type;
And determining a Gaussian probability density function of the entity type based on the number of the entities of each type in each labeling sample, and taking the Gaussian probability density function as a sampling model based on the probability distribution difference.
Preferably, the method for constructing the sampling model based on the probability distribution difference specifically comprises the following steps:
and carrying out normalization processing on the Gaussian probability density function, controlling its value within the range of [0,1] to obtain a sample sampling function, and taking the sample sampling function as the final sampling model based on probability distribution differences. Because the value of the Gaussian probability density function is constrained to [0,1], unlabeled samples are easy to compare by their probability distribution difference values when these are obtained from the sampling model.
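For concreteness, the density and one possible normalization can be written as below. The density is the standard multivariate Gaussian the description defines later; the normalization shown is only a plausible form consistent with the worked example in the description (values near 1 far from the mean, near 0 at the mean), since the patent gives its exact formula only in a figure.

```latex
% f(x): multivariate Gaussian density over entity-type count vectors
% (mean matrix \mu_X, covariance \Sigma, m entity types).
% P(x): an assumed [0,1] normalization, not the patent's verbatim formula.
\[
  f(x) = \frac{1}{(2\pi)^{m/2}\,\lvert\Sigma\rvert^{1/2}}
         \exp\!\left(-\tfrac{1}{2}\,(x-\mu_X)^{T}\Sigma^{-1}(x-\mu_X)\right),
  \qquad
  P(x) = 1 - \frac{f(x)}{f(\mu_X)} .
\]
```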
Preferably, the sample selection method further comprises:
updating a sampling model based on the probability distribution difference: updating the parameter value of the sampling model based on the probability distribution difference according to each labeling sample in the updated labeling sample set to obtain an updated sampling model based on the probability distribution difference, wherein the updated sampling model based on the probability distribution difference is used for selecting a new unlabeled sample as a supplementary sample; wherein the parameter values include: the mean matrix of the number of the entities of each type and the covariance matrix formed by the number of the entities of each type.
Updating the parameter values of the sampling model based on probability distribution differences and using the updated model to select new unlabeled samples makes it possible to identify samples that differ from the labeled samples on which the current sampling model is built, so that more samples of different types are selected; this improves the recognition capability of the entity recognition model and the recall rate of entity recognition.
In combination with the embodiment of the invention, a named entity identification method is provided, which comprises any one of the sample selection methods, wherein one named entity identification method is shown in fig. 2;
the named entity identification method further comprises the following steps:
S201, labeling a sample: supplementing the manually labeled supplementary sample to the labeling sample set to form an updated labeling sample set;
S202, updating the entity identification model: training a preliminary entity recognition model by adopting all the labeling samples in the updated labeling sample set to obtain a final entity recognition model;
S203, entity identification: identifying the named entity in the data to be identified through the final entity identification model.
Preferably, S202, updating the entity recognition model specifically includes:
Training a preliminary entity recognition model by adopting all the labeling samples in the updated labeling sample set to obtain a current entity recognition model;
the named entity identification method further comprises the following steps:
and (3) testing and verifying: testing and verifying the current entity identification model by adopting a test set to obtain an identification result, and judging whether the accuracy of the identification result reaches a preset accuracy;
when the accuracy of the identification result does not reach the preset accuracy, sequentially circularly reasoning the entity, calculating the probability distribution difference, updating the labeling sample set, labeling the sample and updating the entity identification model according to the new unlabeled sample until the identification result obtained by testing and verifying the current entity identification model reaches the preset accuracy in the step test and verification, stopping sequentially circulating, and taking the current entity identification model as a final entity identification model;
when the accuracy of the identification result reaches the preset accuracy, the current entity identification model is used as the final entity identification model.
Once the accuracy of the identification result reaches the preset accuracy, using the entity identification model on the data to be identified improves the recall rate of named entities.
As shown in fig. 3, in connection with an embodiment of the present invention, there is provided a sample selection system including:
the entity inference unit 31 is configured to input the unlabeled samples into the preliminary entity recognition model to perform inference recognition on the entities in the unlabeled samples, and output inference results corresponding to the unlabeled samples, where the inference results include: entity, entity type, number of entities of each type;
a probability distribution difference calculation unit 32, configured to determine, for each inference result corresponding to the unlabeled sample, a probability distribution difference value of each type of entity in the inference result based on a sampling model of the probability distribution difference;
a labeled sample set updating unit 33, configured to take, as a complementary sample, an unlabeled sample whose probability distribution difference value is not lower than a threshold value; the supplementary sample is used for supplementing the labeling sample set to form an updated labeling sample set, and the updated labeling sample set is used for training the preliminary entity recognition model again to obtain a final entity recognition model.
Preferably, the sample selection system further comprises a construction unit of a sampling model based on the probability distribution differences, the construction unit of the sampling model based on the probability distribution differences being configured to:
obtaining a labeling result of each labeling sample in the labeling sample set after manual labeling, wherein the labeling result comprises: entity, entity type, number of entities of each type;
And determining a Gaussian probability density function of the entity type based on the number of the entities of each type in each labeling sample, and taking the Gaussian probability density function as a sampling model based on the probability distribution difference.
Preferably, the construction unit of the sampling model based on the probability distribution differences is further configured to:
and carrying out normalization processing on the Gaussian probability density function, controlling the value of the Gaussian probability density function within the range of [0,1] to obtain a sample sampling function, and taking the sample sampling function as a final sampling model based on probability distribution difference.
Preferably, the sample selection system further comprises:
the sampling model updating unit is used for updating the parameter value of the sampling model based on the probability distribution difference according to each labeling sample in the updated labeling sample set to obtain an updated sampling model based on the probability distribution difference, and the updated sampling model based on the probability distribution difference is used for selecting a new unlabeled sample as a supplementary sample; wherein the parameter values include: the mean matrix of the number of the entities of each type and the covariance matrix formed by the number of the entities of each type.
As shown in fig. 4, in connection with an embodiment of the present invention, there is provided a named entity recognition system, including any of the sample selection systems described above;
The named entity recognition system further comprises:
the labeling unit 41 is configured to supplement the manually labeled supplementary sample to a labeling sample set to form an updated labeling sample set;
the entity recognition model training unit 42 is configured to train the preliminary entity recognition model by using all the labeling samples in the updated labeling sample set, so as to obtain a final entity recognition model;
the entity recognition unit 43 is configured to recognize the named entity in the data to be recognized through the final entity recognition model.
Preferably, the named entity recognition system further comprises a test unit, wherein:
the entity recognition model training unit 42 is specifically configured to train the preliminary entity recognition model by using all the labeling samples in the updated labeling sample set, so as to obtain a current entity recognition model; when the accuracy of the identification result obtained in the test unit reaches the preset accuracy, taking the current entity identification model as a final entity identification model;
the test unit is used for testing and verifying the current entity identification model by adopting a test set to obtain an identification result, and judging whether the accuracy of the identification result reaches a preset accuracy;
the entity reasoning unit 31, the probability distribution difference calculating unit 32, the labeling sample set updating unit 33, the labeling unit 41 and the entity recognition model training unit 42 are further configured so that, when the accuracy of the recognition result obtained in the testing unit does not reach the preset accuracy, they are executed cyclically in sequence for new unlabeled samples, and the cycle stops once the recognition result obtained by test verification of the current entity recognition model in the testing unit reaches the preset accuracy.
Preferably, in connection with an embodiment of the present invention, there is provided a computer-readable storage medium storing one or more programs, which when executed by a computer device, cause the computer device to perform any one of the aforementioned sample selection methods.
Preferably, as shown in fig. 5, in connection with an embodiment of the present invention, there is provided a computer apparatus including:
a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform any of the sample selection methods described above.
The beneficial technical effects obtained by the embodiment of the invention are as follows:
The labeling value of an unlabeled sample is quantified by comparing its probability distribution difference value against the entity categories of the labeled samples; unlabeled samples with a large probability distribution difference value are manually labeled and the entity recognition model is trained again, so that its entity recognition performance is continuously enhanced through cyclic iterative training. Practical application in financial and medical scenarios shows that, compared with existing algorithms, the training samples determined by the algorithm based on probability distribution differences significantly reduce both the number of iterations required for the entity recognition model to converge and the number of labeled samples needed to reach a given performance. A better entity recognition effect is thus obtained with fewer samples. Meanwhile, for the same labeled sample size, the algorithm improves the recall rate of the entity recognition model and thereby effectively improves its overall performance. The algorithm effectively reduces the dependence of the named entity recognition task on labeled samples, picks out the samples with higher learning value, and reduces labeling cost.
A sampling model based on probability distribution differences is used for active-learning sample selection; a named entity recognition model is built with deep learning techniques, new samples are added through sampling, the entity recognition model is retrained on the labeled samples including the new ones, and the predictive reasoning effect of the model improves step by step.
The foregoing technical solutions of the embodiments of the present invention will be described in detail with reference to specific application examples, and reference may be made to the foregoing related description for details of the implementation process that are not described.
The embodiment of the invention discloses an entity recognition method based on differences in tag-class probability distributions. Sample selection is realized through an active learning algorithm based on tag probability distribution differences: the labeling value of an unlabeled sample is quantified by comparing the probability distribution of its entity-type counts with that of the labeled samples, so that labeling effort is focused on samples related to low recall.
The sample selection method and the entity identification method of the embodiment of the invention comprise the following steps:
1. Labeling samples
A subset of samples L is randomly extracted from the unlabeled sample set U and manually labeled through a labeling platform to form a labeling sample set. Since the sampling model algorithm P based on probability distributions is computed from the entity-type data of the labeled samples, the sample labeling extends the traditional BIO format by adding the count of each entity type. In the BIO format, B is short for begin and marks the beginning of an entity; I is short for inside and marks the middle of an entity; O is short for outside and marks tokens that do not belong to any entity.
Suppose labeling a sample yields two types of entities, where the first type is person names (c) and the second type is places (d). For a sample mentioning the persons "Xiao Ming" and "Xiao Hong" and the place "Beijing", the labeling result is [c: {[0,1,"Xiao Ming"], [2,3,"Xiao Hong"], 2}, d: {[7,8,"Beijing"], 1}]; the count of person names c is the number 2 before the first closing brace, and the count of places d is the number 1 before the second closing brace.
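As a sketch, this extended annotation could be held in a structure like the following; the field names are illustrative, not the patent's.

```python
# Hypothetical representation of the extended BIO annotation: entity spans
# plus a per-type count, matching the example above.
annotation = {
    "c": {"spans": [(0, 1, "Xiao Ming"), (2, 3, "Xiao Hong")], "count": 2},  # person names
    "d": {"spans": [(7, 8, "Beijing")], "count": 1},                         # places
}

# The per-sample entity-type count vector fed to the sampling model:
x = [annotation["c"]["count"], annotation["d"]["count"]]  # x = [2, 1]
```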
2. Constructing a preliminary entity recognition model: using a deep learning technique such as BERT-CRF or BiLSTM-CRF, perform multiple rounds of training on the labeled samples of the labeling sample set through a training platform; when performance on the test set becomes stable, the entity recognition model M is obtained;
3. The specific steps for constructing the sampling model P based on probability distribution differences are as follows:
For the labeling sample set, the number of entities of each category in each labeled sample is obtained from its labeling result. Based on the counts of the two entity types (person names and places) in each labeled sample, the count pair of the $i$-th labeled sample is $(x_b^{(i)}, x_c^{(i)})$, where $i$ is the index of the labeled sample, $x_b^{(i)}$ is the number of entities of the first type (the number of person names) in the $i$-th labeled sample, and $x_c^{(i)}$ is the number of entities of the second type (the number of places) in the $i$-th labeled sample. The number of occurrences of the first entity type in each labeled sample forms a vector $X_b$, the number of occurrences of the second entity type in each labeled sample forms a vector $X_c$, and $n$ denotes the number of labeled samples.
Based on the vectors $X_b$ and $X_c$, the Gaussian probability density function of the counts of each entity type in the labeled samples is calculated:
$$f(x) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_X)^{T}\,\Sigma^{-1}\,(x-\mu_X)\right) \quad (3)$$
and this Gaussian probability density function is taken as the sampling model based on probability distribution differences.
Here $\mu_X$ is the mean matrix, $\mu_X = [\bar{X}_b, \bar{X}_c]$, holding the mean count of each entity type over the labeling sample set: for $X_b$ its component is the mean of all the $X_b$ values, for $X_c$ it is the mean of all the $X_c$ values, and if there are other entity types, the rest follow by analogy. $\Sigma$ is the covariance matrix of the count vectors, $x$ denotes the count vector of a labeled sample, $T$ denotes the matrix transpose, and $m$ denotes the number of entity types.
From formula (3), the Gaussian probability density $f(x)$ of the entity-type counts in the labeling sample set is obtained. To better describe the degree of abnormality of the entity-type distribution, formula (3) is further normalized to obtain the sample sampling function $P(x)$. The sample sampling function $P(x)$ is taken as the final sampling model based on probability distribution differences, where $P(x)$ compresses the difference value into the range $[0,1]$.
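A compact numerical sketch of this construction, assuming NumPy and SciPy (neither named by the patent) and the $1 - f(x)/f(\mu_X)$ normalization hypothesized earlier:

```python
import numpy as np
from scipy.stats import multivariate_normal


def fit_sampling_model(counts):
    """Fit P(x) from per-sample entity-type counts of shape (n_samples, m_types).

    The [0,1] normalization 1 - f(x)/f(mu) is an assumption; the patent's
    exact normalization formula is not reproduced here.
    """
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean(axis=0)               # mean matrix mu_X
    sigma = np.cov(counts, rowvar=False)   # covariance matrix Sigma
    density = multivariate_normal(mean=mu, cov=sigma)
    peak = density.pdf(mu)                 # the density is maximal at the mean

    def P(x):
        # Larger value = larger difference from the labeled distribution.
        return 1.0 - density.pdf(x) / peak

    return P


# Toy usage with counts for two entity types per labeled sample:
P = fit_sampling_model([[1, 1], [2, 1], [3, 3], [2, 1], [1, 1]])
print(P([1, 0]), P([1, 1]))  # the rarer count pattern scores higher
```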
4. Inference entity
The entity recognition model M performs reasoning on the unlabeled samples in the unlabeled sample set U to obtain reasoning results; each reasoning result comprises: the entities, the type corresponding to each entity, and the number of entities of each type.
5. Whether the reasoning result is correct is not judged manually; the reasoning results are input directly into the sampling model P based on probability distribution differences to obtain the probability distribution difference values of the various types of entities. The larger the probability distribution difference value, the larger the difference between the entity-type distribution of the unlabeled sample and that of the labeled data set, that is, the stronger the sample's entity-type difference characteristics relative to the labeled samples, and the more such an outlier deserves attention. The difference distribution values of the unlabeled samples are sorted, the unlabeled samples not lower than the threshold are taken as supplementary samples to form a supplementary data set N, and N is handed to annotators for labeling. The supplementary data set is added to the labeling sample set, and the entity recognition model is trained and updated again; the updated entity recognition model can recognize entities of the same entity types as those in the supplementary samples, thereby improving the recall rate.
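Continuing the sketch above, selecting the supplementary data set N can be expressed as follows; the threshold value is hypothetical, as the patent does not fix one.

```python
# Score each unlabeled sample's inferred entity-type count vector with P(x)
# and keep those whose difference value is not lower than the threshold.
THRESHOLD = 0.9  # hypothetical value, for illustration only


def build_supplementary_set(P, samples_with_counts):
    # samples_with_counts: iterable of (sample_text, count_vector) pairs
    ranked = sorted(samples_with_counts, key=lambda sc: P(sc[1]), reverse=True)
    return [s for s, x in ranked if P(x) >= THRESHOLD]  # supplementary data set N
```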
A specific example of constructing and applying the sampling model P based on probability distribution differences is as follows. Taking actual data of medical examination texts as an example, the labeling sample set contains 1000 labeled samples, each labeled with the two types of medical entities of examination site and examination method, forming the vectors $X_b$ and $X_c$ for the two medical entity types. These two 1000-dimensional vectors $X_b$, $X_c$ give the number of occurrences of the corresponding medical entity in each labeled sample:
$$X_b = [1, 2, 3, \ldots, 2, 1]$$
$$X_c = [1, 1, 3, \ldots, 1, 1]$$
Then, based on $X_b$ and $X_c$, the mean matrix $\mu_X$ and the covariance matrix $\Sigma$ are calculated; these are the key parameters of the sample sampling function $P(x)$:
$$\mu_X = [1.55, 1.52]$$
From these, the sample sampling function $P(x)$ is obtained.
as shown in fig. 6, the probability distribution function obtained based on the 1000 labeling samples described above. The X axis and the Y axis respectively correspond to the number of each type of entity of the checking position and the checking method, and the units of the X axis and the Y axis are 100; z represents the value of P (x). The Z value describes the probability distribution difference of the marked sample, and the more similar to the probability distribution of the marked sample, the smaller the Z value. The more the probability distribution of the labeled sample is deviated, the larger the Z value is. The larger the Z value is, the larger the difference between the labeling sample and the labeling data set is, and the more worth labeling is provided for the entity recognition model learning training to update the entity recognition model, so that the entity recognition performance of the entity recognition model is improved.
Assume two unlabeled samples $s_1$ and $s_2$:
$s_1$ = chest coronal imaging adduction;
$s_2$ = increase sternal side position;
Reasoning recognition is performed with the entity recognition model, and the output reasoning results are:
$p_1$ = [b: {[0,1,"chest"], 1}, c: {[], 0}];
$p_2$ = [b: {[2,3,"chest"], 1}, c: {[4,5,"side position"], 1}];
The vectors $x_1 = [1,0]$ and $x_2 = [1,1]$ corresponding to the unlabeled samples $s_1$ and $s_2$ are input into the sample sampling function $P(x)$, which outputs the corresponding probability distribution difference values:
$$P(x_1) = 0.988$$
$$P(x_2) = 0.166$$
Since the probability distribution difference value $P(x_1)$ of $s_1$ is significantly greater than the value $P(x_2)$ of $s_2$, and $P(x_1)$ exceeds the threshold, the unlabeled sample $s_1$ is manually labeled; training the entity recognition model again with it then yields a larger improvement in the model's entity recognition performance.
6. The supplementary data set N is labeled by annotators and added to the labeling data set, giving the updated labeling data set: L = L + N;
7. Through the updated labeling sample set L, update the key parameters of the sampling function P corresponding to the sampling model based on probability distribution differences, namely the values of the mean matrix $\mu_X$ and the covariance matrix $\Sigma$;
8. Train the entity recognition model M again on the updated sample set L;
9. Verify the updated entity recognition model M on the test set to obtain a recognition result, and judge whether its accuracy reaches the preset accuracy;
10. When the accuracy of the recognition result reaches the preset accuracy, stop training; when it does not, repeat steps 4 to 9 for new unlabeled samples, and stop iterating once the entity recognition model reaches the required accuracy.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of the various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "includes" is intended to be inclusive in a manner similar to the term "comprising" as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks (illustrative logical block), units, and steps described in connection with the embodiments of the invention may be implemented by electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components (illustrative components), elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation is not to be understood as beyond the scope of the embodiments of the present invention.
The various illustrative logical blocks or units described in the embodiments of the invention may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described. A general purpose processor may be a microprocessor, but in the alternative, the general purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a user terminal. In the alternative, the processor and the storage medium may reside as distinct components in a user terminal.
In one or more exemplary designs, the above-described functions of embodiments of the present invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store program code in the form of instructions or data structures and that may be read by a general-purpose or special-purpose computer or processor. Furthermore, any connection is properly termed a computer-readable medium: if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, these are also included in the definition of computer-readable medium. Disk and disc, as used here, include compact disc, laser disc, optical disc, DVD, floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above may also be included within computer-readable media.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not meant to limit the scope of the invention, but to limit the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (14)
1. A method of sample selection, comprising:
inference entity: inputting unlabeled samples into a preliminary entity recognition model to perform reasoning recognition on the entities in the unlabeled samples, and outputting reasoning results corresponding to the unlabeled samples, wherein the reasoning results comprise: entity, entity type, number of entities of each type;
calculating probability distribution differences: determining probability distribution difference values of various types of entities in the reasoning results based on a sampling model of probability distribution differences according to the reasoning results corresponding to each unlabeled sample;
updating the labeling sample set: taking the unlabeled samples with the probability distribution difference value not lower than a threshold value as supplementary samples; the supplementary sample is used for being supplemented to the labeling sample set to form an updated labeling sample set, and the updated labeling sample set is used for training the preliminary entity recognition model again to obtain a final entity recognition model.
2. The sample selection method according to claim 1, further comprising:
the method comprises the steps of constructing a sampling model based on probability distribution differences, wherein the construction of the sampling model based on probability distribution differences specifically comprises the following steps:
obtaining a labeling result of each labeling sample in the labeling sample set after manual labeling, wherein the labeling result comprises the following steps: entity, entity type, number of entities of each type;
and determining a Gaussian probability density function of the entity type based on the number of the entities of each type in each labeling sample, and taking the Gaussian probability density function as a sampling model based on probability distribution difference.
3. The sample selection method according to claim 2, wherein the constructing a sampling model based on a probability distribution difference specifically comprises:
and carrying out normalization processing on the Gaussian probability density function, controlling the value of the Gaussian probability density function within the range of [0,1] to obtain a sample sampling function, and taking the sample sampling function as a final sampling model based on probability distribution difference.
4. The sample selection method according to claim 1, further comprising:
updating the sampling model based on probability distribution differences: updating the parameter values of the sampling model according to each labeled sample in the updated labeled sample set, to obtain an updated sampling model based on probability distribution differences, wherein the updated sampling model is used to select new unlabeled samples as supplementary samples; and wherein the parameter values comprise: the mean matrix of the numbers of entities of each type and the covariance matrix formed by the numbers of entities of each type.
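The claim states only which parameters change; the simplest reading, assumed here, is a full refit over the enlarged labeled set, reusing `build_sampling_model` from the sketch under claim 2:

```python
import numpy as np

def update_sampling_model(old_counts: np.ndarray, new_counts: np.ndarray):
    """Append the per-type entity counts of the newly labeled supplementary
    samples, then refit the mean matrix and covariance matrix."""
    return build_sampling_model(np.vstack([old_counts, new_counts]))
```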
5. A named entity recognition method, comprising the sample selection method of any one of claims 1-4;
the named entity recognition method further comprising:
labeling samples: supplementing the manually labeled supplementary samples to the labeled sample set to form the updated labeled sample set;
updating the entity recognition model: training the preliminary entity recognition model with all labeled samples in the updated labeled sample set to obtain the final entity recognition model;
entity recognition: recognizing the named entities in the data to be recognized through the final entity recognition model.
6. The named entity recognition method according to claim 5, wherein updating the entity recognition model specifically comprises:
training the preliminary entity recognition model with all labeled samples in the updated labeled sample set to obtain a current entity recognition model;
the named entity recognition method further comprising:
test verification: testing the current entity recognition model with a test set to obtain a recognition result, and judging whether the accuracy of the recognition result reaches a preset accuracy;
when the accuracy of the recognition result does not reach the preset accuracy, cycling in sequence, for new unlabeled samples, through the steps of inferring entities, calculating probability distribution differences, updating the labeled sample set, labeling samples, and updating the entity recognition model, until the recognition result obtained in the test verification step reaches the preset accuracy, whereupon the cycle stops and the current entity recognition model is taken as the final entity recognition model;
and when the accuracy of the recognition result reaches the preset accuracy, taking the current entity recognition model as the final entity recognition model.
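Claims 1, 5, and 6 together describe a standard active-learning loop. The sketch below is illustrative: `train_fn`, `test_fn`, and `annotate_fn` are hypothetical callables standing in for model training, test-set evaluation, and manual labeling, and the accuracy target and round cap are assumed values; it reuses `select_supplementary_samples` from the sketch under claim 1.

```python
def train_until_accurate(
    labeled: list,
    unlabeled: list,
    train_fn,                       # hypothetical: labeled set -> trained model
    test_fn,                        # hypothetical: model -> accuracy on a test set
    annotate_fn,                    # hypothetical: stands in for manual labeling
    sampling_model,
    target_accuracy: float = 0.90,  # the "preset accuracy"; value assumed
    max_rounds: int = 20,           # safety bound, not part of the claims
):
    """Train, test, and, while accuracy falls short of the target, select
    and label supplementary samples, then retrain on the enlarged set."""
    model = train_fn(labeled)                        # current recognition model
    for _ in range(max_rounds):
        if test_fn(model) >= target_accuracy:
            break                                    # current model becomes final
        extra = select_supplementary_samples(unlabeled, model, sampling_model)
        labeled = labeled + annotate_fn(extra)       # labeling-samples step
        unlabeled = [s for s in unlabeled if s not in extra]
        model = train_fn(labeled)                    # model-updating step
    return model
```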
7. A sample selection system, comprising:
the entity inference unit, which is used for inputting unlabeled samples into the preliminary entity recognition model to infer the entities in the unlabeled samples, and for outputting an inference result corresponding to each unlabeled sample, wherein the inference result comprises: the entities, the entity types, and the number of entities of each type;
the probability distribution difference calculating unit, which is used for determining, according to the inference result corresponding to each unlabeled sample and based on a sampling model of probability distribution differences, a probability distribution difference value for the entities of each type in the inference result;
the labeled sample set updating unit, which is used for taking the unlabeled samples whose probability distribution difference value is not lower than a threshold as supplementary samples, wherein the supplementary samples are to be added to the labeled sample set to form an updated labeled sample set, and the updated labeled sample set is used to retrain the preliminary entity recognition model to obtain a final entity recognition model.
8. The sample selection system of claim 7, further comprising a probability distribution difference-based sampling model building unit configured to:
obtain the labeling result of each labeled sample in the labeled sample set after manual labeling, wherein the labeling result comprises: the entities, the entity types, and the number of entities of each type;
and determine a Gaussian probability density function over the entity types based on the number of entities of each type in each labeled sample, and take the Gaussian probability density function as the sampling model based on probability distribution differences.
9. The sample selection system according to claim 8, wherein the sampling model building unit is further configured to:
normalize the Gaussian probability density function so that its value is controlled within the range [0,1], to obtain a sample sampling function, and take the sample sampling function as the final sampling model based on probability distribution differences.
10. The sample selection system of claim 7, further comprising:
the sampling model updating unit, which is used for updating the parameter values of the sampling model based on probability distribution differences according to each labeled sample in the updated labeled sample set, to obtain an updated sampling model based on probability distribution differences, wherein the updated sampling model is used to select new unlabeled samples as supplementary samples; and wherein the parameter values comprise: the mean matrix of the numbers of entities of each type and the covariance matrix formed by the numbers of entities of each type.
11. A named entity recognition system comprising the sample selection system of any one of claims 7-10;
the named entity recognition system further comprises:
the labeling unit, which is used for supplementing the manually labeled supplementary samples to the labeled sample set to form the updated labeled sample set;
the entity recognition model training unit, which is used for training the preliminary entity recognition model with all labeled samples in the updated labeled sample set to obtain the final entity recognition model;
and the entity recognition unit, which is used for recognizing the named entities in the data to be recognized through the final entity recognition model.
12. The named entity recognition system of claim 11, further comprising a test unit, wherein:
the entity recognition model training unit is specifically configured to train the preliminary entity recognition model with all labeled samples in the updated labeled sample set to obtain a current entity recognition model, and, when the accuracy of the recognition result obtained by the test unit reaches the preset accuracy, to take the current entity recognition model as the final entity recognition model;
the test unit is used for testing the current entity recognition model with a test set to obtain a recognition result, and for judging whether the accuracy of the recognition result reaches a preset accuracy;
the entity inference unit, the probability distribution difference calculating unit, the labeled sample set updating unit, the labeling unit, and the entity recognition model training unit are configured to execute cyclically in sequence, for new unlabeled samples, whenever the accuracy of the recognition result obtained by the test unit does not reach the preset accuracy, and to stop the cycle once the recognition result obtained by testing the current entity recognition model reaches the preset accuracy.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs which, when executed by a computer device, cause the computer device to perform the sample selection method of any one of claims 1-4.
14. A computer device, comprising:
a processor; and a memory arranged to store computer-executable instructions which, when executed, cause the processor to perform the sample selection method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311609958.6A CN117634489B (en) | 2023-11-29 | 2023-11-29 | Sample selection method, sample selection system, identification method and identification system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117634489A (en) | 2024-03-01
CN117634489B (en) | 2024-09-24
Family
ID=90035195
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311609958.6A Active CN117634489B (en) | 2023-11-29 | 2023-11-29 | Sample selection method, sample selection system, identification method and identification system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117634489B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107909097A (en) * | 2017-11-08 | 2018-04-13 | 阿里巴巴集团控股有限公司 | The update method and device of sample in sample storehouse |
CN108959474A (en) * | 2018-06-20 | 2018-12-07 | 上海交通大学 | Entity relationship extracting method |
CN109657056A (en) * | 2018-11-14 | 2019-04-19 | 金色熊猫有限公司 | Target sample acquisition methods, device, storage medium and electronic equipment |
CN112633002A (en) * | 2020-12-29 | 2021-04-09 | 上海明略人工智能(集团)有限公司 | Sample labeling method, model training method, named entity recognition method and device |
Non-Patent Citations (1)
Title |
---|
KANG Shantong (康善同): "检索匹配：深度学习在搜索、广告、推荐系统中的应用" [Retrieval and Matching: Applications of Deep Learning in Search, Advertising, and Recommender Systems], 30 June 2022, 机械工业出版社 (China Machine Press), pages 54-57 *
Also Published As
Publication number | Publication date |
---|---|
CN117634489B (en) | 2024-09-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |