CN112560459A - Sample screening method, device, equipment and storage medium for model training

Info

Publication number: CN112560459A (application); CN112560459B (granted version)
Authority: CN (China)
Application number: CN202011407811.5A
Prior art keywords: probability, entity, category, words, word
Other languages: Chinese (zh)
Inventors: 计云杰, 戴岱, 肖欣延
Assignee (current and original): Beijing Baidu Netcom Science and Technology Co Ltd
Legal status: Granted; Active

Classifications

    • G06F40/279: Recognition of textual entities (G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing; G06F40/00: Handling natural language data; G06F40/20: Natural language analysis)
    • G06N20/00: Machine learning (G06N: Computing arrangements based on specific computational models)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a sample screening method, apparatus, device, and storage medium for model training, and relates to the field of natural language processing. The model is used to identify the predicted entity words contained in each sample and the probability that each predicted entity word belongs to each entity category. The specific implementation scheme is as follows: a set of entity words is obtained; for each entity word in the set, a first probability that the word belongs to the prediction category and a second probability that it belongs to the labeling category are determined according to the probabilities, output by the model, that the word belongs to each entity category; noise words are determined from the set according to the difference between the first probability and the second probability; and the samples to which the noise words belong are deleted. In this way, noise words in the set are screened out based on the first and second probabilities output by the model, samples containing noise words are removed, and the accuracy of model training is improved.

Description

Sample screening method, device, equipment and storage medium for model training
Technical Field
The application discloses a sample screening method, a sample screening device, sample screening equipment and a storage medium for model training, and relates to the technical field of natural language processing.
Background
The purpose of an entity recognition task is to extract fields of interest, such as person names and place names, from text. The medical field, for example, may focus on fields such as symptoms, diseases, and drugs. At present, entity recognition methods based on deep learning achieve good extraction results, but the good recognition performance of a deep learning model depends on high-quality manually labeled data.
However, in more specialized fields such as medicine, labeling requires strong domain background knowledge and is therefore difficult, so the category of each entity word in a manually labeled sample is often erroneous, i.e., noisy. In this case, the deep learning model is affected by the noise and cannot achieve the desired effect. An effective method for finding the noise in the text, so that erroneous labels can be corrected, is therefore of great significance for improving the quality of model training samples.
Disclosure of Invention
The application provides a sample screening method, a sample screening device, sample screening equipment and a storage medium for model training.
An embodiment of a first aspect of the present application provides a sample screening method for model training, where the model is used to identify a predicted entity word included in each sample, and identify probabilities that the predicted entity word belongs to each entity category, respectively, and the method includes:
acquiring a set of entity words, wherein the entity words in the set are the predicted entity words obtained by identifying each sample by the model and/or the labeled entity words of each sample;
for the entity words in the set, determining a first probability that the corresponding entity word belongs to the prediction category and a second probability that the corresponding entity word belongs to the labeling category, according to the probabilities, output by the model, that the corresponding entity word belongs to each entity category;
determining a noise word from the set according to a difference between the first probability and the second probability;
deleting the sample to which the noise word belongs.
An embodiment of a second aspect of the present application provides a sample screening apparatus for model training, where the model is configured to identify a predicted entity word included in each sample, and identify probabilities that the predicted entity word belongs to each entity category, respectively, and the apparatus includes:
the obtaining module is used for obtaining a set of entity words, wherein the entity words in the set are the predicted entity words obtained by identifying each sample by the model and/or the labeled entity words of each sample;
a first determining module, configured to determine, for the entity words in the set, a first probability that the corresponding entity word belongs to the prediction category and a second probability that the corresponding entity word belongs to the labeling category according to the probabilities, output by the model, that the corresponding entity word belongs to each entity category;
a second determining module, configured to determine a noise word from the set according to a difference between the first probability and the second probability;
and the deleting module is used for deleting the sample to which the noise word belongs.
An embodiment of a third aspect of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample screening method for model training of the first aspect.
A fourth aspect of the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the sample screening method for model training described in the first aspect.
An embodiment of a fifth aspect of the present application provides a computer program product comprising a computer program that, when executed by a processor, implements the sample screening method for model training described in the foregoing embodiments.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flowchart of a sample screening method for model training according to the first embodiment of the present application;
fig. 2 is a schematic flowchart of a sample screening method for model training according to the second embodiment of the present application;
fig. 3 is a schematic flowchart of a sample screening method for model training according to the third embodiment of the present application;
fig. 4 is an exemplary diagram of a target matrix provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a sample screening apparatus for model training according to the fourth embodiment of the present application;
fig. 6 is a block diagram of an electronic device for implementing the sample screening method for model training according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, which should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
In the related art, a confident learning algorithm is usually adopted to find noise in data. However, confident learning achieves good results only on problems such as image classification and text classification; it cannot identify the noise words among entity words.
For example, suppose the text to be recognized is "Wang Xiaoming went to Beihai Park to fly a kite today; Beihai Park is located in the central area of Beijing City." In this text, "Wang Xiaoming" should be labeled as PER, i.e., a person, while "Beihai Park" and "Beijing City" should be labeled as LOC, i.e., place names. Suppose the annotator identifies "Beihai Park" and labels its category correctly, but overlooks "Beijing City" and wrongly labels "Wang Xiaoming" as LOC, while the model identifies "Beijing City" and "Wang Xiaoming" and predicts their categories correctly, but fails to predict "Beihai Park".
Furthermore, a word may not belong to any entity category; for example, "located" cannot be labeled as PER or LOC.
In view of this, the present application provides a sample screening method for model training, where the model is used to identify the predicted entity words contained in each sample and the probabilities that the predicted entity words belong to each entity category. A set of entity words is obtained; for each entity word in the set, a first probability that the corresponding entity word belongs to the prediction category and a second probability that it belongs to the labeling category are determined according to the probabilities, output by the model, that the word belongs to each entity category; and noise words are determined from the set according to the difference between the first probability and the second probability, so that the samples to which the noise words belong can be deleted.
A sample screening method, an apparatus, a device, and a storage medium for model training according to embodiments of the present application are described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a sample screening method for model training according to the first embodiment of the present application.
The embodiment of the present application is described by taking as an example that the sample screening method for model training is configured in a sample screening apparatus for model training, and this apparatus can be applied to any electronic device, so that the electronic device can perform the sample screening function for model training.
The electronic device may be a Personal Computer (PC), a cloud device, a mobile device, or the like; the mobile device may be a hardware device with an operating system, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or an in-vehicle device.
As shown in fig. 1, the sample screening method for model training may include the following steps:
step 101, a set of entity words is obtained.
The entity words in the set are the predicted entity words obtained by the model identifying each sample, and/or the labeled entity words of each sample.
In the embodiment of the application, the model is used for identifying the predicted entity words contained in each sample and identifying the probability that the predicted entity words belong to each entity category respectively.
As an example, assume a sample of "Wang Xiaoming went to Beihai Park to fly a kite today; Beihai Park is located in the central area of Beijing City." The sample is input into the model, and the model can identify the predicted entity words contained in the sample as "Wang Xiaoming", "Beihai Park", and "Beijing City". The entity categories corresponding to the predicted entity words are person and place, and the model can identify the probability that each predicted entity word belongs to each entity category. For example, the model may identify that "Wang Xiaoming" belongs to the person category with a probability of 95% and to the place category with a probability of 5%.
In one possible case, the entity words included in the acquired set of entity words may be the predicted entity words obtained by the model identifying each sample. For example, the predicted entity words obtained by the model identifying the sample in the above example are "Wang Xiaoming", "Beihai Park", and "Beijing City"; these predicted entity words may constitute a set of entity words.
It should be explained that the entity words in the set may be predicted entity words obtained by the model identifying a plurality of samples, and are not limited to the predicted entity words obtained by the model identifying a single sample.
In another possible case, the entity words included in the acquired entity word set may also be labeled entity words of each sample.
It can be understood that after the samples are obtained, the entity words in the samples can be manually labeled to obtain labeled entity words of the samples, and then the labeled entity words of the samples are combined into a set of entity words.
In another possible case, the entity words included in the acquired set of entity words may be both the predicted entity words obtained by the model identifying each sample and the labeled entity words of each sample.
It can be understood that after each sample is obtained, each sample may be input into the model to obtain the predicted entity words that the model identifies for each sample, and the entity words in each sample may be manually labeled to obtain the labeled entity words of each sample; the predicted entity words and the labeled entity words then form the set of entity words.
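To make the set construction concrete, the following is a minimal Python sketch, not taken from the patent itself; the record layout with sample ids and character spans is an assumption made for illustration. It forms the set as the union of the model's predicted entity words and the manually labeled entity words:

```python
# Hypothetical data layout: each entity word is (text, start, end) within a sample.
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityWord:
    sample_id: int   # index of the training sample the word came from
    text: str        # surface form of the entity word
    span: tuple      # (start, end) character offsets, assumed for illustration

def build_entity_word_set(predicted_by_sample, labeled_by_sample):
    """Union of predicted and labeled entity words over all samples."""
    entity_set = set()
    for source in (predicted_by_sample, labeled_by_sample):
        for sample_id, words in source.items():
            for text, start, end in words:
                entity_set.add(EntityWord(sample_id, text, (start, end)))
    return entity_set

# Usage with the example from the description (offsets are made up):
predicted = {0: [("Wang Xiaoming", 0, 13), ("Beijing City", 60, 72)]}
labeled   = {0: [("Wang Xiaoming", 0, 13), ("Beihai Park", 22, 33)]}
print(build_entity_word_set(predicted, labeled))
```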
Step 102, for the entity words in the set, determining a first probability that the corresponding entity word belongs to the prediction category and a second probability that it belongs to the labeling category, according to the probabilities, output by the model, that the corresponding entity word belongs to each entity category.
For ease of distinction, the probability that an entity word in the set belongs to the prediction category is referred to as the first probability, and the probability that it belongs to the labeling category is referred to as the second probability.
In the embodiment of the application, after the samples are obtained, the samples are input into the model, and the model can identify the predicted entity words contained in the samples and the probabilities that the predicted entity words belong to the entity categories respectively. The entity category may include a prediction category and an annotation category, among others.
It is understood that, for each entity word in the set, a first probability that the corresponding entity word belongs to the prediction category and a second probability that the corresponding entity word belongs to the annotation category may be determined according to the probabilities that the entity words output by the model belong to the prediction category and the annotation category.
Continuing with the above example, assume the entity word is "Beihai Park", which is correctly labeled as a place but which the model does not predict to be an entity. In this case, the first probability that the entity word belongs to the prediction category, as output by the model, is 90%, and the second probability that it belongs to the labeling category is 10%.
Step 103, determining noise words from the set according to the difference between the first probability and the second probability.
In the embodiment of the present application, for each entity word in the set, after the first probability that the word belongs to the prediction category and the second probability that it belongs to the labeling category are determined, the difference between the first probability and the second probability is calculated, and whether the entity word is a noise word is determined according to this difference.
In a possible case, a difference threshold between the first probability and the second probability may be preset, and after determining the first probability that each entity word belongs to the prediction category and the second probability that each entity word belongs to the labeling category, the difference between the first probability and the second probability may be determined. And if the difference value between the first probability and the second probability of a certain entity word is smaller than the set difference value threshold, determining that the entity word is not a noise word.
In another possible case, if a difference between the first probability and the second probability of a certain entity word is greater than a set difference threshold, the entity word is determined to be a noise word.
It can be understood that, for an entity word in the set, when the difference between the first probability that the word belongs to the prediction category and the second probability that it belongs to the labeling category, as output by the model, is large, the manually labeled category is inconsistent with the category predicted by the model. To improve the accuracy of the model, entity words with a large difference between the first probability and the second probability may be determined to be noise words.
Step 104, deleting the samples to which the noise words belong.
In the embodiment of the present application, after the noise word is determined from the set, the sample to which the noise word belongs is further determined, so as to delete the sample to which the noise word belongs. Therefore, samples containing noise words can be screened out, and accuracy of model training is improved.
As an example, assuming that entity word A is determined from the set of entity words to be a noise word and belongs to sample A, sample A may be deleted.
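As an illustration of steps 102 to 104, the following Python sketch flags entity words whose first and second probabilities diverge and drops the samples they belong to. It is an assumption-based sketch, not the patent's implementation: the data structures and the 0.5 threshold are invented for the example, since the description only requires "a set difference threshold".

```python
DIFF_THRESHOLD = 0.5  # assumed value; the description only requires "a set threshold"

def find_noise_words(entity_words, category_probs, predicted_cat, labeled_cat):
    """category_probs[w][c]: model probability that word w belongs to category c;
    predicted_cat[w] / labeled_cat[w]: predicted and manually labeled categories."""
    noise = []
    for w in entity_words:
        first = category_probs[w][predicted_cat[w]]   # P(prediction category)
        second = category_probs[w][labeled_cat[w]]    # P(labeling category)
        if first - second > DIFF_THRESHOLD:           # large disagreement -> noise
            noise.append(w)
    return noise

def drop_noisy_samples(samples, noise_words, sample_of_word):
    """Delete every sample that contains at least one noise word."""
    noisy_ids = {sample_of_word[w] for w in noise_words}
    return [s for i, s in enumerate(samples) if i not in noisy_ids]
```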
With the sample screening method for model training of the embodiment of the present application, the model identifies the predicted entity words contained in each sample and the probability that each predicted entity word belongs to each entity category. A set of entity words is obtained; for the entity words in the set, noise words are determined from the set according to the difference between the first probability that an entity word belongs to the prediction category and the second probability that it belongs to the labeling category, both derived from the per-category probabilities output by the model; and the samples to which the noise words belong are deleted. Noise words in the set are thus screened out based on the first and second probabilities output by the model, and samples containing noise words are removed, which helps to improve the accuracy of model training.
On the basis of the above embodiment, in one possible scenario, the prediction category or the labeling category of an entity word may be the non-entity word category. In this case, the first probability that the entity word belongs to the prediction category and the second probability that it belongs to the labeling category cannot both be determined directly from the output of the model, and a set probability upper limit needs to be taken into account when determining them. Fig. 2 is a schematic flowchart of a sample screening method for model training according to the second embodiment of the present application.
As shown in fig. 2, the sample screening method for model training may further include the following steps:
in step 201, a set of entity words is obtained.
The entity words in the set are the predicted entity words obtained by the model identifying each sample, and/or the labeled entity words of each sample.
In the embodiment of the present application, the implementation process of step 201 may refer to the implementation process of step 101 in the foregoing embodiment, and is not described herein again.
Step 202, for the entity words in the set, if the prediction category of an entity word is the non-entity word category, the corresponding second probability is the probability, output by the model, that the entity word belongs to the labeling category, and the first probability is the difference between the set probability upper limit and the second probability.
In the embodiment of the application, the prediction category and the labeling category of the entity word output by the model both include a non-entity word category and each entity category.
The term "non-entity" may refer to a category to which a word that does not belong to an entity belongs, such as "located", "where", and the like.
In one possible case, for the entity words in the set, if the prediction category of the corresponding entity word output by the model is the non-entity word category, in this case, the second probability that the entity word belongs to the tagging category is the probability that the entity word output by the model belongs to the tagging category. The first probability that the entity word belongs to the prediction category is the difference between the set probability upper limit and the second probability.
Wherein the upper limit of the set probability may be 1.
As an example, assuming that the prediction category of the entity word B in the set is a non-entity word category, and the entity word B has a second probability of belonging to the labeled category, the probability that the entity word B belongs to the labeled category may be output as a model, and the first probability that the entity word B belongs to the prediction category may be a difference of 1 minus the second probability.
Step 203, for the entity words in the set, if the labeling category of an entity word is the non-entity word category, the corresponding first probability is the probability of the prediction category output by the model, and the second probability is the difference between the set probability upper limit and the first probability.
In one possible case, for an entity word in the set, if the labeling category of the corresponding entity word is the non-entity word category, then the first probability that the entity word belongs to the prediction category is the probability of the prediction category output by the model, and the second probability that the entity word belongs to the labeling category is the difference between the set probability upper limit and the first probability.
It can be understood that, since annotators do not explicitly label words as the non-entity word category, when the labeling category of an entity word is the non-entity word category the model provides no direct probability for the labeling category; in this case, the second probability may be determined as the difference between the set probability upper limit and the first probability.
As an example, assuming that the labeling category of entity word C in the set is the non-entity word category, the first probability that entity word C belongs to the prediction category may be output by the model, and the second probability that entity word C belongs to the labeling category may be 1 minus the first probability.
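The handling of the non-entity word category can be sketched as follows, assuming the non-entity category is tagged "O" and the set probability upper limit is 1, as in the examples above; this is a hedged sketch under those assumptions, not the patent's code:

```python
PROB_UPPER_LIMIT = 1.0  # the set probability upper limit from the description
NON_ENTITY = "O"        # assumed tag for the non-entity word category

def first_and_second_prob(probs, predicted_cat, labeled_cat):
    """probs[c]: model probability that the word belongs to entity category c."""
    if predicted_cat == NON_ENTITY:
        # Step 202: the model saw no entity, so take the labeling-category
        # probability from the model and derive the other from the upper limit.
        second = probs[labeled_cat]
        first = PROB_UPPER_LIMIT - second
    elif labeled_cat == NON_ENTITY:
        # Step 203: the annotator saw no entity, so take the prediction-category
        # probability from the model and derive the other from the upper limit.
        first = probs[predicted_cat]
        second = PROB_UPPER_LIMIT - first
    else:
        first, second = probs[predicted_cat], probs[labeled_cat]
    return first, second

# "Beihai Park": labeled LOC, not predicted as an entity -> (0.9, 0.1)
print(first_and_second_prob({"PER": 0.0, "LOC": 0.1}, NON_ENTITY, "LOC"))
```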
Step 204, determining noise words from the set according to the difference between the first probability and the second probability.
Step 205, deleting the sample to which the noise word belongs.
In the embodiment of the present application, the implementation processes of step 204 and step 205 may refer to the implementation processes of step 103 and step 104 in the foregoing embodiment, and are not described herein again.
With the sample screening method for model training of the embodiment of the present application, after the set of entity words is obtained, for each entity word in the set: if the prediction category of the entity word is the non-entity word category, the second probability is the probability, output by the model, that the word belongs to the labeling category, and the first probability is the difference between the set probability upper limit and the second probability; if the labeling category of the entity word is the non-entity word category, the first probability is the probability of the prediction category output by the model, and the second probability is the difference between the set probability upper limit and the first probability. Noise words are then determined according to the difference between the first probability and the second probability, and the samples to which they belong are deleted. In this way, entity words involving the non-entity word category can also be screened, so that samples containing such noise words are screened out.
On the basis of the foregoing embodiments, in step 103 above, when noise words are determined from the set according to the difference between the first probability and the second probability, a target matrix may first be generated from the entity words in the set, so that noise words can be determined, according to the difference between the first probability and the second probability, from the entity words characterized by the target elements of that matrix. This is described in detail with reference to fig. 3, which is a schematic flowchart of a sample screening method for model training according to the third embodiment of the present application.
As shown in fig. 3, the sample screening method for model training may further include the following steps:
in step 301, a set of entity words is obtained.
The entity words in the set are the predicted entity words obtained by the model identifying each sample, and/or the labeled entity words of each sample.
Step 302, for the entity words in the set, determining a first probability that the corresponding entity word belongs to the prediction category and a second probability that it belongs to the labeling category, according to the probabilities, output by the model, that the corresponding entity word belongs to each entity category.
In the embodiment of the present application, the implementation processes of step 301 and step 302 may refer to the implementation processes of step 101 and step 102 in the foregoing embodiment, and are not described herein again.
Step 303, generating a target matrix according to the entity words in the set.
The rows in the target matrix correspond to the labeling categories, the columns correspond to the prediction categories, and the elements in the target matrix represent the entity words which accord with the labeling categories corresponding to the rows and the prediction categories corresponding to the columns.
It can be understood that, for the entity words in the set, after the first probability that each entity word belongs to the prediction category and the second probability that it belongs to the labeling category are determined from the probabilities, output by the model, that the word belongs to each entity category, the categories to which each entity word belongs are known, and the target matrix is generated according to those categories.
As an example, as shown in fig. 4, assume that the rows in the target matrix correspond to labeling categories and the columns correspond to prediction categories. The entity word "Wang Xiaoming" is erroneously labeled as the LOC category but predicted by the model as the PER category, so the element in the second row and first column of the target matrix is 1; that is, the entity word in the second row and first column conforms to the labeling category of its row and the prediction category of its column. The entity word "Beihai Park" is correctly labeled as the LOC category, but the model does not predict it to be an entity, so the element in the second row and third column is 1; that is, "Beihai Park" is manually labeled as the LOC category but is not recognized by the model. The entity word "Beijing City" is not recognized as an entity by the annotator, but the model correctly predicts it to be the LOC category, so the element in the third row and second column is 1; that is, "Beijing City" is not manually labeled as an entity, but the model predicts it to be an entity of some category.
Step 304, obtaining target elements from the target matrix.
The target element is an element of which the labeling category corresponding to the row is not matched with the prediction category corresponding to the column.
In the embodiment of the application, after the target matrix is generated according to the entity words in the set, the elements of which the labeling categories corresponding to the rows are not matched with the prediction categories corresponding to the columns can be obtained from the target matrix to serve as the target elements.
It can be understood that the entity word represented by the element whose labeled category corresponding to the row in the target matrix does not match the predicted category corresponding to the column may be a noise word.
Continuing with the example in fig. 4, as can be seen from the matrix in fig. 4, the labeled categories corresponding to the rows in the second row and the first column, the second row and the third column, and the third row and the second column do not match the prediction categories corresponding to the columns, and therefore, the elements in the second row and the first column, the second row and the third column, and the third row and the second column can be obtained as the target elements.
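Steps 303 and 304 can be illustrated with the following sketch, assuming the fixed category order PER, LOC, O of the fig. 4 example. One assumption made here for illustration: each cell stores the list of matching entity words rather than the 0/1 indicator shown in fig. 4, so that later steps can count and filter the words per cell:

```python
CATEGORIES = ["PER", "LOC", "O"]  # assumed order: rows = labeling, columns = prediction

def build_target_matrix(entity_words, labeled_cat, predicted_cat):
    """target[i][j] collects the entity words whose labeling category is
    CATEGORIES[i] and whose prediction category is CATEGORIES[j]."""
    n = len(CATEGORIES)
    target = [[[] for _ in range(n)] for _ in range(n)]
    for w in entity_words:
        i = CATEGORIES.index(labeled_cat[w])
        j = CATEGORIES.index(predicted_cat[w])
        target[i][j].append(w)
    return target

def target_elements(target):
    """Step 304: off-diagonal cells, where the labeling category of the row
    does not match the prediction category of the column."""
    n = len(target)
    return [(i, j, target[i][j]) for i in range(n) for j in range(n)
            if i != j and target[i][j]]
```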
Step 305, determining noise words from the entity words characterized by the target elements according to the difference between the first probability and the second probability.
In the embodiment of the present application, after the target elements whose labeling category (row) does not match their prediction category (column) are obtained from the target matrix, the difference between the first probability that an entity word belongs to the prediction category and the second probability that it belongs to the labeling category, both output by the model, is used to determine the noise words from the entity words characterized by the target elements.
As a possible implementation manner, a difference between a first probability that the entity word represented by the target element belongs to the prediction category and a second probability that the entity word belongs to the labeling category may be determined, and if the difference is greater than a set threshold, the entity word represented by the target element may be determined to be a noise word.
As another possible implementation, elements whose labeling category (row) matches their prediction category (column) may be obtained from the target matrix as reference elements. According to the first probabilities of the prediction categories to which the entity words characterized by the reference elements belong, the mean of the first probabilities is computed for each prediction category, and this mean is used as the probability threshold of the corresponding prediction category. Further, from the entity words characterized by the target elements, first candidate entity words whose first probability for the prediction category is greater than the corresponding probability threshold are determined, so that noise words can be determined from the first candidate entity words according to the difference between the first probability and the second probability.
Continuing with the example in fig. 4, as can be seen from the matrix in fig. 4, the labeled category corresponding to the row of the diagonal element of the matrix matches the predicted category corresponding to the column, and the diagonal element can be used as the reference element.
In the embodiment of the present application, after the probability threshold of each prediction category is determined, among the entity words characterized by the target elements, those whose first probability for the prediction category is greater than the corresponding probability threshold are taken as first candidate entity words. Noise words are then determined from the first candidate entity words according to the difference between the first probability, output by the model, that each first candidate entity word belongs to the prediction category and the second probability that it belongs to the labeling category. In this way, entity words whose prediction category and labeling category differ can be screened out as noise words.
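A hedged sketch of this implementation follows, reusing the target matrix layout from the previous sketch; first_prob[w], assumed here, holds the first probability of word w from step 302:

```python
from statistics import mean

def category_thresholds(target, first_prob):
    """Threshold for prediction category j = mean first probability over the
    reference (diagonal) cell (j, j), where label and prediction agree."""
    return {j: mean(first_prob[w] for w in target[j][j])
            for j in range(len(target)) if target[j][j]}

def first_candidate_words(target, first_prob, thresholds):
    """Entity words in target (off-diagonal) cells whose first probability
    exceeds the threshold of their prediction category."""
    n = len(target)
    candidates = []
    for i in range(n):
        for j in range(n):
            if i == j or j not in thresholds:
                continue  # skip reference cells and categories with no threshold
            candidates.extend(w for w in target[i][j] if first_prob[w] > thresholds[j])
    return candidates
```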
In the embodiment of the present application, after the elements whose labeling category matches their prediction category are obtained from the target matrix as reference elements, second candidate entity words whose first probability for the prediction category is greater than the corresponding probability threshold may be determined from the entity words characterized by the reference elements. Further, a counting matrix is generated according to the number of first candidate entity words characterized by each target element and the number of second candidate entity words characterized by each reference element.
The counting matrix indicates the number of first candidate entity words or second candidate entity words characterized by each element of the corresponding target matrix.
A joint probability distribution between the prediction categories and the labeling categories is then generated from the counting matrix, and the proportion of noise words is determined from the joint probability distribution, so as to determine the noise words among the entity words characterized by the target elements.
For example, in the counting matrix, for each element at an off-diagonal position, i.e., an element whose labeling category (row) does not match its prediction category (column), the joint probability distribution between the prediction categories and the labeling categories may be generated according to the first probability, output by the model, that the corresponding entity word belongs to the prediction category and the second probability that it belongs to the labeling category.
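The counting matrix and the joint distribution can be sketched as follows; the normalization shown is one plausible reading of the description, in the spirit of confident learning, rather than a verbatim reproduction of the patent's computation:

```python
def count_matrix(target, first_prob, thresholds):
    """count[i][j] = number of candidate words in cell (i, j) whose first
    probability exceeds the threshold of prediction category j."""
    n = len(target)
    return [[sum(1 for w in target[i][j] if first_prob[w] > thresholds.get(j, 1.0))
             for j in range(n)] for i in range(n)]

def joint_distribution(counts):
    """Normalize the counting matrix into an estimated joint distribution over
    (labeling category, prediction category) pairs."""
    total = sum(map(sum, counts)) or 1  # guard against an all-zero matrix
    return [[c / total for c in row] for row in counts]

def noise_proportion(joint):
    """Off-diagonal probability mass: the estimated proportion of noise words."""
    n = len(joint)
    return sum(joint[i][j] for i in range(n) for j in range(n) if i != j)
```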
Step 306, deleting the sample to which the noise word belongs.
In the embodiment of the present application, the implementation process of step 306 may refer to the implementation process of step 104 in the foregoing embodiment, and is not described herein again.
With the sample screening method for model training of the embodiment of the present application, for entity words whose labeling category is inconsistent with the prediction category, whether an entity word is a noise word is determined according to the difference between the first probability, output by the model, that the word belongs to the prediction category and the second probability that it belongs to the labeling category. In this way, entity words whose labeling category and prediction category differ can be identified, so that annotators can correct the categories of wrongly labeled entity words, improving data quality.
In order to realize the above embodiments, the present application provides a sample screening apparatus for model training.
Fig. 5 is a schematic structural diagram of a sample screening apparatus for model training according to a fourth embodiment of the present application.
The model is used for identifying the predicted entity words contained in the samples and identifying the probability that the predicted entity words belong to the entity categories respectively.
As shown in fig. 5, the sample screening apparatus 500 for model training may include: an acquisition module 510, a first determination module 520, a second determination module 530, and a deletion module 540.
The obtaining module 510 is configured to obtain a set of entity words, where the entity words in the set are predicted entity words obtained by identifying, by the model, each sample, and/or labeled entity words of each sample.
The first determining module 520 is configured to determine, for the entity words in the set, a first probability that the corresponding entity word belongs to the prediction category and a second probability that the corresponding entity word belongs to the labeling category according to the probabilities, output by the model, that the corresponding entity word belongs to each entity category.
A second determining module 530, configured to determine a noise word from the set according to a difference between the first probability and the second probability;
and the deleting module 540 is configured to delete the sample to which the noise word belongs.
As a possible situation, the prediction category and the labeling category both include a non-entity word category and each entity category; the first determining module 520 may be further configured to:
if the prediction category of the entity word is the non-entity word category, the corresponding second probability is the probability, output by the model, that the word belongs to the labeling category, and the first probability is the difference between the set probability upper limit and the second probability; if the labeling category of the entity word is the non-entity word category, the corresponding first probability is the probability of the prediction category output by the model, and the second probability is the difference between the set probability upper limit and the first probability.
As another possible case, the second determining module 530 may further include:
the generating unit is used for generating a target matrix according to the entity words in the set; the rows in the target matrix correspond to the labeling categories, the columns correspond to the prediction categories, and the elements in the target matrix represent the entity words which accord with the labeling categories corresponding to the rows and the prediction categories corresponding to the columns.
And the acquisition unit is used for acquiring target elements from the target matrix, wherein the target elements are elements of which the labeling categories corresponding to the rows are not matched with the prediction categories corresponding to the columns.
And the determining unit is used for determining the noise words from the entity words represented by the target elements according to the difference value between the first probability and the second probability.
As another possible scenario, the determining unit may be further configured to:
acquiring reference elements from the target matrix, wherein the reference elements are elements matched with the labeling categories corresponding to the rows and the prediction categories corresponding to the columns; according to the first probability of the prediction category to which the entity word represented by the reference element belongs, counting the mean value of the first probability of each prediction category, and taking the mean value of the first probability of each prediction category as the probability threshold value of the corresponding prediction category; determining a first candidate entity word of which the first probability of the prediction category is greater than the corresponding probability threshold from the entity words characterized by the target element; and determining the noise word from the first candidate entity word according to the difference value between the first probability and the second probability.
As another possible scenario, the determining unit may be further configured to:
determining a second candidate entity word of which the first probability of the prediction category is greater than the corresponding probability threshold from the entity words characterized by the reference elements; generating a counting matrix according to the number of the first candidate entity words represented by the target elements and the number of the second candidate entity words represented by the reference elements; the counting matrix is used for indicating the number of the first candidate words or the second candidate words represented by each element in the corresponding target matrix; generating joint probability distribution between the prediction category and the labeling category according to the counting matrix; and determining the proportion of the noise words according to the joint probability distribution.
It should be noted that the foregoing explanation of the embodiment of the sample screening method for model training is also applicable to the sample screening apparatus for model training, and is not repeated herein.
With the sample screening apparatus for model training of the embodiment of the present application, the model identifies the predicted entity words contained in each sample and the probability that each predicted entity word belongs to each entity category. A set of entity words is obtained; for the entity words in the set, noise words are determined from the set according to the difference between the first probability that an entity word belongs to the prediction category and the second probability that it belongs to the labeling category, both derived from the per-category probabilities output by the model; and the samples to which the noise words belong are deleted. Noise words in the set are thus screened out based on the first and second probabilities output by the model, and samples containing noise words are removed, improving the accuracy of model training.
In order to implement the above embodiments, the present application also provides a server.
The server provided by the application can comprise:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample screening method for model training of the above embodiments.
To implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium storing computer instructions.
The embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the sample screening method for model training described in the above embodiments.
In order to implement the above embodiments, the present application further proposes a computer program product, which includes a computer program, and when being executed by a processor, the computer program implements the sample screening method for model training described in the above embodiments.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for a sample screening method for model training according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the sample screening methods for model training provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the sample screening method for model training provided herein.
The memory 602, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the sample screening method for model training in the embodiments of the present application (e.g., the obtaining module 510, the first determining module 520, the second determining module 530, and the deleting module 540 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, namely, implements the sample screening method for model training in the above method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the sample screening method for model training may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the defects of high management difficulty and weak service extensibility in traditional physical hosts and Virtual Private Server (VPS) services.
According to the technical scheme of the embodiment of the application, the set of entity words is obtained, the entity words in the set are determined according to the probability that the corresponding entity words output by the model belong to each entity category, and the noise words are determined from the set according to the difference value between the first probability that the entity words belong to the prediction category and the second probability that the entity words belong to the labeling category, so that the samples to which the noise words belong are deleted, therefore, the noise words in the set are screened out based on the first probability that the entity words output by the model belong to the prediction category and the second probability that the entity words belong to the labeling category, so that the samples containing the noise words are screened out, and the accuracy of model training is improved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A sample screening method for model training, the model being configured to identify predicted entity words included in respective samples and to identify probabilities that the predicted entity words respectively belong to respective entity categories, the method comprising:
acquiring a set of entity words, wherein the entity words in the set are the predicted entity words obtained by identifying each sample by the model and/or the labeled entity words of each sample;
for each entity word in the set, determining, according to the probabilities output by the model that the entity word belongs to each entity category, a first probability that the entity word belongs to a prediction category and a second probability that the entity word belongs to a labeling category;
determining a noise word from the set according to a difference between the first probability and the second probability;
deleting the sample to which the noise word belongs.
2. The sample screening method according to claim 1, wherein the prediction category and the labeling category each include a non-entity word category and the entity categories;
the determining, for the entity words in the set, a first probability that the corresponding entity word belongs to the prediction category and a second probability that the corresponding entity word belongs to the labeling category according to the probability that the corresponding entity word output by the model belongs to each entity category includes:
if the prediction category of the entity word is the non-entity word category, the corresponding second probability is the probability, output by the model, that the entity word belongs to the labeling category, and the first probability is the difference between a set probability upper limit and the second probability;
if the labeling category of the entity word is the non-entity word category, the corresponding first probability is the probability, output by the model, that the entity word belongs to the prediction category, and the second probability is the difference between the set probability upper limit and the first probability.
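A short sketch of this special-case handling (an editor's illustration, not part of the claims), assuming the set probability upper limit is 1.0 and that the model scores only the categories it can output (both assumptions are the editor's):

    def first_and_second_prob(probs, pred_cat, label_cat, non_entity_cat, upper=1.0):
        # Editor's sketch; 'upper' is the set probability upper limit (assumed 1.0).
        if pred_cat == non_entity_cat:
            # The prediction is "non-entity": take the labeling-category
            # probability from the model and derive the first probability
            # as its complement.
            second = probs[label_cat]
            first = upper - second
        elif label_cat == non_entity_cat:
            # The annotation is "non-entity": take the prediction-category
            # probability from the model and derive the second probability
            # as its complement.
            first = probs[pred_cat]
            second = upper - first
        else:
            first = probs[pred_cat]
            second = probs[label_cat]
        return first, second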
3. The sample screening method of claim 1, wherein said determining noise words from the set based on the difference between the first probability and the second probability comprises:
generating a target matrix according to the entity words in the set, wherein the rows in the target matrix correspond to the labeling categories, the columns correspond to the prediction categories, and each element in the target matrix represents the entity words whose labeling category corresponds to the row and whose prediction category corresponds to the column;
acquiring target elements from the target matrix, wherein the target elements are elements for which the labeling category corresponding to the row does not match the prediction category corresponding to the column;
and determining the noise word from the entity words characterized by the target elements according to the difference value between the first probability and the second probability.
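The target matrix can be pictured with the following sketch (an editor's illustration, not part of the claims; it reuses the hypothetical entity-word dicts from the sketch given after the description above):

    def build_target_matrix(entity_words, num_cats):
        # Rows index labeling categories, columns index prediction categories;
        # each cell collects the entity words matching that (row, column) pair.
        matrix = [[[] for _ in range(num_cats)] for _ in range(num_cats)]
        for word in entity_words:
            matrix[word['label_cat']][word['pred_cat']].append(word)
        return matrix

    def target_elements(matrix):
        # Target elements are the off-diagonal cells, where the labeling
        # category does not match the prediction category.
        return [(i, j, cell)
                for i, row in enumerate(matrix)
                for j, cell in enumerate(row)
                if i != j and cell]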
4. The sample screening method of claim 3, wherein the determining the noise word from the entity words characterized by the target element according to the difference between the first probability and the second probability comprises:
acquiring reference elements from the target matrix, wherein the reference elements are elements for which the labeling category corresponding to the row matches the prediction category corresponding to the column;
according to the first probability of the prediction category to which the entity word represented by the reference element belongs, calculating the mean value of the first probabilities for each prediction category, and taking the mean value for each prediction category as the probability threshold of the corresponding prediction category;
determining a first candidate entity word of which the first probability of the prediction category is greater than the corresponding probability threshold from the entity words characterized by the target element;
and determining the noise word from the first candidate entity word according to the difference value between the first probability and the second probability.
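One hedged reading of this claim, continuing the helpers sketched above (editor's illustration; the fallback threshold of 1.0 for an empty reference cell is the editor's assumption):

    def category_thresholds(matrix):
        # For each prediction category j, average the first probabilities over
        # the reference (diagonal) cell (j, j) to obtain the threshold.
        thresholds = []
        for j in range(len(matrix)):
            probs = [w['probs'][j] for w in matrix[j][j]]
            thresholds.append(sum(probs) / len(probs) if probs else 1.0)
        return thresholds

    def first_candidate_words(matrix, thresholds):
        # Keep the off-diagonal (target) words whose prediction-category
        # probability clears the threshold of that prediction category.
        return [w
                for i, j, cell in target_elements(matrix)
                for w in cell
                if w['probs'][j] > thresholds[j]]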
5. The sample screening method of claim 4, wherein the determining the noise word from the first candidate entity word according to the difference between the first probability and the second probability further comprises:
determining a second candidate entity word of which the first probability of the prediction category is greater than the corresponding probability threshold from the entity words characterized by the reference element;
generating a counting matrix according to the number of the first candidate entity words represented by the target element and the number of the second candidate entity words represented by the reference element; the counting matrix is used for indicating the number of the first candidate entity words or the second candidate entity words represented by each element in the corresponding target matrix;
generating a joint probability distribution between the prediction category and the labeling category according to the counting matrix;
and determining the proportion of the noise words according to the joint probability distribution.
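This step resembles a confident-learning-style noise estimate; that resemblance, and the sketch below, are the editor's observation and illustration rather than language from the patent. It continues the helpers above:

    import numpy as np

    def noise_proportion(matrix, thresholds):
        # Count the confident words in every (labeling, prediction) cell,
        # normalize the counts into a joint probability distribution, and read
        # the estimated proportion of noise words off the off-diagonal mass.
        n = len(matrix)
        counts = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                counts[i, j] = sum(w['probs'][j] > thresholds[j]
                                   for w in matrix[i][j])
        total = counts.sum()
        if total == 0:
            return 0.0
        joint = counts / total                 # joint distribution
        return joint.sum() - np.trace(joint)   # off-diagonal share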
6. A sample screening apparatus for model training, the model being configured to identify a predicted entity word included in each sample and identify a probability that the predicted entity word belongs to each entity category, respectively, the apparatus comprising:
the obtaining module is used for obtaining a set of entity words, wherein the entity words in the set are the predicted entity words obtained by identifying each sample by the model and/or the labeled entity words of each sample;
a first determining module, configured to determine, for the entity words in the set, a first probability that the corresponding entity word belongs to the prediction category and a second probability that the corresponding entity word belongs to the labeling category according to the probabilities, output by the model, that the corresponding entity word belongs to each entity category;
a second determining module, configured to determine a noise word from the set according to a difference between the first probability and the second probability;
and the deleting module is used for deleting the sample to which the noise word belongs.
7. The sample screening apparatus of claim 6, wherein the prediction category and the labeling category each include a non-entity word category and the entity categories; the first determining module is further configured to:
if the prediction category of the entity word is the non-entity word category, the corresponding second probability is the probability, output by the model, that the entity word belongs to the labeling category, and the first probability is the difference between a set probability upper limit and the second probability;
if the labeling category of the entity word is the non-entity word category, the corresponding first probability is the probability, output by the model, that the entity word belongs to the prediction category, and the second probability is the difference between the set probability upper limit and the first probability.
8. The sample screening apparatus of claim 6, wherein the second determining module includes:
the generating unit is used for generating a target matrix according to the entity words in the set, wherein the rows in the target matrix correspond to the labeling categories, the columns correspond to the prediction categories, and each element in the target matrix represents the entity words whose labeling category corresponds to the row and whose prediction category corresponds to the column;
the acquisition unit is used for acquiring target elements from the target matrix, wherein the target elements are elements for which the labeling category corresponding to the row does not match the prediction category corresponding to the column;
and the determining unit is used for determining the noise word from the entity words characterized by the target elements according to the difference value between the first probability and the second probability.
9. The sample screening apparatus of claim 8, wherein the determination unit is further configured to:
acquiring reference elements from the target matrix, wherein the reference elements are elements for which the labeling category corresponding to the row matches the prediction category corresponding to the column;
according to the first probability of the prediction category to which the entity word represented by the reference element belongs, calculating the mean value of the first probabilities for each prediction category, and taking the mean value for each prediction category as the probability threshold of the corresponding prediction category;
determining a first candidate entity word of which the first probability of the prediction category is greater than the corresponding probability threshold from the entity words characterized by the target element;
and determining the noise word from the first candidate entity word according to the difference value between the first probability and the second probability.
10. The sample screening apparatus of claim 8, wherein the determination unit is further configured to:
determining a second candidate entity word of which the first probability of the prediction category is greater than the corresponding probability threshold from the entity words characterized by the reference element;
generating a counting matrix according to the number of the first candidate entity words represented by the target element and the number of the second candidate entity words represented by the reference element; the counting matrix is used for indicating the number of the first candidate entity words or the second candidate entity words represented by each element in the corresponding target matrix;
generating a joint probability distribution between the prediction category and the labeling category according to the counting matrix;
and determining the proportion of the noise words according to the joint probability distribution.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample screening method for model training of any one of claims 1-5.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the sample screening method for model training of any one of claims 1-5.
CN202011407811.5A 2020-12-04 2020-12-04 Sample screening method, device, equipment and storage medium for model training Active CN112560459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011407811.5A CN112560459B (en) 2020-12-04 2020-12-04 Sample screening method, device, equipment and storage medium for model training

Publications (2)

Publication Number Publication Date
CN112560459A (en) 2021-03-26
CN112560459B CN112560459B (en) 2023-10-20

Family

ID=75048478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011407811.5A Active CN112560459B (en) 2020-12-04 2020-12-04 Sample screening method, device, equipment and storage medium for model training

Country Status (1)

Country Link
CN (1) CN112560459B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102739A (en) * 2014-07-28 2014-10-15 Baidu Online Network Technology (Beijing) Co., Ltd. Entity library expansion method and device
WO2018034426A1 (en) * 2016-08-17 2018-02-22 Changwon National University Industry-Academic Cooperation Foundation Method for automatically correcting error in tagged corpus by using kernel PDR
US20190251164A1 (en) * 2018-02-12 2019-08-15 Ricoh Company, Ltd. Entity linking method, electronic device for performing entity linking, and non-transitory computer-readable recording medium
KR20200054121A (en) * 2019-11-29 2020-05-19 Lunit Inc. Method for machine learning and apparatus for the same
CN111339268A (en) * 2020-02-19 2020-06-26 Beijing Baidu Netcom Science and Technology Co., Ltd. Entity word recognition method and device
CN111353310A (en) * 2020-02-28 2020-06-30 Tencent Technology (Shenzhen) Co., Ltd. Named entity identification method and device based on artificial intelligence and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Jing; Tan Shaofeng; He Dongdong; Chen Jianhui; Yan Jianzhuo: "Entity disambiguation algorithm for domain literature based on context features", Beijing Biomedical Engineering, no. 04, pages 72-76 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407719A (en) * 2021-06-10 2021-09-17 Ping An Technology (Shenzhen) Co., Ltd. Text data detection method and device, electronic equipment and storage medium
CN113407719B (en) * 2021-06-10 2023-06-30 Ping An Technology (Shenzhen) Co., Ltd. Text data detection method and device, electronic equipment and storage medium


Similar Documents

Publication Title
CN111221983B (en) Time sequence knowledge graph generation method, device, equipment and medium
CN110569846A (en) Image character recognition method, device, equipment and storage medium
CN112036509A (en) Method and apparatus for training image recognition models
CN111709247A (en) Data set processing method and device, electronic equipment and storage medium
CN112347769A (en) Entity recognition model generation method and device, electronic equipment and storage medium
CN111832298B (en) Medical record quality inspection method, device, equipment and storage medium
CN112001169B (en) Text error correction method and device, electronic equipment and readable storage medium
CN111967302A (en) Video tag generation method and device and electronic equipment
CN111881908B (en) Target detection model correction method, detection device, equipment and medium
CN111079945B (en) End-to-end model training method and device
CN111522967A (en) Knowledge graph construction method, device, equipment and storage medium
CN112380847B (en) Point-of-interest processing method and device, electronic equipment and storage medium
CN111858905B (en) Model training method, information identification device, electronic equipment and storage medium
US20210209160A1 (en) Method and apparatus for identifying map region words
US20220027575A1 (en) Method of predicting emotional style of dialogue, electronic device, and storage medium
CN112507090A (en) Method, apparatus, device and storage medium for outputting information
CN111708477B (en) Key identification method, device, equipment and storage medium
CN112668586A (en) Model training method, image processing device, storage medium, and program product
CN111241810A (en) Punctuation prediction method and device
CN111858883A (en) Method and device for generating triple sample, electronic equipment and storage medium
CN112508003A (en) Character recognition processing method and device
CN111522944A (en) Method, apparatus, device and storage medium for outputting information
CN112270533A (en) Data processing method and device, electronic equipment and storage medium
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN111241302B (en) Position information map generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant