CN113593591B - Corpus noise reduction method and device, electronic equipment and storage medium - Google Patents

Corpus noise reduction method and device, electronic equipment and storage medium

Info

Publication number
CN113593591B
CN113593591B (application CN202110852412.8A)
Authority
CN
China
Prior art keywords
corpus
noise
label
initial
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110852412.8A
Other languages
Chinese (zh)
Other versions
CN113593591A (en)
Inventor
牛海波 (Niu Haibo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110852412.8A
Publication of CN113593591A
Application granted
Publication of CN113593591B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The disclosure relates to a corpus noise reduction method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring an estimated tag distribution of an initial corpus set; acquiring a confidence matrix according to the estimated tag distribution, wherein the confidence matrix describes the tag noise distribution under each category condition; acquiring the noise corpora in the initial corpus set based on the confidence matrix; and processing the noise corpora in the initial corpus set to obtain a target corpus set. In this embodiment, a confidence matrix can be established from the predicted probabilities of the labels and the labeling labels, and the noise corpora in the initial corpus set are identified through the confidence matrix. After the noise corpora are processed, the proportion of noise corpora and the ambiguity information in the target corpus set are reduced and the boundaries of the target corpus set become clearer, which reduces the number of training iterations of the vertical-domain model, thereby reducing the computing resources and time required for training and improving training efficiency.

Description

Corpus noise reduction method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of corpus noise reduction, and in particular relates to a corpus noise reduction method and device, electronic equipment and a storage medium.
Background
With the improvement of the semantic understanding capability of intelligent voice assistants, they have become an important application of intelligent human-machine interaction. Existing intelligent voice assistants generally adopt a multi-domain competition mode to provide intelligent services: the intelligent voice assistant sends a request to a plurality of preset vertical-domain service models; each vertical-domain service model analyzes the request, feeds the service it can provide back to the intelligent voice assistant, and reports the confidence of that service; the intelligent voice assistant then feeds the service with the highest confidence back to the user. The quality of service provided by each vertical-domain service model is therefore of particular importance. In practical applications, the factors influencing the quality of service provided by a vertical-domain service model include the quality of the vertical-domain corpus used to train it.
In practical applications, the corpus sources for vertical-domain service models are diverse and the expressions used in different application scenarios differ greatly, so noise data are easily introduced. When the semantics of a vertical domain are rich, semantic boundaries blur and ambiguity information accompanies them, which reduces annotation accuracy; annotators also understand the corpora differently, so noise is introduced during annotation. Therefore, it is desirable to provide a method for obtaining a high-quality vertical-domain corpus.
Disclosure of Invention
The disclosure provides a corpus noise reduction method and device, electronic equipment and a storage medium, so as to overcome the deficiencies of the related art.
According to a first aspect of an embodiment of the present disclosure, there is provided a corpus noise reduction method, the method including:
Acquiring estimated tag distribution of an initial corpus;
Acquiring a confidence matrix according to the estimated tag distribution, wherein the confidence matrix is used for describing tag noise distribution under the category condition;
acquiring noise corpus in the initial corpus set based on the confidence matrix;
And processing the noise corpus in the initial corpus set to obtain a target corpus set.
Optionally, obtaining an estimated tag distribution of the initial corpus set includes:
Dividing the initial corpus into K subsets, and sequentially taking each subset of the K subsets as a verification set and the other subsets as training sets; k is a positive integer;
Training a preset vertical domain corpus noise reduction model by sequentially utilizing the training set to obtain a trained vertical domain corpus noise reduction model, and acquiring estimated tag distribution of the verification set by utilizing the trained vertical domain corpus noise reduction model to obtain K estimated tag distribution;
Splicing K estimated tag distributions to obtain estimated tag distributions of the initial corpus; the estimated tag distribution is used to approximate a lossless tag distribution representing the initial corpus.
Optionally, the label distribution of the corpus in each subset is the same as the label distribution of the corpus in the initial corpus set.
Optionally, obtaining a confidence matrix according to the estimated tag distribution includes:
Obtaining the prediction probability of each tag from the estimated tag distribution; the estimated tag distribution comprises the prediction probability that each corpus in the initial corpus is estimated as each tag;
calculating an average value of the prediction probabilities of all the labels, and taking the average value as the confidence level of each label;
Aiming at each corpus, obtaining labels with prediction probability meeting a preset confidence condition; the preset confidence condition means that the prediction probability is required to be the maximum prediction probability exceeding the tag confidence;
counting the number of the linguistic data in the initial corpus set under the label category meeting the preset confidence condition;
constructing a confidence matrix based on the number; the sum of all elements in the confidence matrix is 1.
Optionally, constructing a confidence matrix based on the number includes:
Each label is used as a known labeling label and an unknown lossless label, and an initial confidence matrix approximating the joint distribution between labeling labels and lossless labels is constructed; the element in the nth row and the mth column of the initial confidence matrix represents the number of corpora labeled y_n and predicted as label y_m;
performing category normalization processing on elements in the initial confidence matrix based on the total number of the linguistic data under each label category in the initial corpus set to obtain a category normalization confidence matrix;
carrying out overall normalization processing on the elements in the category-normalized confidence matrix to obtain a final confidence matrix; the normalization processing includes the per-category normalization and the overall normalization; the element in the nth row and the mth column of the final confidence matrix represents a normalized value of the number of corpora labeled y_n and predicted as y_m; the sum of all elements in the final confidence matrix is 1.
Optionally, acquiring the noise corpus in the initial corpus set based on the confidence matrix includes:
acquiring the corpora corresponding to the non-diagonal, non-zero elements of the confidence matrix to obtain a high-noise corpus set;
acquiring, from the high-noise corpus set, the corpora that are labeled y_n and whose prediction probability exceeds the confidence of label y_m; these corpora form a high-noise corpus subset S_nm, where y_n ≠ y_m;
sorting all the corpora in the high-noise corpus subset S_nm by prediction probability;
starting from the corpus with the smallest prediction probability, screening corpora of the high-noise corpus subset S_nm as the noise corpora in the initial corpus set; the number of noise corpora is the product of a preset noise reduction proportion, the number of corpora in the high-noise corpus set, and the target element of the confidence matrix; the target element is the element of the confidence matrix corresponding to corpora labeled y_n and predicted as label y_m.
Optionally, processing the noise corpus in the initial corpus set includes:
removing the noise corpus from the initial corpus set;
Or correcting the error labels of the noise corpus in the initial corpus set.
Optionally, the proportion of noise-containing corpora in the initial corpus set does not exceed 50%.
According to a second aspect of embodiments of the present disclosure, there is provided a corpus noise reduction device, the device including:
The label distribution acquisition module is used for acquiring estimated label distribution of the initial corpus set;
the confidence matrix acquisition module is used for acquiring a confidence matrix according to the estimated tag distribution, and the confidence matrix is used for describing tag noise distribution under the category condition;
the noise corpus acquisition module is used for acquiring noise corpus in the initial corpus set based on the confidence matrix;
the noise corpus processing module is used for processing the noise corpus in the initial corpus set to obtain a target corpus set.
Optionally, the tag distribution acquiring module includes:
the subset acquisition unit is used for dividing the initial corpus set into K subsets, and sequentially taking each subset of the K subsets as a verification set and the other subsets as training sets; k is a positive integer;
The distribution acquisition unit is used for training the preset vertical domain corpus noise reduction model by using the training set in sequence to obtain a trained vertical domain corpus noise reduction model, and acquiring estimated tag distribution of the verification set by using the trained vertical domain corpus noise reduction model to obtain K estimated tag distribution;
The distribution splicing unit is used for splicing the K estimated tag distributions to obtain estimated tag distributions of the initial corpus; the estimated tag distribution is used to approximate a lossless tag distribution representing the initial corpus.
Optionally, the label distribution of the corpus in each subset is the same as the label distribution of the corpus in the initial corpus set.
Optionally, the confidence matrix obtaining module includes:
A probability obtaining unit, configured to obtain a prediction probability of each tag from the estimated tag distribution; the estimated tag distribution comprises the prediction probability that each corpus in the initial corpus is estimated as each tag;
An average value calculating unit, configured to calculate an average value of prediction probabilities of each tag, and use the average value as a confidence level of each tag;
the label acquisition unit is used for acquiring labels with prediction probability meeting preset confidence conditions for each corpus; the preset confidence condition means that the prediction probability is required to be the maximum prediction probability exceeding the tag confidence;
the quantity counting unit is used for counting the quantity of the linguistic data in the initial linguistic data set under the label category meeting the preset confidence condition;
A matrix construction unit for constructing a confidence matrix based on the number; the sum of all elements in the confidence matrix is 1.
Optionally, the matrix construction unit includes:
an initial matrix construction subunit, configured to use each label as a known labeling label and an unknown lossless label and construct an initial confidence matrix approximating the joint distribution between labeling labels and lossless labels; the element in the nth row and the mth column of the initial confidence matrix represents the number of corpora labeled y_n and predicted as label y_m;
The category normalization subunit is used for carrying out category normalization processing on the elements in the initial confidence matrix based on the total number of the linguistic data under each label category in the initial linguistic data set to obtain a category normalization confidence matrix;
The overall normalization subunit is used for carrying out overall normalization processing on the elements in the category-normalized confidence matrix to obtain a final confidence matrix; the normalization processing includes the per-category normalization and the overall normalization; the element in the nth row and the mth column of the final confidence matrix represents a normalized value of the number of corpora labeled y_n and predicted as y_m; the sum of all elements in the final confidence matrix is 1.
Optionally, the noise corpus acquisition module includes:
The set acquisition unit is used for acquiring the corpora corresponding to the non-diagonal, non-zero elements of the confidence matrix to obtain a high-noise corpus set;
the subset obtaining unit is used for acquiring, from the high-noise corpus set, the corpora that are labeled y_n and whose prediction probability exceeds the confidence of label y_m; these corpora form a high-noise corpus subset S_nm, where y_n ≠ y_m;
the corpus sorting unit is used for sorting all the corpora in the high-noise corpus subset S_nm by prediction probability;
the corpus screening unit is used for screening, starting from the corpus with the smallest prediction probability, corpora of the high-noise corpus subset S_nm as the noise corpora in the initial corpus set; the number of noise corpora is the product of a preset noise reduction proportion, the number of corpora in the high-noise corpus set, and the target element of the confidence matrix; the target element is the element of the confidence matrix corresponding to corpora labeled y_n and predicted as label y_m.
Optionally, the noise corpus processing module includes:
a corpus rejecting unit, configured to reject the noise corpus from the initial corpus set;
Or alternatively
The corpus correction unit is used for correcting the error labels of the noise corpus in the initial corpus set.
Optionally, the proportion of noise-containing corpora in the initial corpus set does not exceed 50%.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
A processor;
A memory for storing a computer program executable by the processor;
Wherein the processor is configured to execute the computer program in the memory to implement the method as claimed in any one of the preceding claims.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the method as described in any one of the above.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
The solution provided by the embodiments of the present disclosure can acquire an estimated tag distribution of an initial corpus set; then acquire a confidence matrix according to the estimated tag distribution, the confidence matrix describing the tag noise distribution under each category condition; then acquire the noise corpora in the initial corpus set based on the confidence matrix; and finally process the noise corpora in the initial corpus set to obtain a target corpus set. In this way, a confidence matrix can be established from the predicted probabilities of the labels and the labeling labels, and the noise corpora in the initial corpus set are identified through the confidence matrix. After the noise corpora are processed, the proportion of noise corpora and the ambiguity information in the target corpus set are reduced and the boundaries of the target corpus set become clearer, which reduces the number of training iterations of the vertical-domain model and thus reduces the computing resources and time required for training and improves training efficiency. Alternatively, the present embodiment may provide a high-quality training corpus to improve the classification accuracy of the trained vertical-domain model, so that the vertical-domain model can provide a high-quality speech service when facing a user request.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating a corpus noise reduction method according to an exemplary embodiment.
FIG. 2 is a flowchart illustrating obtaining an estimated tag distribution, according to an example embodiment.
FIG. 3 is a flow chart illustrating the acquisition of a confidence matrix according to an exemplary embodiment.
FIG. 4 is a flowchart illustrating the acquisition of a noise corpus, according to an example embodiment.
Fig. 5 is a graph illustrating a method for obtaining accuracy of recognition noise corpus in an application scenario according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a corpus noise reduction device according to an example embodiment.
Fig. 7 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The embodiments described by way of example below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus consistent with some aspects of the disclosure as detailed in the accompanying claims.
In order to solve the above technical problems, embodiments of the present disclosure provide a corpus noise reduction method and apparatus, an electronic device, and a storage medium, which may be applied to an electronic device, where the electronic device may include, but is not limited to: personal computers (Personal Computer, PCs), smart phones, servers or server clusters, etc. Fig. 1 is a flow chart illustrating a corpus noise reduction method according to an exemplary embodiment. Referring to fig. 1, a corpus noise reduction method includes steps 11 to 14.
In step 11, an estimated tag distribution of the initial corpus is obtained.
In this embodiment, the electronic device may acquire an initial corpus set from a specified location, where the initial corpus set is vertical-domain training corpus data; the effect is shown in table 1. A vertical domain is a field to which a user's voice command relates, such as music or weather, and is not limited herein. The specified location may be a local memory of the electronic device, a cloud, or an external storage device. In an example, the initial corpus set may include user-requested texts and the labels to which these texts belong, and the labels may be manually annotated in advance.
TABLE 1 Labeling labels of corpora

Corpus     Labeling label
X_i=1      y_j=2
X_i=2      y_j=1
X_i=3      y_j=1
X_i=4      y_j=1
X_i=5      y_j=2
It should be noted that, in this embodiment, it is assumed that a part of the corpora in the initial corpus set contains noise, and another part of the corpora is noiseless. Since the noiseless corpus is the basis of noise reduction, the corpus containing noise cannot account for more than 50% of the total corpus in the initial corpus set.
In this embodiment, a vertical domain corpus noise reduction model is pre-stored in the electronic device, the input of the vertical domain corpus noise reduction model is vertical domain corpus, the output of the vertical domain corpus noise reduction model is the probability that each corpus corresponds to each label, and the cross entropy of the predicted label distribution and the labeled label is used as an optimization target. In an example, the vertical corpus noise reduction model may be constructed using a pre-trained language model BERT and a multi-layer perceptron.
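As an illustration, a model of this shape can be sketched as follows. This is a minimal sketch assuming PyTorch and the Hugging Face transformers library; the class name VerticalDomainClassifier, the checkpoint bert-base-chinese, and the head sizes are illustrative assumptions rather than the patent's implementation.

```python
# Minimal sketch of a vertical-domain corpus noise reduction model: a pre-trained
# BERT encoder with a multi-layer perceptron head, optimized with the cross
# entropy between the predicted label distribution and the labeling labels.
# Class name, checkpoint and head sizes are illustrative assumptions.
import torch.nn as nn
from transformers import BertModel

class VerticalDomainClassifier(nn.Module):
    def __init__(self, num_labels: int, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.mlp = nn.Sequential(                 # multi-layer perceptron head
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]         # [CLS] representation
        return self.mlp(cls)                      # one logit per label

loss_fn = nn.CrossEntropyLoss()                   # optimization target
```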
In this embodiment, referring to fig. 2, in step 21, the electronic device may divide the corpora of the initial corpus set into K subsets D_1, D_2, ..., D_K. K is a positive integer; K may be 10, and may be set according to the specific scenario, which is not limited herein. The electronic device may then take each of the K subsets in turn as the verification set and the remaining subsets as the training set. For example, in the 1st round, subset D_1 may be used as the verification set and subsets D_2, ..., D_K as the training set; in the 2nd round, subset D_2 may be used as the verification set and subsets D_1, D_3, ..., D_K as the training set; and so on, until in the Kth round subset D_K is used as the verification set and subsets D_1, D_2, ..., D_{K-1} as the training set, finally yielding K pairs of training and verification sets.
In the process of dividing the subsets, the electronic device may also take the labeling labels of the corpora into account, so that the label distribution of the corpora in each subset is the same as the label distribution of the corpora in the initial corpus set; that is, for each subset, the ratio of the number of corpora with a given label in the subset to the total number of corpora in the subset equals the ratio of the number of corpora with that label in the initial corpus set to the total number of corpora in the initial corpus set.
It should be further noted that, since the total number of corpora in the initial corpus set is very large, usually in excess of a million, the proportion of a given label in each subset and in the initial corpus set can be regarded as the same. As the total number of corpora decreases, for example to ten thousand, the proportion of a given label may differ across the subsets when the corpora are divided into 10 groups (i.e., K takes the value 10), and an error may exist between the proportion of that label in a subset and in the initial corpus set; if the error does not exceed 5%, the two proportions may still be regarded as the same. That is, the above "the same" includes, in addition to exact identity, close approximation within a certain error.
In this embodiment, referring to fig. 2, in step 22, after the training sets and verification sets are divided, the electronic device may sequentially input the corpora of each training set (for example, in the kth round, all subsets except D_k) into the preset vertical-domain corpus noise reduction model, that is, train the preset vertical-domain corpus noise reduction model with the training set until all corpora are used or the success rate of the labels predicted by the model exceeds a set threshold (for example, 95%-99%), finally obtaining a trained vertical-domain corpus noise reduction model. The electronic device may then acquire the estimated tag distribution of the corresponding verification set with the trained model: each corpus in the verification set is input into the trained vertical-domain corpus noise reduction model in turn, and the model outputs the prediction probability of that corpus for each label. The training and verification steps are repeated; after the K pairs of training and verification sets have been used over K rounds, the electronic device obtains K estimated tag distributions, i.e., each verification set corresponds to one estimated tag distribution P_k.
In this embodiment, with continued reference to fig. 2, in step 23, the electronic device may splice the K estimated tag distributions P_1, P_2, ..., P_K to obtain the estimated tag distribution P of the initial corpus set. The estimated tag distribution P may be used to approximately represent the lossless tag distribution P' of the initial corpus set. A lossless label is the true label of a corpus; the lossless tag distribution P' is difficult to obtain, so this example uses the estimated tag distribution P as an approximate substitute for P', although some error relative to P' remains. To prevent this error from affecting the subsequently identified noise corpora, the embodiments of the present disclosure do not use the estimated tag distribution directly as the basis for noise identification; instead, a confidence matrix is obtained from the estimated tag distribution, as described in detail in step 12.
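As an illustration, steps 21 to 23 can be sketched as follows in Python. This is a minimal sketch, not the patent's implementation: train_model and predict_proba are hypothetical helpers wrapping the training and softmax prediction of the noise reduction model above, and scikit-learn's StratifiedKFold is one way to keep each subset's label distribution the same as that of the initial corpus set.

```python
# Sketch of steps 21-23: stratified K-fold split, out-of-fold prediction, and
# splicing the K verification-set distributions into one estimated tag
# distribution P. train_model / predict_proba are hypothetical helpers.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def estimate_tag_distribution(texts, labels, num_labels, K=10):
    labels = np.asarray(labels)
    P = np.zeros((len(texts), num_labels))        # spliced estimated distribution
    skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(texts, labels):
        model = train_model([texts[i] for i in train_idx], labels[train_idx])
        P[val_idx] = predict_proba(model, [texts[i] for i in val_idx])
    return P                                      # approximates the lossless P'
```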
In step 12, a confidence matrix is obtained according to the estimated tag distribution, wherein the confidence matrix is used for describing the tag noise distribution under the category condition.
In this embodiment, after obtaining the estimated tag distribution P, the electronic device may obtain the confidence matrix according to the estimated tag distribution P, see fig. 3, including steps 31 to 35.
In step 31, the electronic device may obtain a predicted probability for each tag from within the estimated tag distribution; wherein estimating the tag distribution includes estimating a predictive probability of each corpus in the initial corpus set being estimated as each tag.
In step 32, the electronic device may calculate the average of the prediction probabilities {P_1j, P_2j, ..., P_tj} of each tag y_j and use the average as the confidence of that tag; the effect is shown in table 2. Here t represents the total number of corpora in the initial corpus set.
TABLE 2 confidence level of each tag
In step 33, for each corpus, the electronic device may obtain the label whose prediction probability satisfies a preset confidence condition; the preset confidence condition means that the prediction probability must be the maximum prediction probability among those exceeding the tag confidences, and the effect is shown in table 3. Referring to table 3, the bolded numbers represent, for each corpus, the maximum prediction probability among those exceeding the corresponding tag confidences, i.e., the elements satisfying the screening condition under each category.
Referring to tables 2 and 3, for corpus X_i=1, among the three labels only the prediction probability of label y_j=3, 0.51, exceeds the confidence of label y_j=3, while the prediction probabilities of labels y_j=1 and y_j=2 do not exceed the confidences of the corresponding labels, so 0.51 is bolded. For corpus X_i=2, only the prediction probability of label y_j=1, 0.34, exceeds the confidence of label y_j=1, so 0.34 is bolded. For corpus X_i=3, only the prediction probability of label y_j=2, 0.25, exceeds the confidence of label y_j=2, so 0.25 is bolded. For corpus X_i=4, none of the three prediction probabilities exceeds the confidence of the corresponding label, so no prediction probability is bolded. For corpus X_i=5, only the prediction probability of label y_j=3, 0.61, exceeds the confidence of label y_j=3, so 0.61 is bolded.
TABLE 3 maximum prediction probability of exceeding tag confidence under each category
In step 34, the electronic device may count the number of corpora in the initial corpus set under each tag category that satisfies the preset confidence condition. With continued reference to table 3, the number of corpora whose qualifying label is y_j=1 is 1 (i.e., corpus X_i=2), the number whose qualifying label is y_j=2 is 1 (i.e., corpus X_i=3), and the number whose qualifying label is y_j=3 is 2 (i.e., corpora X_i=1 and X_i=5).
In step 35, the electronic device may construct a confidence matrix based on the number; the sum of all elements in the confidence matrix is 1. For example, the electronic device may construct an initial confidence matrix for the joint distribution between the approximate labeling tag and the lossless tag with each tag as a known labeling tag and an unknown lossless tag, with the effects shown in table 4.
TABLE 4 Initial confidence matrix

         y_m=1   y_m=2   y_m=3
y_n=1    1       1       0
y_n=2    0       0       2
y_n=3    0       0       0
The element in the nth row and the mth column of the initial confidence matrix represents the number of corpora labeled y_n and predicted as label y_m.
The electronic device may normalize the initial confidence matrix based on the number of corpora of each category; the normalization process includes per-category normalization (the effect is shown in table 5) and overall normalization, yielding the final confidence matrix shown in table 6. The element in the nth row and the mth column of the final confidence matrix represents the normalized value of the number of corpora labeled y_n and predicted as y_m.
Referring to tables 1, 4 and 5, for label y_n=1, the sum of the first-row elements of the initial confidence matrix is 1+1+0=2, while the total number of corpora labeled y_n=1 in the initial corpus set is 3; to make the first-row sum consistent with the total number of corpora labeled y_n=1, the first-row elements of table 4 are scaled, i.e., 1 × 3/2 = 1.5. For label y_n=2, the sum of the second-row elements is 0+0+2=2, and the number of corpora labeled y_n=2 in the initial corpus set is 2, so the sums already agree and the second-row elements need no scaling. Label y_n=3 is handled likewise. The scaled confidence matrix is shown in table 5.
TABLE 5 Confidence matrix normalized by category

         y_m=1   y_m=2   y_m=3
y_n=1    1.5     1.5     0
y_n=2    0       0       2
y_n=3    0       0       0
In addition, to obtain confidence values in the [0,1] range, the confidence matrix as a whole is normalized, i.e., each element is divided by the sum of all elements of the confidence matrix. For example, 1.5/(1.5+1.5+2) = 0.3 and 2/(1.5+1.5+2) = 0.4. The final normalized confidence matrix is shown in table 6; the sum of all numeric elements of the confidence matrix equals 1.
TABLE 6 Confidence matrix after final normalization

         y_m=1   y_m=2   y_m=3
y_n=1    0.3     0.3     0
y_n=2    0       0       0.4
y_n=3    0       0       0
In step 13, a noise corpus in the initial corpus set is obtained based on the confidence matrix.
In this embodiment, after obtaining the confidence matrix, the electronic device may obtain the noise corpus in the initial corpus set based on the confidence matrix, see fig. 4, including steps 41 to 44.
In step 41, the electronic device may acquire the corpora corresponding to the non-diagonal, non-zero elements of the confidence matrix to obtain a high-noise corpus set S. The high-noise corpus set S represents the corpora that are, with high probability, erroneous or unreasonably labeled.
With continued reference to tables 3 and 6, the elements at positions (1, 2) and (2, 3) of table 6 are selected, while the element at position (1, 1) in the upper left corner of table 6 lies on the diagonal and is not selected. The element at position (1, 2) corresponds to corpus X_i=3 in table 3; the element at position (2, 3) corresponds to corpora X_i=1 and X_i=5 in table 3. Thus, the high-noise corpus set may include corpora X_i=3, X_i=1 and X_i=5.
In step 42, the electronic device may acquire, from the high-noise corpus set, the corpora that are labeled y_n and whose prediction probability exceeds the confidence of label y_m; these corpora form a high-noise corpus subset S_nm, where y_n ≠ y_m. Taking S_{n=2,m=3} as an example, the subset contains corpora X_i=1 (P_{i=1,j=3} = 0.51) and X_i=5 (P_{i=5,j=3} = 0.61).
In step 43, the electronic device may sort the corpora in the high-noise corpus subset S_nm by prediction probability. Taking ascending order as an example, the order is: X_i=1 (P_{i=1,j=3} = 0.51), X_i=5 (P_{i=5,j=3} = 0.61).
In step 44, the electronic device may screen corpora of the high-noise corpus subset S_nm, starting from the corpus with the smallest prediction probability, and take them as the noise corpora in the initial corpus set; the number of noise corpora is the product of a preset noise reduction proportion, the number of corpora in the high-noise corpus set, and the target element of the confidence matrix; the target element is the element of the confidence matrix corresponding to corpora labeled y_n and predicted as label y_m.
With the preset noise reduction proportion α = 0.5, the number of corpora in the high-noise corpus set T = 3, and the target element C_{n=2,m=3} = 0.4 of the confidence matrix, the number of corpora to select from the high-noise corpus subset S_{n=2,m=3} can be calculated as α × T × C_{n=2,m=3} = 0.5 × 3 × 0.4 = 0.6 ≈ 1, i.e., 1 corpus is selected as the noise corpus. Then X_i=1 (P_{i=1,j=3} = 0.51) and X_i=5 (P_{i=5,j=3} = 0.61) are ordered from small to large; since 0.51 < 0.61, the corpus X_i=1 corresponding to 0.51 is determined as the final noise corpus. That is, the α × T × C_nm corpora with the smallest prediction probabilities P_im are selected as the noise recognition result.
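As an illustration, steps 41 to 44 can be sketched as follows, continuing from build_confidence_matrix above. Rounding α × T × C[n, m] up to the nearest integer is an assumption made here so that the worked example (0.6 ≈ 1) yields one corpus.

```python
# Sketch of steps 41-44: collect the high-noise corpus set from off-diagonal
# non-zero entries of C, form each subset S_nm, sort it ascending by prediction
# probability, and keep the alpha * T * C[n, m] smallest corpora as noise.
import math

def select_noise_corpora(P, labels, C, predicted, alpha=0.5):
    L = C.shape[0]
    # step 41: corpora whose qualifying predicted label is off the diagonal
    S = [i for i in range(len(labels))
         if predicted[i] >= 0 and predicted[i] != labels[i]]
    T = len(S)
    noise = []
    for n in range(L):
        for m in range(L):
            if n == m or C[n, m] == 0:
                continue
            # step 42: subset S_nm, labeled n and predicted m
            S_nm = [i for i in S if labels[i] == n and predicted[i] == m]
            S_nm.sort(key=lambda i: P[i, m])      # step 43: ascending probability
            k = math.ceil(alpha * T * C[n, m])    # step 44: assumed rounding
            noise.extend(S_nm[:k])
    return noise
```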
In step 14, the noise corpus in the initial corpus set is processed, and a target corpus set is obtained.
In this embodiment, after determining the noise corpora, the electronic device may process the noise corpora in the initial corpus set. The processing may include: removing the noise corpora from the initial corpus set; or correcting the erroneous labels of the noise corpora in the initial corpus set, in which case the user may be reminded to correct manually and the corrected corpora replace the original noise corpora. In this way, the electronic device obtains the target corpus set.
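As an illustration, the two processing options can be sketched as follows; substituting the model's qualifying predicted label for an erroneous label is an assumption made for this sketch, since the patent leaves the correction to manual review.

```python
# Sketch of step 14: either remove the noise corpora, or correct their labels
# (here by substituting the qualifying predicted label, an assumption; the
# patent suggests reminding the user to correct manually).
def build_target_corpus(texts, labels, predicted, noise_ids):
    noise = set(noise_ids)
    removed = [(x, y) for i, (x, y) in enumerate(zip(texts, labels))
               if i not in noise]                       # option 1: reject
    corrected = [(x, predicted[i] if i in noise else y)
                 for i, (x, y) in enumerate(zip(texts, labels))]  # option 2: relabel
    return removed, corrected
```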
It can be appreciated that after the target corpus is obtained, the electronic device can use the target corpus to perform model training. For example, the vertical domain corpus noise reduction model is continuously trained, the confidence is recalculated, and the labels and the prediction probabilities of the corpus are readjusted. In another example, a vertical domain service model is trained, and high-quality service is provided by the vertical domain service model and fed back to the user.
The solution provided by the embodiments of the present disclosure can acquire an estimated tag distribution of the initial corpus set; then acquire a confidence matrix according to the estimated tag distribution, the confidence matrix describing the tag noise distribution under each category condition; then acquire the noise corpora in the initial corpus set based on the confidence matrix; and finally process the noise corpora in the initial corpus set to obtain a target corpus set. In this way, a confidence matrix can be established from the predicted probabilities of the labels and the labeling labels, and the noise corpora in the initial corpus set are identified through the confidence matrix. After the noise corpora are processed, the proportion of noise corpora and the ambiguity information in the target corpus set are reduced and the boundaries of the target corpus set become clearer, which reduces the number of training iterations of the vertical-domain model and thus reduces the computing resources and time required for training and improves training efficiency. Alternatively, the present embodiment may provide a high-quality training corpus to improve the classification accuracy of the trained vertical-domain model, so that the vertical-domain model can provide a high-quality speech service when facing a user request.
The effect of the corpus noise reduction method provided by this embodiment is analyzed below in combination with the application scenario of an intelligent voice assistant.
The electronic device may acquire a million-scale vertical-domain corpus as the initial corpus set and then identify the noise corpora with the corpus noise reduction method above, so that noise corpora are mined automatically with high noise recognition accuracy and reduced labor cost. In addition, the computing resources required in this example are relatively small and the energy consumption is low; for a million-scale vertical-domain corpus, a single V100 card needs 5 hours.
Based on the manual review results, an identified noise corpus that agrees with the manual review result is counted as accurate; the evaluation results are shown in table 7, and the accuracy is shown in fig. 5.
TABLE 7
Referring to table 7 and fig. 5, the corpus noise reduction method provided by this example can identify the noise corpora in the initial corpus set, and the accuracy improves as noise corpora are removed, which helps improve corpus quality and the quality of service provided by the subsequent vertical-domain model. Taking the real scenario of applying vertical-domain model corpus noise reduction to an intelligent voice assistant as an example, the evaluation results are shown in tables 8 and 9.
TABLE 8 experimental results of multifunctional classification model in vertical domain
TABLE 9 vertical recall two-class model test results
# Noise corpus: the number of possible noise corpora identified by the model;
Average (noise reduction) ≡original ratio;
Original ratio: model classification performance on the vertical-domain corpus without noise reduction;
Corrected ratio: model classification performance on the vertical-domain corpus after noise reduction;
On-line feedback, non-noise-reduced ratio: model classification performance on the failure examples from on-line feedback, without noise reduction;
On-line feedback, corpus-removed ratio: model classification performance on the failure examples from on-line feedback, after the noise corpora are removed;
Lifting ratio: the corrected ratio minus the original ratio.
On the basis of the corpus noise reduction method provided by the embodiment of the present disclosure, the embodiment of the present disclosure further provides a corpus noise reduction device, which is applied to an electronic device, referring to fig. 6, and the device includes:
The tag distribution obtaining module 61 is configured to obtain an estimated tag distribution of the initial corpus;
a confidence matrix acquisition module 62, configured to acquire a confidence matrix according to the estimated tag distribution, where the confidence matrix is used to describe tag noise distribution under a category condition;
a noise corpus acquisition module 63, configured to acquire a noise corpus in the initial corpus set based on the confidence matrix;
The noise corpus processing module 64 is configured to process the noise corpus in the initial corpus set to obtain a target corpus set.
In an embodiment, the tag distribution acquiring module includes:
the subset acquisition unit is used for dividing the initial corpus set into K subsets, and sequentially taking each subset of the K subsets as a verification set and the other subsets as training sets; k is a positive integer;
The distribution acquisition unit is used for training the preset vertical domain corpus noise reduction model by using the training set in sequence to obtain a trained vertical domain corpus noise reduction model, and acquiring estimated tag distribution of the verification set by using the trained vertical domain corpus noise reduction model to obtain K estimated tag distribution;
The distribution splicing unit is used for splicing the K estimated tag distributions to obtain estimated tag distributions of the initial corpus; the estimated tag distribution is used to approximate a lossless tag distribution representing the initial corpus.
In an embodiment, the label distribution of the corpus in each subset is the same as the label distribution of the corpus in the initial corpus set.
In an embodiment, the confidence matrix obtaining module includes:
A probability obtaining unit, configured to obtain a prediction probability of each tag from the estimated tag distribution; the estimated tag distribution comprises the prediction probability that each corpus in the initial corpus is estimated as each tag;
An average value calculating unit, configured to calculate an average value of prediction probabilities of each tag, and use the average value as a confidence level of each tag;
the label acquisition unit is used for acquiring labels with prediction probability meeting preset confidence conditions for each corpus; the preset confidence condition means that the prediction probability is required to be the maximum prediction probability exceeding the tag confidence;
the quantity counting unit is used for counting the quantity of the linguistic data in the initial linguistic data set under the label category meeting the preset confidence condition;
A matrix construction unit for constructing a confidence matrix based on the number; the sum of all elements in the confidence matrix is 1.
In an embodiment, the matrix construction unit comprises:
an initial matrix construction subunit, configured to use each label as a known labeling label and an unknown lossless label and construct an initial confidence matrix approximating the joint distribution between labeling labels and lossless labels; the element in the nth row and the mth column of the initial confidence matrix represents the number of corpora labeled y_n and predicted as label y_m;
The category normalization subunit is used for carrying out category normalization processing on the elements in the initial confidence matrix based on the total number of the linguistic data under each label category in the initial linguistic data set to obtain a category normalization confidence matrix;
The overall normalization subunit is used for carrying out overall normalization processing on the elements in the category-normalized confidence matrix to obtain a final confidence matrix; the normalization processing includes the per-category normalization and the overall normalization; the element in the nth row and the mth column of the final confidence matrix represents a normalized value of the number of corpora labeled y_n and predicted as y_m; the sum of all elements in the final confidence matrix is 1.
In an embodiment, the noise corpus acquisition module includes:
The set acquisition unit is used for acquiring the corpora corresponding to the non-diagonal, non-zero elements of the confidence matrix to obtain a high-noise corpus set;
the subset obtaining unit is used for acquiring, from the high-noise corpus set, the corpora that are labeled y_n and whose prediction probability exceeds the confidence of label y_m; these corpora form a high-noise corpus subset S_nm, where y_n ≠ y_m;
the corpus sorting unit is used for sorting all the corpora in the high-noise corpus subset S_nm by prediction probability;
the corpus screening unit is used for screening, starting from the corpus with the smallest prediction probability, corpora of the high-noise corpus subset S_nm as the noise corpora in the initial corpus set; the number of noise corpora is the product of a preset noise reduction proportion, the number of corpora in the high-noise corpus set, and the target element of the confidence matrix; the target element is the element of the confidence matrix corresponding to corpora labeled y_n and predicted as label y_m.
In an embodiment, the noise corpus processing module includes:
a corpus rejecting unit, configured to reject the noise corpus from the initial corpus set;
Or alternatively
The corpus correction unit is used for correcting the error labels of the noise corpus in the initial corpus set.
In an embodiment, the proportion of noise-containing corpora in the initial corpus set does not exceed 50%.
It should be noted that, the device shown in this embodiment is matched with the content of the method embodiment shown in fig. 1, and reference may be made to the content of the method embodiment described above, which is not described herein again.
Fig. 7 is a block diagram of an electronic device, according to an example embodiment. For example, the electronic device 700 may be a smart phone, a computer, a digital broadcast terminal, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 7, an electronic device 700 may include one or more of the following components: a processing component 702, a memory 704, a power component 706, a multimedia component 708, an audio component 710, an input/output (I/O) interface 712, a sensor component 714, a communication component 716, an image acquisition component 718, and a housing as described above.
The processing component 702 generally controls overall operation of the electronic device 700, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 702 may include one or more processors 720 to execute computer programs. Further, the processing component 702 can include one or more modules that facilitate interaction between the processing component 702 and other components. For example, the processing component 702 may include a multimedia module to facilitate interaction between the multimedia component 708 and the processing component 702.
The memory 704 is configured to store various categories of data to support operation at the electronic device 700. Examples of such data include computer programs, contact data, phonebook data, messages, pictures, videos, etc. for any application or method operating on the electronic device 700. The memory 704 may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 706 provides power to the various components of the electronic device 700. Power supply components 706 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 700. The power supply assembly 706 may include a power chip and the controller may communicate with the power chip to control the power chip to turn on or off the switching device to power the motherboard circuit with or without the battery.
The multimedia component 708 includes a screen that provides an output interface between the electronic device 700 and the target object. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input information from a target object. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation.
The audio component 710 is configured to output and/or input audio signals. For example, the audio component 710 includes a microphone (MIC) configured to receive external audio signals when the electronic device 700 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 704 or transmitted via the communication component 716. In some embodiments, the audio component 710 further includes a speaker for outputting audio signals.
The I/O interface 712 provides an interface between the processing component 702 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc.
The sensor assembly 714 includes one or more sensors for providing status assessment of various aspects of the electronic device 700. For example, the sensor assembly 714 may detect an on/off state of the electronic device 700, a relative positioning of the components, such as a display and keypad of the electronic device 700, a change in position of the electronic device 700 or one of the components, the presence or absence of a target object in contact with the electronic device 700, an orientation or acceleration/deceleration of the electronic device 700, and a change in temperature of the electronic device 700. In this example, the sensor assembly 714 can include a magnetic force sensor, a gyroscope, and a magnetic field sensor, wherein the magnetic field sensor includes at least one of: hall sensors, thin film magneto-resistive sensors, and magnetic liquid acceleration sensors.
The communication component 716 is configured to facilitate communication between the electronic device 700 and other devices, either wired or wireless. The electronic device 700 may access a wireless network based on a communication standard, such as WiFi,2G, 3G, 4G, 5G, or a combination thereof. In one exemplary embodiment, the communication component 716 receives broadcast information or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 716 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 700 can be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements.
In an exemplary embodiment, there is also provided an electronic device including:
A processor;
A memory for storing a computer program executable by the processor;
Wherein the processor is configured to execute the computer program in the memory to implement the steps of the method as described in fig. 1.
In an exemplary embodiment, there is also provided a computer readable storage medium, e.g. a memory comprising instructions, storing an executable computer program that can be executed by a processor to implement the steps of the method described in fig. 1. The readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. A method for corpus noise reduction, the method comprising:
acquiring an estimated label distribution of an initial corpus set;
acquiring a confidence matrix according to the estimated label distribution, the confidence matrix describing the label noise distribution conditioned on each label category;
acquiring noise corpora in the initial corpus set based on the confidence matrix; and
processing the noise corpora in the initial corpus set to obtain a target corpus set;
wherein acquiring a confidence matrix according to the estimated label distribution comprises:
obtaining the prediction probability of each label from the estimated label distribution, the estimated label distribution comprising, for each corpus in the initial corpus set, the prediction probability that the corpus is estimated as each label;
for each label, calculating the average of that label's prediction probabilities over all corpora, and taking the average as the confidence of that label;
for each corpus, obtaining the label whose prediction probability meets a preset confidence condition, the preset confidence condition requiring the prediction probability to be the largest among the prediction probabilities that exceed their corresponding label confidences;
counting, for each label category meeting the preset confidence condition, the number of corpora in the initial corpus set; and
constructing the confidence matrix based on the counts, the sum of all elements in the confidence matrix being 1 (an illustrative sketch follows this claim).
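As an illustrative aid to the construction recited in claim 1, the following Python sketch shows one plausible reading of the per-label confidences and the counting step. All identifiers (build_confidence_matrix, pred_probs, noisy_labels) are assumptions introduced here for exposition, not names taken from the patent.

```python
import numpy as np

def build_confidence_matrix(pred_probs: np.ndarray, noisy_labels: np.ndarray) -> np.ndarray:
    """pred_probs: (num_corpora, num_labels) estimated label distribution;
    noisy_labels: (num_corpora,) annotated label index of each corpus."""
    num_corpora, num_labels = pred_probs.shape

    # Confidence of each label: the average of that label's prediction
    # probabilities over all corpora in the initial corpus set.
    confidences = pred_probs.mean(axis=0)

    counts = np.zeros((num_labels, num_labels), dtype=np.int64)
    for i in range(num_corpora):
        # Labels whose prediction probability exceeds their confidence.
        candidates = np.flatnonzero(pred_probs[i] > confidences)
        if candidates.size == 0:
            continue  # this corpus satisfies no preset confidence condition
        # Among the candidates, keep the label with the largest probability.
        predicted = candidates[np.argmax(pred_probs[i, candidates])]
        counts[noisy_labels[i], predicted] += 1

    # Normalize so that all elements sum to 1 (claim 4 refines this into
    # a per-category step followed by an overall step).
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)
```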
2. The method of claim 1, wherein acquiring an estimated label distribution of an initial corpus set comprises:
dividing the initial corpus set into K subsets and, in turn, taking each of the K subsets as a validation set and the remaining subsets as a training set, K being a positive integer;
training a preset vertical-domain corpus noise reduction model on each training set in turn to obtain a trained vertical-domain corpus noise reduction model, and using the trained model to acquire the estimated label distribution of the corresponding validation set, thereby obtaining K estimated label distributions; and
splicing the K estimated label distributions to obtain the estimated label distribution of the initial corpus set, the estimated label distribution being used to approximate the lossless label distribution of the initial corpus set (a sketch of this K-fold procedure follows this claim).
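The K-fold estimation of claim 2 can be pictured with the sketch below, in which an ordinary scikit-learn-style classifier stands in for the preset vertical-domain corpus noise reduction model; estimate_label_distribution and its parameters are assumed names, and the stratified splitter is one way, not the only way, to satisfy claim 3.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def estimate_label_distribution(model, features, noisy_labels, k=5):
    """Out-of-sample (num_corpora, num_labels) prediction probabilities."""
    num_labels = len(np.unique(noisy_labels))
    pred_probs = np.zeros((len(noisy_labels), num_labels))
    # StratifiedKFold keeps each subset's label distribution close to that
    # of the initial corpus set, as claim 3 requires. We assume every label
    # occurs in every training fold so the predict_proba columns align.
    splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, val_idx in splitter.split(features, noisy_labels):
        fold_model = clone(model)  # fresh, untrained copy per fold
        fold_model.fit(features[train_idx], noisy_labels[train_idx])
        # "Splicing": each validation fold fills its own rows, so together
        # the K fold outputs cover the whole initial corpus set.
        pred_probs[val_idx] = fold_model.predict_proba(features[val_idx])
    return pred_probs
```

Because each corpus appears in exactly one validation set, every row of the result is an out-of-sample estimate, which is what makes the later confidence comparisons meaningful.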
3. The method of claim 2, wherein the label distribution of the corpora in each subset is the same as the label distribution of the corpora in the initial corpus set.
4. The method of claim 1, wherein constructing the confidence matrix based on the counts comprises:
treating each label both as a known annotated label and as an unknown lossless label, and constructing an initial confidence matrix approximating the joint distribution between annotated labels and lossless labels, the element in row n, column m of the initial confidence matrix representing the number of corpora whose annotated label is y_n and whose predicted label is y_m;
performing per-category normalization on the elements of the initial confidence matrix, based on the total number of corpora under each label category in the initial corpus set, to obtain a category-normalized confidence matrix; and
performing overall normalization on the elements of the category-normalized confidence matrix to obtain the final confidence matrix, the normalization processing thus comprising the per-category normalization and the overall normalization; the element in row n, column m of the final confidence matrix represents the normalized number of corpora annotated as y_n and predicted as y_m, and the sum of all elements in the final confidence matrix is 1 (a sketch of the two normalizations follows this claim).
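A minimal sketch of the two normalizations of claim 4, under assumed names (normalize_confidence_matrix, corpora_per_label); row n corresponds to annotated label y_n and column m to predicted label y_m.

```python
import numpy as np

def normalize_confidence_matrix(counts: np.ndarray,
                                corpora_per_label: np.ndarray) -> np.ndarray:
    """counts: (L, L) raw matrix from claim 1; corpora_per_label: (L,)
    total number of corpora under each annotated label category."""
    # Per-category normalization: rescale each row so that its total matches
    # the true number of corpora annotated with that label.
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # leave empty rows untouched
    calibrated = counts / row_sums * corpora_per_label[:, None]
    # Overall normalization: divide by the grand total so that all elements
    # sum to 1, approximating the joint distribution of annotated and
    # lossless labels.
    total = calibrated.sum()
    return calibrated / total if total > 0 else calibrated
```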
5. The method of claim 1, wherein acquiring noise corpora in the initial corpus set based on the confidence matrix comprises:
acquiring the corpora corresponding to the non-zero off-diagonal elements of the confidence matrix, to obtain a high-noise corpus set;
acquiring, from the high-noise corpus set, the corpora whose annotated label is y_n and whose prediction probability for label y_m exceeds the confidence of label y_m, these corpora forming a high-noise corpus subset S_nm, wherein y_n is not equal to y_m;
sorting all corpora in the high-noise corpus subset S_nm by prediction probability; and
starting from the corpus with the smallest prediction probability, screening corpora of the high-noise corpus subset S_nm as the noise corpora of the initial corpus set, the number of noise corpora being the product of a preset noise reduction proportion, the number of corpora in the high-noise corpus set, and the target element of the confidence matrix, the target element being the element whose annotated label is y_n and whose predicted label is y_m (a sketch of this selection follows this claim).
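The selection recited in claim 5 might be sketched as follows, with assumed names throughout. The claim does not state which label's probability drives the ordering of S_nm, so the sketch assumes the annotated label's probability, putting the least self-confident corpora first.

```python
import numpy as np

def select_noise_corpora(pred_probs, noisy_labels, conf_matrix,
                         confidences, noise_ratio=0.5):
    """Indices of corpora flagged as noise in the initial corpus set."""
    num_labels = conf_matrix.shape[0]
    subsets, high_noise = {}, set()
    # High-noise corpus set: corpora falling into the non-zero
    # off-diagonal cells of the confidence matrix.
    for n in range(num_labels):
        for m in range(num_labels):
            if n == m or conf_matrix[n, m] == 0:
                continue
            idx = np.flatnonzero((noisy_labels == n)
                                 & (pred_probs[:, m] > confidences[m]))
            if idx.size:
                subsets[(n, m)] = idx       # the subset S_nm
                high_noise.update(idx.tolist())
    noise = []
    for (n, m), idx in subsets.items():
        # Sort S_nm ascending by the annotated label's probability
        # (assumed ordering key), so screening starts from the smallest.
        order = idx[np.argsort(pred_probs[idx, n])]
        # Per-cell quota: noise-reduction proportion x size of the
        # high-noise corpus set x the target element C[n, m].
        count = int(noise_ratio * len(high_noise) * conf_matrix[n, m])
        noise.extend(order[:count].tolist())
    return sorted(set(noise))
```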
6. The method of claim 1, wherein processing the noise corpora in the initial corpus set comprises:
removing the noise corpora from the initial corpus set; or
correcting the erroneous labels of the noise corpora in the initial corpus set (a sketch of both options follows this claim).
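Both processing options of claim 6 admit a short sketch; handle_noise and its parameters are assumed names, numpy arrays are assumed as inputs, and the corrected label is assumed, for illustration, to be the model's highest-probability label.

```python
import numpy as np

def handle_noise(corpora, noisy_labels, noise_idx, pred_probs=None,
                 correct=False):
    """Returns the target corpus set after removal or label correction."""
    noise_idx = np.asarray(noise_idx, dtype=int)
    if correct and pred_probs is not None:
        # Option 2: correct each erroneous annotation; the replacement label
        # is taken to be the model's highest-probability label.
        fixed = noisy_labels.copy()
        fixed[noise_idx] = pred_probs[noise_idx].argmax(axis=1)
        return corpora, fixed
    # Option 1: remove the noise corpora from the initial corpus set.
    keep = np.setdiff1d(np.arange(len(corpora)), noise_idx)
    return corpora[keep], noisy_labels[keep]
```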
7. The method of claim 1, wherein the proportion of noise corpora in the initial corpus set does not exceed 50%.
8. A corpus noise reduction apparatus, the apparatus comprising:
a label distribution acquisition module, configured to acquire an estimated label distribution of an initial corpus set;
a confidence matrix acquisition module, configured to acquire a confidence matrix according to the estimated label distribution, the confidence matrix describing the label noise distribution conditioned on each label category;
a noise corpus acquisition module, configured to acquire noise corpora in the initial corpus set based on the confidence matrix; and
a noise corpus processing module, configured to process the noise corpora in the initial corpus set to obtain a target corpus set;
wherein the confidence matrix acquisition module comprises:
a probability obtaining unit, configured to obtain the prediction probability of each label from the estimated label distribution, the estimated label distribution comprising, for each corpus in the initial corpus set, the prediction probability that the corpus is estimated as each label;
an average calculating unit, configured to calculate, for each label, the average of that label's prediction probabilities over all corpora, and to take the average as the confidence of that label;
a label acquisition unit, configured to obtain, for each corpus, the label whose prediction probability meets a preset confidence condition, the preset confidence condition requiring the prediction probability to be the largest among the prediction probabilities that exceed their corresponding label confidences;
a quantity counting unit, configured to count, for each label category meeting the preset confidence condition, the number of corpora in the initial corpus set; and
a matrix construction unit, configured to construct the confidence matrix based on the counts, the sum of all elements in the confidence matrix being 1.
9. The apparatus of claim 8, wherein the label distribution acquisition module comprises:
a subset acquisition unit, configured to divide the initial corpus set into K subsets and, in turn, take each of the K subsets as a validation set and the remaining subsets as a training set, K being a positive integer;
a distribution acquisition unit, configured to train a preset vertical-domain corpus noise reduction model on each training set in turn to obtain a trained vertical-domain corpus noise reduction model, and to use the trained model to acquire the estimated label distribution of the corresponding validation set, thereby obtaining K estimated label distributions; and
a distribution splicing unit, configured to splice the K estimated label distributions to obtain the estimated label distribution of the initial corpus set, the estimated label distribution being used to approximate the lossless label distribution of the initial corpus set.
10. The apparatus of claim 9, wherein a tag distribution of the corpora in each subset is the same as a tag distribution of the corpora in the initial corpus set.
11. The apparatus of claim 8, wherein the matrix construction unit comprises:
an initial matrix construction subunit, configured to treat each label both as a known annotated label and as an unknown lossless label, and to construct an initial confidence matrix approximating the joint distribution between annotated labels and lossless labels, the element in row n, column m of the initial confidence matrix representing the number of corpora whose annotated label is y_n and whose predicted label is y_m;
a category normalization subunit, configured to perform per-category normalization on the elements of the initial confidence matrix, based on the total number of corpora under each label category in the initial corpus set, to obtain a category-normalized confidence matrix; and
an overall normalization subunit, configured to perform overall normalization on the elements of the category-normalized confidence matrix to obtain the final confidence matrix, the normalization processing thus comprising the per-category normalization and the overall normalization, the element in row n, column m of the final confidence matrix representing the normalized number of corpora annotated as y_n and predicted as y_m, and the sum of all elements in the final confidence matrix being 1.
12. The apparatus of claim 8, wherein the noise corpus acquisition module comprises:
a set acquisition unit, configured to acquire the corpora corresponding to the non-zero off-diagonal elements of the confidence matrix, to obtain a high-noise corpus set;
a subset obtaining unit, configured to acquire, from the high-noise corpus set, the corpora whose annotated label is y_n and whose prediction probability for label y_m exceeds the confidence of label y_m, these corpora forming a high-noise corpus subset S_nm, wherein y_n is not equal to y_m;
a corpus sorting unit, configured to sort all corpora in the high-noise corpus subset S_nm by prediction probability; and
a corpus screening unit, configured to screen, starting from the corpus with the smallest prediction probability, corpora of the high-noise corpus subset S_nm as the noise corpora of the initial corpus set, the number of noise corpora being the product of a preset noise reduction proportion, the number of corpora in the high-noise corpus set, and the target element of the confidence matrix, the target element being the element whose annotated label is y_n and whose predicted label is y_m.
13. The apparatus of claim 8, wherein the noise corpus processing module comprises:
a corpus rejecting unit, configured to remove the noise corpora from the initial corpus set; or
a corpus correction unit, configured to correct the erroneous labels of the noise corpora in the initial corpus set.
14. The apparatus of claim 8, wherein the proportion of noise corpora in the initial corpus set does not exceed 50%.
15. An electronic device, comprising:
A processor;
A memory for storing a computer program executable by the processor;
wherein the processor is configured to execute the computer program in the memory to implement the method according to any one of claims 1-7.
16. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110852412.8A | 2021-07-27 | 2021-07-27 | Corpus noise reduction method and device, electronic equipment and storage medium


Publications (2)

Publication Number | Publication Date
CN113593591A (en) | 2021-11-02
CN113593591B (en) | 2024-06-11

Families Citing this family (1)

Publication Number | Priority Date | Publication Date | Assignee | Title
CN114117056B | 2022-01-29 | 2022-04-08 | Tencent Technology (Shenzhen) Co., Ltd. | Training data processing method and device and storage medium

Family Cites Families (1)

Publication Number | Priority Date | Publication Date | Assignee | Title
US10672414B2 | 2018-04-13 | 2020-06-02 | Microsoft Technology Licensing, LLC | Systems, methods, and computer-readable media for improved real-time audio processing

Patent Citations (7)

Publication Number | Priority Date | Publication Date | Assignee | Title
JP2008292858A | 2007-05-25 | 2008-12-04 | Advanced Telecommunication Research Institute International | Noise suppressing device, computer program, and voice recognition system
JP2016143043A | 2015-02-05 | 2016-08-08 | Nippon Telegraph and Telephone Corporation | Speech model learning method, noise suppression method, speech model learning system, noise suppression system, speech model learning program, and noise suppression program
CN108615533A | 2018-03-28 | 2018-10-02 | Tianjin University | High-performance speech enhancement method based on deep learning
CN110705607A | 2019-09-12 | 2020-01-17 | Xi'an Jiaotong University | Industry multi-label noise reduction method based on cyclic re-labeling self-service method
CN110634497A | 2019-10-28 | 2019-12-31 | TP-Link Technologies Co., Ltd. | Noise reduction method and device, terminal equipment and storage medium
CN110853664A | 2019-11-22 | 2020-02-28 | Beijing Xiaomi Mobile Software Co., Ltd. | Method and device for evaluating performance of speech enhancement algorithm and electronic equipment
CN112632278A | 2020-12-18 | 2021-04-09 | Ping An Puhui Enterprise Management Co., Ltd. | Labeling method, device, equipment and storage medium based on multi-label classification

Non-Patent Citations (2)

Tian Feng; Shen Xukun. "An image semantic annotation method suitable for weakly labeled datasets." Journal of Software (软件学报), No. 10, 2013-10-15.
Li Xiangdong; Ba Zhichao; Huang Li. "A noise handling method based on category data distribution characteristics in text classification." New Technology of Library and Information Service (现代图书情报技术), No. 11, 2014-11-25.

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant