CN111554270A - Training sample screening method and electronic equipment - Google Patents

Training sample screening method and electronic equipment

Info

Publication number
CN111554270A
Authority
CN
China
Prior art keywords
voice
label
training sample
sample set
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010354551.3A
Other languages
Chinese (zh)
Other versions
CN111554270B (en)
Inventor
许孝先
冯大航
陈孝良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010354551.3A
Publication of CN111554270A
Application granted
Publication of CN111554270B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention provides a training sample screening method and electronic equipment. The method includes: inputting a training sample set into a first network model for setting labels, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples; respectively determining the duration of the voice signal corresponding to each label in the at least one label; and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label, to obtain a screened training sample set. Embodiments of the invention can improve the accuracy of the trained model.

Description

Training sample screening method and electronic equipment
Technical Field
The invention relates to the technical field of voice processing, in particular to a training sample screening method and electronic equipment.
Background
Natural language refers to a language that has evolved naturally for communication between humans. Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics that focuses on the interactions between computers and human (natural) language. Natural language processing technology can process voice based on a network model to meet the requirements of various use scenarios; for example, in a voice recognition scenario, voice can be converted into text based on an acoustic model, a language model, and a decoder.
Before voice is processed based on the network model, the model needs to be trained using training samples. During training, the training samples are input into the network model frame by frame, and the parameters of the network model are updated based on the target output corresponding to at least one input frame of voice. When dirty data is present in the training samples, the target output may not be determined accurately, resulting in lower accuracy of the trained model.
Disclosure of Invention
Embodiments of the invention provide a training sample screening method and electronic equipment, to solve the problem in the prior art that, when dirty data exists in a training sample, the target output cannot be determined accurately, so that the accuracy of the trained model is low.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a training sample screening method, where the method includes:
inputting a training sample set into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples;
respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
In a second aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
a first input module, configured to input a training sample set into a first network model for setting a label, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples, and each label in the at least one label corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples, respectively;
the determining module is used for respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and the first screening module is used for screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps in the training sample screening method according to the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the training sample screening method according to the first aspect.
In the embodiment of the invention, a training sample set is input into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples; respectively determining the duration of the voice signal corresponding to each tag in the at least one tag; and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set. Therefore, possibly existing dirty data in the training sample can be screened out based on the duration of the voice signal corresponding to each label in the at least one label, so that the model is trained by adopting the screened training sample set, and the accuracy of the trained model can be improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a training sample screening method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 3 is a second schematic structural diagram of an electronic device according to an embodiment of the invention;
fig. 4 is a third schematic structural diagram of an electronic apparatus according to an embodiment of the present invention;
fig. 5 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted mobile terminal, a wearable device, a server, a pedometer, and the like.
Referring to fig. 1, fig. 1 is a flowchart of a training sample screening method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
Step 101: inputting a training sample set into a first network model for setting labels, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples.
All the voice samples in the training sample set may be input into the first network model, which sets labels for all the voice samples in the training sample set. All the voice samples in the training sample set may be input into the first network model frame by frame, so that the label corresponding to each frame of voice signal is obtained. The voice samples in the training sample set may be framed, and features may be extracted from each frame of voice signal; for example, MFCC (Mel-Frequency Cepstral Coefficient) features or PCEN (Per-Channel Energy Normalization) features may be extracted, and the voice frames with extracted features may be input into the first network model. Taking phoneme labels as an example, in a certain voice sample the phoneme label corresponding to the first to tenth frames of voice signals may be "k", and the phoneme label corresponding to the eleventh to eighteenth frames may be "i".
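For illustration only, the following Python sketch shows what this framing and feature-extraction step might look like. It is a minimal sketch assuming the librosa library for MFCC extraction; the 25 ms window, 10 ms hop, and the align_model object standing in for the first network model are all hypothetical choices, not values taken from this disclosure.

    import librosa

    def frame_level_labels(wav_path, align_model, sr=16000):
        """Frame a voice sample, extract MFCC features, and obtain one
        label per frame from the (hypothetical) alignment model."""
        y, _ = librosa.load(wav_path, sr=sr)
        # 25 ms analysis window with a 10 ms hop, a common framing choice.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),
                                    hop_length=int(0.010 * sr))
        # `align_model` stands in for the first network model; it is assumed
        # to return one phoneme label per frame, e.g. ['k', 'k', ..., 'i'].
        return align_model.predict(mfcc.T)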
In addition, the first network model may be a neural network model; for example, it may be a DNN (Deep Neural Network) model, a DNN-HMM (Deep Neural Network-Hidden Markov Model) model, a CNN (Convolutional Neural Network) model, or the like. The first network model is used to set labels, that is, to implement the alignment operation. A GMM (Gaussian Mixture Model)-HMM network model or another neural network model may be used for training to obtain the first network model. The labels may be phoneme labels, for example mono-phoneme labels or tri-phoneme labels, or they may be other labels.
Step 102: respectively determining the duration of the voice signal corresponding to each label in the at least one label.
If a label corresponds to one frame of voice signal, the duration of the voice signal corresponding to that label may be the duration of one frame; if a label corresponds to multiple frames of continuous voice signals, the duration of the voice signal corresponding to that label may be the total duration of those continuous frames. For example, in a certain voice sample in which the phoneme label corresponding to the first to tenth frames of voice signals is "k", the duration of the voice signal corresponding to the "k" label may be the total duration of the ten frames; if the duration of one frame of voice signal is 25 ms, the duration of the voice signal corresponding to the "k" label may be 250 ms.
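A small sketch of this duration computation, under the simplifying assumption from the example above that every frame lasts 25 ms and that the per-frame labels are given as a plain Python list:

    from itertools import groupby

    FRAME_MS = 25  # assumed per-frame duration, as in the example above

    def label_durations(frame_labels):
        """Collapse a per-frame label sequence into (label, duration_ms)
        runs, e.g. ['k']*10 + ['i']*8 -> [('k', 250), ('i', 200)]."""
        return [(label, FRAME_MS * len(list(run)))
                for label, run in groupby(frame_labels)]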
Step 103: screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label, to obtain a screened training sample set.
The at least one label may include K different kinds of labels, where K is a positive integer. In this case, screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label may be performed as follows: the average duration of the voice signal corresponding to each of the K kinds of labels is calculated based on the duration of the voice signal corresponding to each label in the at least one label, and the voice samples in the training sample set are then screened based on the average duration of the voice signal corresponding to each of the K kinds of labels, to obtain the screened training sample set.
Alternatively, screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label may be performed as follows: if a second label exists in the at least one label, the voice sample to which the voice signal corresponding to the second label belongs is deleted from the training sample set, where the duration of the voice signal corresponding to the second label is greater than a preset duration.
In a practical application such as voice recognition, voice can be converted into text through a neural network model. Before voice recognition, the neural network model may be trained using a training sample set. The voice samples in the training sample set may be framed; features such as MFCC features or PCEN features may be extracted from each frame, and the voice frames with extracted features may be input into the neural network model for training, where the learning target of the neural network model may be a phoneme. For a neural network model trained frame by frame, the approximately correct target for each frame must be determined, i.e., the alignment operation. When dirty data exists in the training sample set, for example when noise is present in the samples, the labels of the noisy part may be off by tens or even hundreds of frames, which is a serious problem: a model trained on a training sample set containing dirty data has low accuracy. In the embodiment of the invention, dirty data that may exist in the training samples is screened out based on the duration of the voice signal corresponding to each label in the at least one label, so that the model is trained with the screened training sample set and the accuracy of the trained model can be improved.
In the embodiment of the invention, a training sample set is input into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples; respectively determining the duration of the voice signal corresponding to each tag in the at least one tag; and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set. Therefore, possibly existing dirty data in the training sample can be screened out based on the duration of the voice signal corresponding to each label in the at least one label, so that the model is trained by adopting the screened training sample set, and the accuracy of the trained model can be improved.
Optionally, the at least one label includes K different kinds of labels, where K is a positive integer, and the screening of the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label, to obtain a screened training sample set, includes:
respectively calculating the average time length of the voice signal corresponding to each label in the K kinds of labels based on the time length of the voice signal corresponding to each label in the at least one label;
and screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each label in the K labels to obtain a screened training sample set.
The at least one label may include multiple identical labels, which belong to the same kind of label. For example, with phoneme labels, the at least one label may include a label a, a label b, a label c, and a label d, where label a may be the label "k" corresponding to the first to tenth frames of voice signals in a first voice sample, label b may be the label "i" corresponding to the eleventh to eighteenth frames in the first voice sample, label c may be the label "k" corresponding to the twenty-first to twenty-eighth frames in a second voice sample, and label d may be the label "i" corresponding to the thirty-first to fortieth frames in the second voice sample. Label a and label c can then be regarded as one of the K kinds of labels, and label b and label d as another of the K kinds of labels.
In addition, each kind of label may include one label or multiple labels. If a certain kind of label includes one label, the average duration of the voice signal corresponding to that kind may be the duration of the voice signal corresponding to the label it includes; if a certain kind of label includes m labels, the average duration of the voice signal corresponding to that kind may be the quotient of the total duration of the voice signals corresponding to the m labels and m, where m is a positive integer greater than 1. For example, if a certain kind of label includes label a and label c, the duration of the voice signal corresponding to label a is 250 ms, and the duration of the voice signal corresponding to label c is 200 ms, then the average duration of the voice signal corresponding to that kind of label may be: (250+200)/2, i.e., 225 ms.
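The average-duration computation can be sketched as follows, taking a flat list of (label, duration) occurrences gathered from every voice sample, e.g. the output of the hypothetical label_durations helper above:

    from collections import defaultdict

    def average_durations(all_runs):
        """Average duration per kind of label over all of its occurrences,
        e.g. [('k', 250), ('i', 200), ('k', 200)] -> {'k': 225.0, 'i': 200.0}."""
        totals = defaultdict(lambda: [0.0, 0])
        for label, duration in all_runs:
            totals[label][0] += duration
            totals[label][1] += 1
        return {label: total / count
                for label, (total, count) in totals.items()}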
Further, the screening of the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels may be performed as follows: if a target label exists in the at least one label, the voice sample to which the voice signal corresponding to the target label belongs is deleted from the training sample set, where the duration of the voice signal corresponding to the target label is greater than a first threshold or less than a second threshold; the first threshold is the product of a first preset coefficient and the average duration of the voice signal corresponding to the kind of label to which the target label belongs, the second threshold is the product of a second preset coefficient and that same average duration, and the first preset coefficient is greater than the second preset coefficient.
In this embodiment, screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels improves the screening accuracy; training the model with the screened training sample set can therefore further improve the accuracy of the trained model.
Optionally, the screening of the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels, to obtain a screened training sample set, includes:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
The first preset coefficient may be a number greater than 1 and the second preset coefficient a number less than 1, so as to accommodate the different speaking rates of different people. The first preset coefficient may be 2, 5, 10, and so on, and the second preset coefficient may be 0.1, 0.3, 0.7, and so on. In a practical application, the first preset coefficient may be set to 3 and the second preset coefficient to 1/3, so that misaligned voice samples can be screened out while harming the remaining voice samples in the training sample set as little as possible.
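A sketch of this threshold test, using the coefficients 3 and 1/3 suggested above; the sample_runs input, mapping sample identifiers to their (label, duration) runs, is an assumed format rather than anything fixed by this disclosure:

    HIGH_COEFF, LOW_COEFF = 3.0, 1.0 / 3.0  # first / second preset coefficients

    def screen_by_duration(sample_runs, avg_ms):
        """Keep only samples in which every label's duration lies within
        [LOW_COEFF * avg, HIGH_COEFF * avg] for that kind of label."""
        return {sample_id: runs
                for sample_id, runs in sample_runs.items()
                if all(LOW_COEFF * avg_ms[label] <= duration
                       <= HIGH_COEFF * avg_ms[label]
                       for label, duration in runs)}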
In this embodiment, if a target label exists in the at least one label, the voice sample to which the voice signal corresponding to the target label belongs is deleted from the training sample set, so that dirty data possibly present in the training samples can be screened out more accurately and the performance of model training is improved.
Optionally, the at least one label includes a mute label and a non-mute label, and the method further includes:
inputting the voice samples in the screened training sample set into a second network model for human voice recognition, to obtain human voice segments;
and screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
The second network model is used for human voice recognition; it may be a neural network model, for example a DNN model, a CNN model, or a recurrent neural network model, and so on. The voice samples in the screened training sample set may be input into the second network model frame by frame, and the second network model may predict whether each input frame is a voice signal of a human voice segment, a signal of a mute segment, or a noise signal. Screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels may be performed by comparing each frame of voice signal in a human voice segment with the voice signal corresponding to the non-mute label, and screening the voice samples based on the comparison result. For example, if the deviation between the frame range of the human voice segment and the frame range of the voice signal corresponding to the non-mute label exceeds a preset value, the voice sample to which the human voice segment belongs may be deleted, where the preset value may be 3, 5, 10, and so on.
In this embodiment, the voice samples in the screened training sample set are input into the second network model for human voice recognition to obtain human voice segments, and the voice samples in the screened training sample set are screened based on the human voice segments and the voice signals corresponding to the non-mute labels; thus a first screening can be performed based on the first network model, and a second screening based on the first network model together with the second network model.
Optionally, the screening of the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels includes:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, where the first label is the non-mute label corresponding to the human voice segment and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, where M is greater than a second preset value.
The values of N and M may be the same or different; each may be 3, 5, 10, and so on. For example, suppose N and M are both 5, and the training sample set includes a third voice sample whose human voice segments are obtained through the second network model. Taking the first human voice segment of the third voice sample as an example, it may include the first to fiftieth frames of voice signals of the third voice sample. The labels of the third voice sample are obtained through the first network model; the first non-mute label of the third voice sample may be the non-mute label corresponding to the first human voice segment and may correspond to the first to sixtieth frames of voice signals of the third voice sample. Ten frames of voice signals then exist between the last frame of voice signal in the first human voice segment and the last frame of the voice signal corresponding to the first non-mute label, which is greater than M, so the third voice sample may be deleted from the screened training sample set.
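A sketch of this second screening test, under the assumption that each human voice segment and its corresponding non-mute label span are given as (start_frame, end_frame) pairs that have already been matched up in order:

    N = M = 5  # first / second preset values; 5 is the illustrative choice above

    def passes_second_screening(voice_segments, label_spans):
        """Return False (sample should be deleted) if any matched pair of
        human voice segment and non-mute label span disagrees by more than
        N frames at the start or M frames at the end."""
        for (seg_start, seg_end), (lab_start, lab_end) in zip(voice_segments,
                                                              label_spans):
            if abs(seg_start - lab_start) > N or abs(seg_end - lab_end) > M:
                return False
        return True

For instance, a human voice segment spanning frames 1 to 50 against a non-mute label span of frames 1 to 60, as in the example above, differs by 10 frames at the end, so the sample would be deleted.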
In addition, the first network model can be regarded as aligning the voice samples in the training sample set during the first screening, and the second network model as aligning the voice samples in the once-screened training sample set during the second screening. Through these two alignments, dirty data that may be present in the training samples can be screened out.
In this embodiment, if N frames of voice signals exist between the first frame of a human voice segment and the first frame of the voice signal corresponding to the first label, or M frames of voice signals exist between the last frame of the human voice segment and the last frame of the voice signal corresponding to the first label, the voice sample to which the human voice segment belongs is deleted from the screened training sample set. The prediction results of the first network model and the second network model thereby verify each other, the screened training sample set is screened again, and dirty data that may still exist in the training samples can be further screened out, so that training the model with the re-screened training sample set can further improve the accuracy of the trained model.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 2, the electronic device 200 includes:
a first input module 201, configured to input a training sample set into a first network model for setting a label, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples, and each label in the at least one label corresponds to a frame of voice signal or multiple frames of continuous voice signals of the at least two voice samples, respectively;
a determining module 202, configured to determine a duration of a voice signal corresponding to each tag in the at least one tag respectively;
the first screening module 203 is configured to screen the voice samples in the training sample set based on a duration of the voice signal corresponding to each tag in the at least one tag, so as to obtain a screened training sample set.
Optionally, the at least one tag includes K different tags, where K is a positive integer, and as shown in fig. 3, the first filtering module 203 includes:
a calculating unit 2031, configured to calculate an average duration of the voice signal corresponding to each of the K kinds of tags based on a duration of the voice signal corresponding to each of the at least one tag;
the screening unit 2032 is configured to screen the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels, so as to obtain a screened training sample set.
Optionally, the screening unit 2031 is specifically configured to:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
Optionally, the at least one label includes a mute label and a non-mute label, and as shown in fig. 4, the electronic device 200 further includes:
a second input module 204, configured to input the voice samples in the screened training sample set into a second network model for human voice recognition, so as to obtain human voice segments;
a second screening module 205, configured to screen the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
Optionally, the second screening module 205 is specifically configured to:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, delete the voice sample to which the human voice segment belongs from the screened training sample set, where the first label is the non-mute label corresponding to the human voice segment and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, delete the voice sample to which the human voice segment belongs from the screened training sample set, where M is greater than a second preset value.
The electronic device can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Referring to fig. 5, fig. 5 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 5, the electronic device 300 includes: a memory 302, a processor 301, and a program stored on the memory 302 and executable on the processor 301, wherein:
the processor 301 reads the program in the memory 302 for executing:
inputting a training sample set into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples;
respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
Optionally, the at least one tag includes K different kinds of tags, where K is a positive integer, and the processor 301 is configured to perform the step of screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each tag in the at least one tag, to obtain a screened training sample set, where the step includes:
respectively calculating the average time length of the voice signal corresponding to each label in the K kinds of labels based on the time length of the voice signal corresponding to each label in the at least one label;
and screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each label in the K labels to obtain a screened training sample set.
Optionally, the processor 301 is configured to perform the step of screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels, to obtain a screened training sample set, where the step includes:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
Optionally, the at least one label includes a mute label and a non-mute label, and the processor 301 is further configured to perform:
inputting the voice samples in the screened training sample set into a second network model for human voice recognition, to obtain human voice segments;
and screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
Optionally, the screening performed by the processor 301 of the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels includes:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, where the first label is the non-mute label corresponding to the human voice segment and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, where M is greater than a second preset value.
In fig. 5, the bus architecture may include any number of interconnected buses and bridges, with one or more processors represented by processor 301 and various circuits of memory represented by memory 302 being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface.
The processor 301 is responsible for managing the bus architecture and general processing, and the memory 302 may store data used by the processor 301 in performing operations.
It should be noted that any implementation manner in the method embodiment of the present invention may be implemented by the electronic device in this embodiment, and achieve the same beneficial effects, and details are not described here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above training sample screening method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (11)

1. A training sample screening method, comprising:
inputting a training sample set into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples;
respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
2. The method of claim 1, wherein the at least one tag includes K different tags, where K is a positive integer, and the screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each tag in the at least one tag to obtain a screened training sample set includes:
respectively calculating the average time length of the voice signal corresponding to each label in the K kinds of labels based on the time length of the voice signal corresponding to each label in the at least one label;
and screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each label in the K labels to obtain a screened training sample set.
3. The method according to claim 2, wherein the screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of tags to obtain a screened training sample set includes:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
4. The method of claim 1, wherein the at least one label comprises a mute label and a non-mute label, the method further comprising:
inputting the voice samples in the screened training sample set into a second network model for human voice recognition to obtain human voice segments;
and screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
5. The method according to claim 4, wherein the screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels comprises:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, wherein the first label is the non-mute label corresponding to the human voice segment, and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, wherein M is greater than a second preset value.
6. An electronic device, characterized in that the electronic device comprises:
a first input module, configured to input a training sample set into a first network model for setting a label, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples, and each label in the at least one label corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples, respectively;
the determining module is used for respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and the first screening module is used for screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
7. The electronic device of claim 6, wherein the at least one tag comprises K different tags, K being a positive integer, the first filtering module comprising:
the calculation unit is used for respectively calculating the average duration of the voice signals corresponding to each label in the K kinds of labels based on the duration of the voice signals corresponding to each label in the at least one label;
and the screening unit is used for screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each label in the K kinds of labels to obtain the screened training sample set.
8. The electronic device of claim 7, wherein the screening unit is specifically configured to:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
9. The electronic device of claim 6, wherein the at least one label comprises a mute label and a non-mute label, the electronic device further comprising:
a second input module, configured to input the voice samples in the screened training sample set into a second network model for human voice recognition to obtain human voice segments;
and a second screening module, configured to screen the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
10. The electronic device of claim 9, wherein the second screening module is specifically configured to:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, delete the voice sample to which the human voice segment belongs from the screened training sample set, wherein the first label is the non-mute label corresponding to the human voice segment, and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, delete the voice sample to which the human voice segment belongs from the screened training sample set, wherein M is greater than a second preset value.
11. An electronic device, comprising: memory, a processor and a program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps in the training sample screening method of any one of claims 1 to 5.
CN202010354551.3A 2020-04-29 2020-04-29 Training sample screening method and electronic equipment Active CN111554270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010354551.3A CN111554270B (en) 2020-04-29 2020-04-29 Training sample screening method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010354551.3A CN111554270B (en) 2020-04-29 2020-04-29 Training sample screening method and electronic equipment

Publications (2)

Publication Number Publication Date
CN111554270A 2020-08-18
CN111554270B 2023-04-18

Family

ID=72007826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010354551.3A Active CN111554270B (en) 2020-04-29 2020-04-29 Training sample screening method and electronic equipment

Country Status (1)

Country Link
CN (1) CN111554270B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040260550A1 (en) * 2003-06-20 2004-12-23 Burges Chris J.C. Audio processing system and method for classifying speakers in audio data
CN103594086A (en) * 2013-10-25 2014-02-19 鸿富锦精密工业(深圳)有限公司 Voice processing system, device and method
CN109376264A (en) * 2018-11-09 2019-02-22 广州势必可赢网络科技有限公司 A kind of audio-frequency detection, device, equipment and computer readable storage medium
CN109584886A (en) * 2018-12-04 2019-04-05 科大讯飞股份有限公司 Identity identifying method, device, equipment and storage medium based on Application on Voiceprint Recognition
CN110288976A (en) * 2019-06-21 2019-09-27 北京声智科技有限公司 Data screening method, apparatus and intelligent sound box
CN110297909A (en) * 2019-07-05 2019-10-01 中国工商银行股份有限公司 A kind of classification method and device of no label corpus
CN110598797A (en) * 2019-09-18 2019-12-20 北京明略软件系统有限公司 Fault detection method and device, storage medium and electronic device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN112561080A (en) * 2020-12-18 2021-03-26 Oppo(重庆)智能科技有限公司 Sample screening method, sample screening device and terminal equipment
CN112561080B (en) * 2020-12-18 2023-03-03 Oppo(重庆)智能科技有限公司 Sample screening method, sample screening device and terminal equipment

Also Published As

Publication number Publication date
CN111554270B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
CN108428446A (en) Audio recognition method and device
CN106297800B (en) Self-adaptive voice recognition method and equipment
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN110277088B (en) Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
JP2023542685A (en) Speech recognition method, speech recognition device, computer equipment, and computer program
CN111653274B (en) Wake-up word recognition method, device and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN113096647B (en) Voice model training method and device and electronic equipment
CN111554270B (en) Training sample screening method and electronic equipment
CN111832308A (en) Method and device for processing consistency of voice recognition text
CN112216284B (en) Training data updating method and system, voice recognition method and system and equipment
CN114127849A (en) Speech emotion recognition method and device
US11410685B1 (en) Method for detecting voice splicing points and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN111785256A (en) Acoustic model training method and device, electronic equipment and storage medium
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN114283828A (en) Training method of voice noise reduction model, voice scoring method, device and medium
CN112767928A (en) Voice understanding method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant