CN111554270A - Training sample screening method and electronic equipment - Google Patents

Training sample screening method and electronic equipment

Info

Publication number
CN111554270A
Authority
CN
China
Prior art keywords
voice
label
training sample
sample set
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010354551.3A
Other languages
Chinese (zh)
Other versions
CN111554270B (en)
Inventor
许孝先
冯大航
陈孝良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010354551.3A
Publication of CN111554270A
Application granted
Publication of CN111554270B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention provides a training sample screening method and electronic equipment. The method includes: inputting a training sample set into a first network model for setting labels, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples; respectively determining the duration of the voice signal corresponding to each label in the at least one label; and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label, to obtain a screened training sample set. Embodiments of the invention can improve the accuracy of the trained model.

Description

Training sample screening method and electronic equipment
Technical Field
The invention relates to the technical field of voice processing, in particular to a training sample screening method and electronic equipment.
Background
Natural language refers to a language that has evolved naturally for communication between humans. Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics that focuses on the interactions between computers and human (natural) language. Natural language processing technology can process voice based on a network model to meet the requirements of various use scenarios; for example, in a voice recognition scenario, voice can be converted into text based on an acoustic model, a language model, and a decoder.
Before voice is processed based on the network model, the model needs to be trained using training samples. During training, the training samples are input into the network model frame by frame, and the parameters of the network model are updated based on the target output corresponding to at least one input frame of voice. When dirty data is present in the training samples, the target output may not be determined accurately, resulting in lower accuracy of the trained model.
Disclosure of Invention
Embodiments of the invention provide a training sample screening method and electronic equipment, to solve the problem in the prior art that, when dirty data exists in a training sample, the target output cannot be determined accurately, so that the accuracy of the trained model is low.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a training sample screening method, where the method includes:
inputting a training sample set into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples;
respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
In a second aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
a first input module, configured to input a training sample set into a first network model for setting a label, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples, and each label in the at least one label corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples, respectively;
the determining module is used for respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and the first screening module is used for screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps in the training sample screening method according to the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the training sample screening method according to the first aspect.
In the embodiment of the invention, a training sample set is input into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples; respectively determining the duration of the voice signal corresponding to each tag in the at least one tag; and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set. Therefore, possibly existing dirty data in the training sample can be screened out based on the duration of the voice signal corresponding to each label in the at least one label, so that the model is trained by adopting the screened training sample set, and the accuracy of the trained model can be improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a training sample screening method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 3 is a second schematic structural diagram of an electronic device according to an embodiment of the invention;
fig. 4 is a third schematic structural diagram of an electronic apparatus according to an embodiment of the present invention;
fig. 5 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted mobile terminal, a wearable device, a server, a pedometer, and the like.
Referring to fig. 1, fig. 1 is a flowchart of a training sample screening method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
Step 101: inputting a training sample set into a first network model for setting labels, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples.
All the voice samples in the training sample set may be input into the first network model, which sets labels for all the voice samples in the training sample set. All the voice samples in the training sample set may be input into the first network model frame by frame, so that the label corresponding to each frame of voice signal is obtained. The voice samples in the training sample set may be framed, and features may be extracted from each frame of voice signal; for example, MFCC (Mel-Frequency Cepstral Coefficient) features or PCEN (Per-Channel Energy Normalization) features may be extracted, and the voice frames with extracted features may be input into the first network model. Taking phoneme labels as an example, in a certain voice sample the phoneme label corresponding to the first to tenth frames of voice signals may be "k", and the phoneme label corresponding to the eleventh to eighteenth frames may be "i".
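For illustration only, the following Python sketch shows what this framing and feature-extraction step might look like. It is a minimal sketch assuming the librosa library for MFCC extraction; the 25 ms window, 10 ms hop, and the align_model object standing in for the first network model are all hypothetical choices, not values taken from this disclosure.

    import librosa

    def frame_level_labels(wav_path, align_model, sr=16000):
        """Frame a voice sample, extract MFCC features, and obtain one
        label per frame from the (hypothetical) alignment model."""
        y, _ = librosa.load(wav_path, sr=sr)
        # 25 ms analysis window with a 10 ms hop, a common framing choice.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),
                                    hop_length=int(0.010 * sr))
        # `align_model` stands in for the first network model; it is assumed
        # to return one phoneme label per frame, e.g. ['k', 'k', ..., 'i'].
        return align_model.predict(mfcc.T)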
In addition, the first network model may be a neural network model; for example, it may be a DNN (Deep Neural Network) model, a DNN-HMM (Deep Neural Network-Hidden Markov Model) model, a CNN (Convolutional Neural Network) model, or the like. The first network model is used to set labels, that is, to implement the alignment operation. A GMM (Gaussian Mixture Model)-HMM network model or another neural network model may be used for training to obtain the first network model. The labels may be phoneme labels, for example mono-phoneme labels or tri-phoneme labels, or they may be other labels.
Step 102: respectively determining the duration of the voice signal corresponding to each label in the at least one label.
If a label corresponds to one frame of voice signal, the duration of the voice signal corresponding to that label may be the duration of one frame; if a label corresponds to multiple frames of continuous voice signals, the duration of the voice signal corresponding to that label may be the total duration of those continuous frames. For example, in a certain voice sample in which the phoneme label corresponding to the first to tenth frames of voice signals is "k", the duration of the voice signal corresponding to the "k" label may be the total duration of the ten frames; if the duration of one frame of voice signal is 25 ms, the duration of the voice signal corresponding to the "k" label may be 250 ms.
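A small sketch of this duration computation, under the simplifying assumption from the example above that every frame lasts 25 ms and that the per-frame labels are given as a plain Python list:

    from itertools import groupby

    FRAME_MS = 25  # assumed per-frame duration, as in the example above

    def label_durations(frame_labels):
        """Collapse a per-frame label sequence into (label, duration_ms)
        runs, e.g. ['k']*10 + ['i']*8 -> [('k', 250), ('i', 200)]."""
        return [(label, FRAME_MS * len(list(run)))
                for label, run in groupby(frame_labels)]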
Step 103: screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label, to obtain a screened training sample set.
The at least one label may include K different kinds of labels, where K is a positive integer. In this case, screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label may be performed as follows: the average duration of the voice signal corresponding to each of the K kinds of labels is calculated based on the duration of the voice signal corresponding to each label in the at least one label, and the voice samples in the training sample set are then screened based on the average duration of the voice signal corresponding to each of the K kinds of labels, to obtain the screened training sample set.
Alternatively, screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label may be performed as follows: if a second label exists in the at least one label, the voice sample to which the voice signal corresponding to the second label belongs is deleted from the training sample set, where the duration of the voice signal corresponding to the second label is greater than a preset duration.
In a practical application such as voice recognition, voice can be converted into text through a neural network model. Before voice recognition, the neural network model may be trained using a training sample set. The voice samples in the training sample set may be framed; features such as MFCC features or PCEN features may be extracted from each frame, and the voice frames with extracted features may be input into the neural network model for training, where the learning target of the neural network model may be a phoneme. For a neural network model trained frame by frame, the approximately correct target for each frame must be determined, i.e., the alignment operation. When dirty data exists in the training sample set, for example when noise is present in the samples, the labels of the noisy part may be off by tens or even hundreds of frames, which is a serious problem: a model trained on a training sample set containing dirty data has low accuracy. In the embodiment of the invention, dirty data that may exist in the training samples is screened out based on the duration of the voice signal corresponding to each label in the at least one label, so that the model is trained with the screened training sample set and the accuracy of the trained model can be improved.
In the embodiment of the invention, a training sample set is input into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples; respectively determining the duration of the voice signal corresponding to each tag in the at least one tag; and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set. Therefore, possibly existing dirty data in the training sample can be screened out based on the duration of the voice signal corresponding to each label in the at least one label, so that the model is trained by adopting the screened training sample set, and the accuracy of the trained model can be improved.
Optionally, the at least one label includes K different kinds of labels, where K is a positive integer, and the screening of the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label, to obtain a screened training sample set, includes:
respectively calculating the average time length of the voice signal corresponding to each label in the K kinds of labels based on the time length of the voice signal corresponding to each label in the at least one label;
and screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each label in the K labels to obtain a screened training sample set.
The at least one label may include multiple identical labels, which belong to the same kind of label. For example, with phoneme labels, the at least one label may include a label a, a label b, a label c, and a label d, where label a may be the label "k" corresponding to the first to tenth frames of voice signals in a first voice sample, label b may be the label "i" corresponding to the eleventh to eighteenth frames in the first voice sample, label c may be the label "k" corresponding to the twenty-first to twenty-eighth frames in a second voice sample, and label d may be the label "i" corresponding to the thirty-first to fortieth frames in the second voice sample. Label a and label c can then be regarded as one of the K kinds of labels, and label b and label d as another of the K kinds of labels.
In addition, each kind of label may include one label or multiple labels. If a certain kind of label includes one label, the average duration of the voice signal corresponding to that kind may be the duration of the voice signal corresponding to the label it includes; if a certain kind of label includes m labels, the average duration of the voice signal corresponding to that kind may be the quotient of the total duration of the voice signals corresponding to the m labels and m, where m is a positive integer greater than 1. For example, if a certain kind of label includes label a and label c, the duration of the voice signal corresponding to label a is 250 ms, and the duration of the voice signal corresponding to label c is 200 ms, then the average duration of the voice signal corresponding to that kind of label may be: (250+200)/2, i.e., 225 ms.
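The average-duration computation can be sketched as follows, taking a flat list of (label, duration) occurrences gathered from every voice sample, e.g. the output of the hypothetical label_durations helper above:

    from collections import defaultdict

    def average_durations(all_runs):
        """Average duration per kind of label over all of its occurrences,
        e.g. [('k', 250), ('i', 200), ('k', 200)] -> {'k': 225.0, 'i': 200.0}."""
        totals = defaultdict(lambda: [0.0, 0])
        for label, duration in all_runs:
            totals[label][0] += duration
            totals[label][1] += 1
        return {label: total / count
                for label, (total, count) in totals.items()}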
Further, the screening of the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels may be performed as follows: if a target label exists in the at least one label, the voice sample to which the voice signal corresponding to the target label belongs is deleted from the training sample set, where the duration of the voice signal corresponding to the target label is greater than a first threshold or less than a second threshold; the first threshold is the product of a first preset coefficient and the average duration of the voice signal corresponding to the kind of label to which the target label belongs, the second threshold is the product of a second preset coefficient and that same average duration, and the first preset coefficient is greater than the second preset coefficient.
In this embodiment, screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels improves the screening accuracy; training the model with the screened training sample set can therefore further improve the accuracy of the trained model.
Optionally, the screening of the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels, to obtain a screened training sample set, includes:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
The first preset coefficient may be a number greater than 1 and the second preset coefficient a number less than 1, so as to accommodate the different speaking rates of different people. The first preset coefficient may be 2, 5, 10, and so on, and the second preset coefficient may be 0.1, 0.3, 0.7, and so on. In a practical application, the first preset coefficient may be set to 3 and the second preset coefficient to 1/3, so that misaligned voice samples can be screened out while harming the remaining voice samples in the training sample set as little as possible.
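A sketch of this threshold test, using the coefficients 3 and 1/3 suggested above; the sample_runs input, mapping sample identifiers to their (label, duration) runs, is an assumed format rather than anything fixed by this disclosure:

    HIGH_COEFF, LOW_COEFF = 3.0, 1.0 / 3.0  # first / second preset coefficients

    def screen_by_duration(sample_runs, avg_ms):
        """Keep only samples in which every label's duration lies within
        [LOW_COEFF * avg, HIGH_COEFF * avg] for that kind of label."""
        return {sample_id: runs
                for sample_id, runs in sample_runs.items()
                if all(LOW_COEFF * avg_ms[label] <= duration
                       <= HIGH_COEFF * avg_ms[label]
                       for label, duration in runs)}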
In this embodiment, if a target label exists in the at least one label, the voice sample to which the voice signal corresponding to the target label belongs is deleted from the training sample set, so that dirty data possibly present in the training samples can be screened out more accurately and the performance of model training is improved.
Optionally, the at least one label includes a mute label and a non-mute label, and the method further includes:
inputting the voice samples in the screened training sample set into a second network model for human voice recognition, to obtain human voice segments;
and screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
The second network model is used for human voice recognition; it may be a neural network model, for example a DNN model, a CNN model, or a recurrent neural network model, and so on. The voice samples in the screened training sample set may be input into the second network model frame by frame, and the second network model may predict whether each input frame is a voice signal of a human voice segment, a signal of a mute segment, or a noise signal. Screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels may be performed by comparing each frame of voice signal in a human voice segment with the voice signal corresponding to the non-mute label, and screening the voice samples based on the comparison result. For example, if the deviation between the frame range of the human voice segment and the frame range of the voice signal corresponding to the non-mute label exceeds a preset value, the voice sample to which the human voice segment belongs may be deleted, where the preset value may be 3, 5, 10, and so on.
In this embodiment, the voice samples in the screened training sample set are input into the second network model for human voice recognition to obtain human voice segments, and the voice samples in the screened training sample set are screened based on the human voice segments and the voice signals corresponding to the non-mute labels; thus a first screening can be performed based on the first network model, and a second screening based on the first network model together with the second network model.
Optionally, the screening of the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels includes:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, where the first label is the non-mute label corresponding to the human voice segment and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, where M is greater than a second preset value.
The values of N and M may be the same or different; each may be 3, 5, 10, and so on. For example, suppose N and M are both 5, and the training sample set includes a third voice sample whose human voice segments are obtained through the second network model. Taking the first human voice segment of the third voice sample as an example, it may include the first to fiftieth frames of voice signals of the third voice sample. The labels of the third voice sample are obtained through the first network model; the first non-mute label of the third voice sample may be the non-mute label corresponding to the first human voice segment and may correspond to the first to sixtieth frames of voice signals of the third voice sample. Ten frames of voice signals then exist between the last frame of voice signal in the first human voice segment and the last frame of the voice signal corresponding to the first non-mute label, which is greater than M, so the third voice sample may be deleted from the screened training sample set.
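A sketch of this second screening test, under the assumption that each human voice segment and its corresponding non-mute label span are given as (start_frame, end_frame) pairs that have already been matched up in order:

    N = M = 5  # first / second preset values; 5 is the illustrative choice above

    def passes_second_screening(voice_segments, label_spans):
        """Return False (sample should be deleted) if any matched pair of
        human voice segment and non-mute label span disagrees by more than
        N frames at the start or M frames at the end."""
        for (seg_start, seg_end), (lab_start, lab_end) in zip(voice_segments,
                                                              label_spans):
            if abs(seg_start - lab_start) > N or abs(seg_end - lab_end) > M:
                return False
        return True

For instance, a human voice segment spanning frames 1 to 50 against a non-mute label span of frames 1 to 60, as in the example above, differs by 10 frames at the end, so the sample would be deleted.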
In addition, the first network model can be regarded as aligning the voice samples in the training sample set during the first screening, and the second network model as aligning the voice samples in the once-screened training sample set during the second screening. Through these two alignments, dirty data that may be present in the training samples can be screened out.
In this embodiment, if N frames of voice signals exist between the first frame of a human voice segment and the first frame of the voice signal corresponding to the first label, or M frames of voice signals exist between the last frame of the human voice segment and the last frame of the voice signal corresponding to the first label, the voice sample to which the human voice segment belongs is deleted from the screened training sample set. The prediction results of the first network model and the second network model thereby verify each other, the screened training sample set is screened again, and dirty data that may still exist in the training samples can be further screened out, so that training the model with the re-screened training sample set can further improve the accuracy of the trained model.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 2, the electronic device 200 includes:
a first input module 201, configured to input a training sample set into a first network model for setting a label, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples, and each label in the at least one label corresponds to a frame of voice signal or multiple frames of continuous voice signals of the at least two voice samples, respectively;
a determining module 202, configured to determine a duration of a voice signal corresponding to each tag in the at least one tag respectively;
the first screening module 203 is configured to screen the voice samples in the training sample set based on a duration of the voice signal corresponding to each tag in the at least one tag, so as to obtain a screened training sample set.
Optionally, the at least one tag includes K different tags, where K is a positive integer, and as shown in fig. 3, the first filtering module 203 includes:
a calculating unit 2031, configured to calculate an average duration of the voice signal corresponding to each of the K kinds of tags based on a duration of the voice signal corresponding to each of the at least one tag;
the screening unit 2032 is configured to screen the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels, so as to obtain a screened training sample set.
Optionally, the screening unit 2031 is specifically configured to:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
Optionally, the at least one label includes a mute label and a non-mute label, and as shown in fig. 4, the electronic device 200 further includes:
a second input module 204, configured to input the voice samples in the screened training sample set into a second network model for human voice recognition, so as to obtain human voice segments;
a second screening module 205, configured to screen the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
Optionally, the second screening module 205 is specifically configured to:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, delete the voice sample to which the human voice segment belongs from the screened training sample set, where the first label is the non-mute label corresponding to the human voice segment and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, delete the voice sample to which the human voice segment belongs from the screened training sample set, where M is greater than a second preset value.
The electronic device can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Referring to fig. 5, fig. 5 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 5, the electronic device 300 includes: a memory 302, a processor 301, and a program stored on the memory 302 and executable on the processor 301, wherein:
the processor 301 reads the program in the memory 302 for executing:
inputting a training sample set into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples;
respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
Optionally, the at least one tag includes K different kinds of tags, where K is a positive integer, and the processor 301 is configured to perform the step of screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each tag in the at least one tag, to obtain a screened training sample set, where the step includes:
respectively calculating the average time length of the voice signal corresponding to each label in the K kinds of labels based on the time length of the voice signal corresponding to each label in the at least one label;
and screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each label in the K labels to obtain a screened training sample set.
Optionally, the processor 301 is configured to perform the step of screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of labels, to obtain a screened training sample set, where the step includes:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
Optionally, the at least one label includes a mute label and a non-mute label, and the processor 301 is further configured to perform:
inputting the voice samples in the screened training sample set into a second network model for human voice recognition, to obtain human voice segments;
and screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
Optionally, the screening performed by the processor 301 of the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels includes:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, where the first label is the non-mute label corresponding to the human voice segment and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, where M is greater than a second preset value.
In fig. 5, the bus architecture may include any number of interconnected buses and bridges, with one or more processors represented by processor 301 and various circuits of memory represented by memory 302 being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface.
The processor 301 is responsible for managing the bus architecture and general processing, and the memory 302 may store data used by the processor 301 in performing operations.
It should be noted that any implementation manner in the method embodiment of the present invention may be implemented by the electronic device in this embodiment, and achieve the same beneficial effects, and details are not described here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above training sample screening method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (11)

1. A training sample screening method, comprising:
inputting a training sample set into a first network model for setting labels to obtain at least one label corresponding to the training sample set, wherein the training sample set comprises at least two voice samples, and each label in the at least one label respectively corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples;
respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
2. The method of claim 1, wherein the at least one tag includes K different tags, where K is a positive integer, and the screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each tag in the at least one tag to obtain a screened training sample set includes:
respectively calculating the average time length of the voice signal corresponding to each label in the K kinds of labels based on the time length of the voice signal corresponding to each label in the at least one label;
and screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each label in the K labels to obtain a screened training sample set.
3. The method according to claim 2, wherein the screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each of the K kinds of tags to obtain a screened training sample set includes:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
4. The method of claim 1, wherein the at least one label comprises a mute label and a non-mute label, the method further comprising:
inputting the voice samples in the screened training sample set into a second network model for human voice recognition to obtain human voice segments;
and screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
5. The method according to claim 4, wherein the screening the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels comprises:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, wherein the first label is the non-mute label corresponding to the human voice segment, and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, deleting the voice sample to which the human voice segment belongs from the screened training sample set, wherein M is greater than a second preset value.
6. An electronic device, characterized in that the electronic device comprises:
a first input module, configured to input a training sample set into a first network model for setting a label, to obtain at least one label corresponding to the training sample set, where the training sample set includes at least two voice samples, and each label in the at least one label corresponds to one frame of voice signals or multiple frames of continuous voice signals of the at least two voice samples, respectively;
the determining module is used for respectively determining the duration of the voice signal corresponding to each tag in the at least one tag;
and the first screening module is used for screening the voice samples in the training sample set based on the duration of the voice signal corresponding to each label in the at least one label to obtain a screened training sample set.
7. The electronic device of claim 6, wherein the at least one tag comprises K different tags, K being a positive integer, the first filtering module comprising:
the calculation unit is used for respectively calculating the average duration of the voice signals corresponding to each label in the K kinds of labels based on the duration of the voice signals corresponding to each label in the at least one label;
and the screening unit is used for screening the voice samples in the training sample set based on the average duration of the voice signal corresponding to each label in the K kinds of labels to obtain the screened training sample set.
8. The electronic device of claim 7, wherein the screening unit is specifically configured to:
if a target label exists in the at least one label, deleting the voice sample to which the voice signal corresponding to the target label belongs from the training sample set;
the duration of the voice signal corresponding to the target tag is greater than a first threshold, or the duration of the voice signal corresponding to the target tag is less than a second threshold, the first threshold is a product of a first preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, the second threshold is a product of a second preset coefficient and an average duration of the voice signal corresponding to the tag of the category to which the target tag belongs, and the first preset coefficient is greater than the second preset coefficient.
9. The electronic device of claim 6, wherein the at least one label comprises a mute label and a non-mute label, the electronic device further comprising:
a second input module, configured to input the voice samples in the screened training sample set into a second network model for human voice recognition to obtain human voice segments;
and a second screening module, configured to screen the voice samples in the screened training sample set based on the human voice segments and the voice signals corresponding to the non-mute labels.
10. The electronic device of claim 9, wherein the second screening module is specifically configured to:
if N frames of voice signals exist between the first frame of voice signal in a human voice segment and the first frame of the voice signal corresponding to a first label, delete the voice sample to which the human voice segment belongs from the screened training sample set, wherein the first label is the non-mute label corresponding to the human voice segment, and N is greater than a first preset value;
or, if M frames of voice signals exist between the last frame of voice signal in the human voice segment and the last frame of the voice signal corresponding to the first label, delete the voice sample to which the human voice segment belongs from the screened training sample set, wherein M is greater than a second preset value.
11. An electronic device, comprising: memory, a processor and a program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps in the training sample screening method of any one of claims 1 to 5.
CN202010354551.3A 2020-04-29 2020-04-29 Training sample screening method and electronic equipment Active CN111554270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010354551.3A CN111554270B (en) 2020-04-29 2020-04-29 Training sample screening method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010354551.3A CN111554270B (en) 2020-04-29 2020-04-29 Training sample screening method and electronic equipment

Publications (2)

Publication Number Publication Date
CN111554270A 2020-08-18
CN111554270B 2023-04-18

Family

ID=72007826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010354551.3A Active CN111554270B (en) 2020-04-29 2020-04-29 Training sample screening method and electronic equipment

Country Status (1)

Country Link
CN (1) CN111554270B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040260550A1 (en) * 2003-06-20 2004-12-23 Burges Chris J.C. Audio processing system and method for classifying speakers in audio data
CN103594086A (en) * 2013-10-25 2014-02-19 鸿富锦精密工业(深圳)有限公司 Voice processing system, device and method
CN109376264A (en) * 2018-11-09 2019-02-22 广州势必可赢网络科技有限公司 A kind of audio-frequency detection, device, equipment and computer readable storage medium
CN109584886A (en) * 2018-12-04 2019-04-05 科大讯飞股份有限公司 Identity identifying method, device, equipment and storage medium based on Application on Voiceprint Recognition
CN110288976A (en) * 2019-06-21 2019-09-27 北京声智科技有限公司 Data screening method, apparatus and intelligent sound box
CN110297909A (en) * 2019-07-05 2019-10-01 中国工商银行股份有限公司 A kind of classification method and device of no label corpus
CN110598797A (en) * 2019-09-18 2019-12-20 北京明略软件系统有限公司 Fault detection method and device, storage medium and electronic device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN112561080A (en) * 2020-12-18 2021-03-26 Oppo(重庆)智能科技有限公司 Sample screening method, sample screening device and terminal equipment
CN112561080B (en) * 2020-12-18 2023-03-03 Oppo(重庆)智能科技有限公司 Sample screening method, sample screening device and terminal equipment

Also Published As

Publication number Publication date
CN111554270B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
CN108428446A (en) Audio recognition method and device
CN106297800B (en) Self-adaptive voice recognition method and equipment
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN110277088B (en) Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
JP2023542685A (en) Speech recognition method, speech recognition device, computer equipment, and computer program
CN111653274B (en) Wake-up word recognition method, device and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN113096647B (en) Voice model training method and device and electronic equipment
CN111554270B (en) Training sample screening method and electronic equipment
CN111832308A (en) Method and device for processing consistency of voice recognition text
CN112216284B (en) Training data updating method and system, voice recognition method and system and equipment
CN114127849A (en) Speech emotion recognition method and device
US11410685B1 (en) Method for detecting voice splicing points and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN111785256A (en) Acoustic model training method and device, electronic equipment and storage medium
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN114283828A (en) Training method of voice noise reduction model, voice scoring method, device and medium
CN112767928A (en) Voice understanding method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant