CN111210810A - Model training method and device - Google Patents


Info

Publication number
CN111210810A
CN111210810A (application CN201911304920.1A)
Authority
CN
China
Prior art keywords
noise
target
original
voice
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911304920.1A
Other languages
Chinese (zh)
Inventor
刘洋
唐大闰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Miaozhen Information Technology Co Ltd
Original Assignee
Miaozhen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Miaozhen Information Technology Co Ltd filed Critical Miaozhen Information Technology Co Ltd
Priority to CN201911304920.1A priority Critical patent/CN111210810A/en
Publication of CN111210810A publication Critical patent/CN111210810A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training

Abstract

The invention discloses a model training method and apparatus. The method includes: obtaining an original speech sample used to train an original recognition model; adding target noise, which comprises multiple types of noise, to the original speech sample to obtain a target speech sample; and training the original recognition model with both the original speech sample and the target speech sample to obtain a target recognition model whose recognition accuracy is greater than a first threshold. The invention addresses the technical problem of low model training efficiency in the related art.

Description

Model training method and device
Technical Field
The invention relates to the field of computers, in particular to a model training method and device.
Background
In the related art, high-quality speech samples are generally scarce when training a model on speech data. As a result, the training efficiency of the model is low.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a model training method and a model training device, which at least solve the technical problem of low model training efficiency in the related technology.
According to an aspect of an embodiment of the present invention, there is provided a model training method, including: obtaining an original voice sample, wherein the original voice sample is used for training an original recognition model; adding target noise into the original voice sample to obtain a target voice sample, wherein the target noise is noise of multiple types; and training the original recognition model by using the original voice sample and the target voice sample to obtain a target recognition model, wherein the recognition accuracy of the target recognition model is greater than a first threshold value.
As an alternative example, the adding noise to the original speech sample to obtain the target speech sample includes: acquiring a plurality of types of original noise; and adding the original noise of each type in the plurality of types of original noise into the original voice sample to obtain a plurality of target voice samples.
As an alternative example, the adding noise to the original speech sample to obtain the target speech sample includes: acquiring target noise, wherein the target noise is noise obtained by combining multiple types of original noise; splitting the original voice sample into M parts to obtain M first voice samples; and adding first noise to each first voice sample to obtain M target voice samples, wherein the first noise is noise which is cut from the target noise and has the same length as that of the first voice sample.
As an alternative example, the adding noise to the original speech sample to obtain the target speech sample includes: acquiring multiple types of original noise; dividing the multiple types of original noise into a first target noise and a second target noise according to the decibel level of each original noise, wherein the decibel level of the first target noise is greater than a predetermined decibel threshold, and the decibel level of the second target noise is less than or equal to the predetermined decibel threshold; splitting the original voice sample into M parts to obtain M first voice samples; adding a second noise to each first voice sample to obtain M first target voice samples, wherein the second noise is a noise which is cut from one of the first target noises and has the same length as the first voice sample; adding a third noise to each first voice sample to obtain M second target voice samples, wherein the third noise is a noise which is cut from one of the second target noises and has the same length as the first voice sample; and determining the M first target voice samples and the M second target voice samples as the target voice samples to obtain 2M target voice samples.
As an optional example, after the training the original recognition model by using the target speech sample to obtain the target recognition model, the method further includes: acquiring target voice to be recognized; inputting the target voice into the target recognition model, wherein the target recognition model is used for recognizing the type or content of the target voice; and acquiring a recognition result output by the target recognition model, wherein the recognition result comprises the type or the content of the target voice.
According to another aspect of the embodiments of the present invention, there is also provided a model training apparatus, including: the system comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring an original voice sample, and the original voice sample is used for training an original recognition model; an adding unit, configured to add a target noise to the original voice sample to obtain a target voice sample, where the target noise is a plurality of types of noise; and the training unit is used for training the original recognition model by using the original voice sample and the target voice sample to obtain a target recognition model, wherein the recognition accuracy of the target recognition model is greater than a first threshold value.
As an optional example, the adding unit includes: the first acquisition module is used for acquiring a plurality of types of original noise; a first adding module, configured to add the original noise of each type in the multiple types of original noise into the original voice sample to obtain multiple target voice samples.
As an optional example, the adding unit includes: the second acquisition module is used for acquiring target noise, wherein the target noise is noise obtained by combining a plurality of types of original noise; the first splitting module is used for splitting the original voice sample into M parts to obtain M first voice samples; a second adding module, configured to add a first noise to each of the first voice samples to obtain M target voice samples, where the first noise is a noise that is extracted from the target noise and has a length that is the same as that of the first voice sample.
As an optional example, the adding unit includes: a third acquisition module, configured to acquire multiple types of original noise; a dividing module, configured to divide the multiple types of original noise into a first target noise and a second target noise according to the decibel level of each original noise, where the decibel level of the first target noise is greater than a predetermined decibel threshold, and the decibel level of the second target noise is less than or equal to the predetermined decibel threshold; a second splitting module, configured to split the original voice sample into M parts to obtain M first voice samples; a third adding module, configured to add a second noise to each first voice sample to obtain M first target voice samples, where the second noise is a noise that is cut from one of the first target noises and has the same length as the first voice sample; a fourth adding module, configured to add a third noise to each first voice sample to obtain M second target voice samples, where the third noise is a noise that is cut from one of the second target noises and has the same length as the first voice sample; and a determining module, configured to determine the M first target voice samples and the M second target voice samples as the target voice samples, so as to obtain 2M target voice samples.
As an optional example, the apparatus further includes: a second obtaining unit, configured to obtain a target speech to be recognized after the target speech sample is used to train the original recognition model to obtain a target recognition model; an input unit configured to input the target speech into the target recognition model, wherein the target recognition model is configured to recognize a type or content of the target speech; and a third obtaining unit, configured to obtain a recognition result output by the target recognition model, where the recognition result includes a type or content of the target voice.
In the embodiment of the invention, an original voice sample is obtained, wherein the original voice sample is used for training an original recognition model; adding target noise into the original voice sample to obtain a target voice sample, wherein the target noise is noise of multiple types; the original recognition model is trained by using the original voice sample and the target voice sample to obtain a target recognition model, wherein the recognition accuracy of the target recognition model is greater than a first threshold value.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow diagram of an alternative model training method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an alternative model training apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, there is provided a model training method, optionally, as an optional implementation manner, as shown in fig. 1, the method includes:
s102, obtaining an original voice sample, wherein the original voice sample is used for training an original recognition model;
s104, adding target noise into the original voice sample to obtain a target voice sample, wherein the target noise is noise of multiple types;
s106, training an original recognition model by using the original voice sample and the target voice sample to obtain a target recognition model, wherein the recognition accuracy of the target recognition model is larger than a first threshold value.
Alternatively, the model training method may be applied to, but not limited to, a terminal capable of calculating data, such as a mobile phone, a tablet computer, a notebook computer, a PC, and the like, and the terminal may interact with a server through a network, which may include, but is not limited to, a wireless network or a wired network. Wherein, this wireless network includes: WIFI and other networks that enable wireless communication. Such wired networks may include, but are not limited to: wide area networks, metropolitan area networks, and local area networks. The server may include, but is not limited to, any hardware device capable of performing computations.
Alternatively, the present solution may be applied, but is not limited, to the process of training a speech recognition model. In the prior art, the number of good-quality speech samples available for training is small, so both the training efficiency and the recognition accuracy of the speech recognition model are low. In the model training method of this solution, after an original speech sample is obtained, multiple types of target noise are obtained and added into the original speech sample to obtain a target speech sample. The original recognition model is then trained with both the original speech sample and the target speech sample to obtain the target recognition model. Because adding noise to the original training samples increases the number of training samples, training the original recognition model on both the original and target samples improves its recognition accuracy.
Alternatively, the noise in the present scheme may be various types of noise. Different types of noise may have different signatures. Such as type 1, type 2, etc. There may be one or more of each type of noise.
Optionally, after the noise is acquired, the noise needs to be added to the original voice sample so as to acquire the target voice sample. There may be a plurality of methods of adding noise.
In a first approach, multiple types of original noise are acquired, and each type is added to the original speech sample, yielding one target speech sample per noise type. For example, the original speech sample may be a 1-minute segment of speech, while there are multiple types of noise, say three in total. Adding each noise type to the 1-minute speech separately yields three target speech samples. If a noise does not have the same duration as the original speech sample, it may be copied, stretched, truncated, or compressed when added: half a minute of noise may be copied and pasted to produce one minute of noise, or stretched to one minute, while two minutes of noise may be clipped or compressed to one minute.
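The first approach (one target sample per noise type, with copy/clip length matching) might look like the following sketch. The `match_length` helper, the 0.5 mixing gain, and the sample counts are illustrative assumptions; the patent does not specify a mixing formula.

```python
import numpy as np

def match_length(noise: np.ndarray, n: int) -> np.ndarray:
    """Copy (tile) the noise when it is shorter than the speech,
    truncate it when it is longer, so both have the same length."""
    if len(noise) < n:
        noise = np.tile(noise, -(-n // len(noise)))  # ceiling division
    return noise[:n]

rng = np.random.default_rng(0)
speech = rng.standard_normal(960)           # stands in for 1 minute of audio
noise_types = [rng.standard_normal(480),    # "half a minute": gets copied
               rng.standard_normal(1920),   # "two minutes": gets clipped
               rng.standard_normal(960)]    # already the right length

# One target speech sample per noise type.
targets = [speech + 0.5 * match_length(n, len(speech)) for n in noise_types]
```

Stretching or compressing the noise (resampling) would also satisfy the text; tiling and truncation are simply the easiest to show.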
In a second approach, a target noise is acquired by combining multiple types of original noise; the original speech sample is split into M parts to obtain M first speech samples; and a first noise is added to each first speech sample to obtain M target speech samples, where the first noise is a segment cut from the target noise with the same length as the first speech sample. For example, suppose there are three 1-minute noises. They may be mixed into a single 1-minute noise, or spliced into a single 3-minute noise, to obtain the target noise. Then, a 1-minute original speech sample is divided into 3 parts of 20 seconds each, and a 20-second segment randomly cut from the target noise is added to each 20-second part.
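The second approach can be sketched as below. Splicing (concatenation) is chosen here as the combining step, and the function name, mixing gain, and toy lengths are assumptions for illustration only.

```python
import numpy as np

def augment_by_crops(speech: np.ndarray, noise_types: list, m: int, rng) -> list:
    """Splice the noise types into one target noise, split the speech into
    M parts, and mix a random equal-length crop of the target noise into
    each part."""
    combined = np.concatenate(noise_types)       # the "target noise"
    parts = np.array_split(speech, m)            # M first speech samples
    out = []
    for part in parts:
        start = rng.integers(0, len(combined) - len(part) + 1)
        crop = combined[start:start + len(part)]  # same length as the part
        out.append(part + 0.5 * crop)
    return out                                    # M target speech samples

rng = np.random.default_rng(1)
speech = rng.standard_normal(60)                  # stands in for 1 minute
noise_types = [rng.standard_normal(60) for _ in range(3)]
targets = augment_by_crops(speech, noise_types, m=3, rng=rng)
```

Because the crop start is random, repeating the call produces different noisy versions of the same speech, which is how method 2 below multiplies the data.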
In a third approach, multiple types of original noise are acquired and divided into first target noise and second target noise according to the decibel level of each noise: the first target noise is louder than a predetermined decibel threshold, and the second target noise is at or below it. The original speech sample is split into M parts to obtain M first speech samples. A second noise is added to each first speech sample to obtain M first target speech samples, where the second noise is a segment cut from one of the first target noises with the same length as the first speech sample; a third noise is added to each first speech sample to obtain M second target speech samples, where the third noise is a segment cut from one of the second target noises with the same length as the first speech sample. The M first target speech samples and the M second target speech samples together give 2M target speech samples. For example, three 1-minute noises may be divided by decibel level into two high-decibel noises and one low-decibel noise. An original speech sample is then split into 4 parts of 15 seconds each, and each 15-second part receives both a 15-second second noise, randomly cut from one of the two high-decibel noises, and a 15-second third noise, randomly cut from the low-decibel noise, so that two noisy speech samples are produced for each 15-second part.
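The third approach (a strong/weak split by decibel level, yielding 2M target samples) might be sketched as follows. The RMS-based decibel estimate, the threshold of 0, and the unit mixing gain are assumptions; the patent only says the split uses a decibel threshold.

```python
import numpy as np

def rms_db(x: np.ndarray) -> float:
    """Rough decibel level of a signal from its RMS amplitude."""
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def augment_strong_weak(speech, noises, m, db_threshold, rng):
    """Split noises into strong/weak by decibel, split the speech into M
    parts, and give each part one strong-noise and one weak-noise copy."""
    strong = [n for n in noises if rms_db(n) > db_threshold]   # first target noise
    weak = [n for n in noises if rms_db(n) <= db_threshold]    # second target noise
    parts = np.array_split(speech, m)

    def crop(pool, length):
        n = pool[rng.integers(len(pool))]          # pick one noise at random
        s = rng.integers(0, len(n) - length + 1)   # random equal-length cut
        return n[s:s + length]

    first = [p + crop(strong, len(p)) for p in parts]   # M first target samples
    second = [p + crop(weak, len(p)) for p in parts]    # M second target samples
    return first + second                               # 2M target samples

rng = np.random.default_rng(2)
speech = rng.standard_normal(60)
noises = [2.0 * rng.standard_normal(60), 2.0 * rng.standard_normal(60),
          0.1 * rng.standard_normal(60)]                # two loud, one quiet
targets = augment_strong_weak(speech, noises, m=4, db_threshold=0.0, rng=rng)
```

With m=4, this mirrors the running example: each 15-second part is mixed once with a high-decibel crop and once with a low-decibel crop, so the output holds 2M = 8 noisy samples.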
Using at least one of the methods described above, the target speech sample may be obtained after the original speech sample is obtained. The original recognition model is trained by using the target voice sample, so that the training efficiency of the original recognition model can be improved.
The following exemplifies a method of obtaining a target sample voice: step 1, recording voice of a user in a real scene, for example, when a waiter wears a recording device to serve a customer according to a conversation, the recording device collects audio data. This part of the data needs to be manually annotated. This portion of data is labeled "true data 0 (original speech sample)". And 2, preparing the marked audio data in a quiet environment, wherein the audio data can be purchased or obtained free, and the audio content is not related to a specific scene and marked as data 0. And 3, a noise data collecting module, namely collecting different types of noise data in a real scene, such as a restaurant, by the sound recording equipment, and marking the noise data as: "type 1", "type 2", …, "type n".
Method 1: the noise data of types 1, 2, ..., n from step 3 are mixed into data 0, respectively, to form data 1, 2, ..., n. True data 0 together with data 0, 1, 2, ..., n constitute the final training data.
Method 2: the noise data of types 1, 2, ..., n from step 3 are combined to obtain noise data M. Data 0 from step 2 is randomly divided into m parts, where m is large enough; for each of the m parts, a noise segment of equal duration is randomly cut from noise data M and mixed in, and after all m parts have been processed, data 1 is obtained. Repeating this process n-1 times yields data 2, 3, ..., n. True data 0 together with data 0, 1, ..., n constitute the final training data.
Method 3: the noise data from step 3 are divided into strong background noise (data A) and weak background noise (data B), with the division criterion set by decibel level: noise above an empirical threshold counts as strong background noise, and noise below it as weak background noise. Data 0 from step 2 is randomly divided into m parts, where m is large enough. For each of the m parts, noise data of equal duration is randomly selected from the strong background noise data A and mixed in, forming data A1, A2, ..., Am; similarly, noise of equal duration is randomly selected from the weak background noise data B and mixed into the m parts, forming data B1, B2, ..., Bm. Adding "true data 0" to the two sets of mixed data gives the final training data A1, A2, ..., Am, B1, B2, ..., Bm.
The present solution may be applied, but is not limited to, any process of training a speech recognition model.
For example, taking recognizing the content of speech and converting it into text as an example: after the original speech is acquired, noise is added to it to obtain target speech, and the original recognition model is trained with both to obtain the target recognition model. The speech to be converted is then input to the target recognition model to realize speech-to-text conversion.
Or, taking recognizing which dialect a speech belongs to as an example: after the original speech is obtained, noise is added to it to obtain target speech, and the original recognition model is trained with both to obtain the target recognition model. The speech to be recognized is then input, and the target recognition model outputs its dialect.
Or, taking recognizing speech emotion as an example, after obtaining the original speech, adding noise to the original speech to obtain the target speech, and training the original recognition model by using the original speech and the target speech to obtain the target recognition model. And then inputting the voice to be recognized into the target recognition model, and outputting the corresponding emotion by the target recognition model.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiment of the invention, a model training device for implementing the model training method is also provided. As shown in fig. 2, the apparatus includes:
(1) a first obtaining unit 202, configured to obtain an original speech sample, where the original speech sample is used to train an original recognition model;
(2) an adding unit 204, configured to add a target noise to an original voice sample to obtain a target voice sample, where the target noise is a plurality of types of noise;
(3) a training unit 206, configured to train an original recognition model using the original speech sample and the target speech sample, to obtain a target recognition model, where a recognition accuracy of the target recognition model is greater than a first threshold.
Alternatively, the present solution may be applied, but not limited to, in a process of training a speech recognition model. Alternatively, in the prior art, in the process of training the speech recognition model, the number of the good-quality speech samples is small, so that the efficiency of training the speech recognition model is low. The recognition accuracy of the model is low. According to the model training method, after an original voice sample is obtained, a plurality of types of target noises are obtained at first, and then the target noises are added into the original voice sample. And obtaining a target voice sample. And then, training an original recognition model by using the original voice sample and the target voice sample to obtain a target recognition model. In the process, noise is added to the original training samples, so that the number of the training samples is increased. And then, the original recognition model is trained by using the original training sample and the target training sample, so that the recognition accuracy of the original recognition model can be improved.
Alternatively, the noise in the present scheme may be various types of noise. Different types of noise may have different signatures. Such as type 1, type 2, etc. There may be one or more of each type of noise.
Optionally, after the noise is acquired, the noise needs to be added to the original voice sample so as to acquire the target voice sample. There may be a plurality of methods of adding noise.
Firstly, acquiring a plurality of types of original noises; and adding each type of original noise in the multiple types of original noise into the original voice sample to obtain multiple target voice samples. For example, the original speech sample may be a1 minute period of speech, while the noise is of multiple types, such as a total of three types of noise. The 1 minute voice may be separately added with each type of noise resulting in three target voice samples. When added, the noise may be copied or stretched or truncated or compressed if it is not the same length of time as the original speech sample. For example, half a minute of noise may be copied and pasted, resulting in a one minute noise, or stretched to a one minute noise. While a two minute noise may be clipped to a one minute noise or reduced to a one minute noise.
Acquiring target noise, wherein the target noise is noise obtained by combining multiple types of original noise; splitting an original voice sample into M parts to obtain M first voice samples; and adding first noise to each first voice sample to obtain M target voice samples, wherein the first noise is noise which is intercepted from the target noise and has the same length as that of the first voice sample. For example, for noise, there are three 1 minute noises. The three one-minute noises are combined into one-minute noise, or spliced into one three-minute noise, so as to obtain the target noise. Then, after the original speech sample is taken, such as a one minute voice, the voice is divided into 3 parts, each for 20 seconds. A 20 second segment of noise randomly truncated from the target noise is added to each 20 second voice.
Thirdly, acquiring multiple types of original noise; dividing multiple types of original noise into first target noise and second target noise according to the decibel of each original noise, wherein the decibel of the first target noise is greater than a preset decibel, and the decibel of the second target noise is less than or equal to the target decibel; splitting an original voice sample into M parts to obtain M first voice samples; adding second noise to each first voice sample to obtain M first target voice samples, wherein the second noise is noise which is intercepted from one noise in the first target noise and has the same length as that of the first voice samples; adding third noise to each first voice sample to obtain M second target voice samples, wherein the second noise is noise which is intercepted from one noise in the second target noise and has the same length as that of the first voice samples; and determining the M first target voice samples and the M second target voice samples as target voice samples to obtain 2M target voice samples. For example, three 1 minute noises can be divided into two 1 minute noises with high decibels and one 1 minute noises with low decibels according to decibel magnitude. Then, an original voice sample is obtained, the original voice sample is split into 4 15 seconds of voices, and for each 15 seconds of voices, a 15 second first noise and a 15 second noise are added. A first noise of 15 seconds may randomly intercept 15 seconds from one of two 1 minute noises of high decibels, and a second noise of 15 seconds may randomly intercept 15 seconds from a1 minute noise of low decibels. So that two noisy speech sounds are captured for each 15 seconds of speech.
Using at least one of the methods described above, the target speech sample may be obtained after the original speech sample is obtained. The original recognition model is trained by using the target voice sample, so that the training efficiency of the original recognition model can be improved.
The following exemplifies a method of obtaining a target sample voice: step 1, recording voice of a user in a real scene, for example, when a waiter wears a recording device to serve a customer according to a conversation, the recording device collects audio data. This part of the data needs to be manually annotated. This portion of data is labeled "true data 0 (original speech sample)". And 2, preparing the marked audio data in a quiet environment, wherein the audio data can be purchased or obtained free, and the audio content is not related to a specific scene and marked as data 0. And 3, a noise data collecting module, namely collecting different types of noise data in a real scene, such as a restaurant, by the sound recording equipment, and marking the noise data as: "type 1", "type 2", …, "type n".
The method comprises the following steps: noise data of type 1,2, …, n of step 3 are "blended" into data 0, respectively, to form data 1,2, …, n. True data 0 and data 0,1,2,3, …, n being the final training data.
The method 2 comprises the following steps: and combining the noise data of the type 1,2, …, n in the step 3 to obtain noise data M. And 2, randomly dividing the data 0 in the step 2 into M parts of data, wherein M is large enough, randomly intercepting noises with equal duration from the noise data M for each part of data in the M parts, mixing the noises into the data, and obtaining data 1 after the M parts of data are traversed. The above process was repeated n-1 times to obtain data 2,3, … n. True data 0 and data 0,1, …. n are the final training data.
Method 3: the noise data types from step 3 are divided into data A (strong background noise) and data B (weak background noise), with the division standard set by noise decibel level: noise whose decibel level exceeds an empirical threshold is strong background noise, and noise below it is weak background noise. Data 0 from step 2 is randomly divided into m parts, where m is large enough; for each of the m parts, noise data of equal duration is randomly selected from strong background noise data A and mixed in, forming data A1, A2, …, Am. Similarly, noise data of equal duration is randomly selected from weak background noise data B and mixed into the m parts, forming data B1, B2, …, Bm. "True data 0" is added to the two sets of mixed data to form the final training data A1, A2, …, Am, B1, B2, …, Bm.
The present solution may be applied to, but is not limited to, any process of training a speech recognition model.
For example, taking recognizing the content of speech and converting it into text: after the original speech is acquired, noise is added to it to obtain the target speech, and the original recognition model is trained with the original speech and the target speech to obtain the target recognition model. The speech to be converted is then input to realize the speech-to-text conversion.
Or, taking recognizing which dialect a speech's accent belongs to: after the original speech is obtained, noise is added to it to obtain the target speech, and the original recognition model is trained with the original speech and the target speech to obtain the target recognition model. The speech to be recognized is then input, and the target recognition model outputs its dialect.
Or, taking recognizing speech emotion as an example: after the original speech is obtained, noise is added to it to obtain the target speech, and the original recognition model is trained with the original speech and the target speech to obtain the target recognition model. The speech to be recognized is then input to the target recognition model, which outputs the corresponding emotion.
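Across all three applications, the common point when assembling training data is that each noisy copy inherits the label (transcript, dialect, or emotion) of the clean sample it was derived from. A minimal sketch of that assembly step, where the function names and the Gaussian stand-in noise are illustrative assumptions rather than the patent's method:

```python
import numpy as np

def build_training_set(originals, labels, augment_fn, copies=2, seed=0):
    """Pair each original voice sample with its label, then append noisy
    copies produced by `augment_fn`; every noisy copy keeps the clean
    sample's label, and the model trains on the union of both."""
    rng = np.random.default_rng(seed)
    X, y = list(originals), list(labels)
    for sample, label in zip(originals, labels):
        for _ in range(copies):
            X.append(augment_fn(sample, rng))
            y.append(label)  # noisy copy keeps the clean sample's label
    return X, y
```

For instance, `augment_fn` could be a closure that adds a cropped real-noise segment (as in the methods above) or, as a placeholder, Gaussian noise.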
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method of model training, comprising:
obtaining an original voice sample, wherein the original voice sample is used for training an original recognition model;
adding target noise into the original voice sample to obtain a target voice sample, wherein the target noise is noise of multiple types;
training the original recognition model by using the original voice sample and the target voice sample to obtain a target recognition model, wherein the recognition accuracy of the target recognition model is greater than a first threshold value.
2. The method of claim 1, wherein the adding target noise into the original voice sample to obtain a target voice sample comprises:
acquiring a plurality of types of original noise;
and adding the original noise of each type in the multiple types of original noise into the original voice sample to obtain multiple target voice samples.
3. The method of claim 1, wherein the adding target noise into the original voice sample to obtain a target voice sample comprises:
acquiring target noise, wherein the target noise is noise obtained by combining multiple types of original noise;
splitting the original voice sample into M parts to obtain M first voice samples;
and adding first noise to each first voice sample to obtain M target voice samples, wherein the first noise is noise which is intercepted from the target noise and has the same length as that of the first voice sample.
4. The method of claim 1, wherein the adding target noise into the original voice sample to obtain a target voice sample comprises:
acquiring a plurality of types of original noise;
dividing the multiple types of original noise into a first target noise and a second target noise according to the decibel of each original noise, wherein the decibel of the first target noise is greater than a predetermined decibel, and the decibel of the second target noise is less than or equal to the predetermined decibel;
splitting the original voice sample into M parts to obtain M first voice samples;
adding second noise to each first voice sample to obtain M first target voice samples, wherein the second noise is noise which is cut from one of the first target noises and has the same length as that of the first voice sample;
adding third noise to each first voice sample to obtain M second target voice samples, wherein the third noise is noise which is cut from one of the second target noises and has the same length as that of the first voice sample;
and determining the M first target voice samples and the M second target voice samples as the target voice samples to obtain 2M target voice samples.
5. The method of any one of claims 1 to 4, wherein after the training the original recognition model using the original speech sample and the target speech sample to obtain a target recognition model, the method further comprises:
acquiring target voice to be recognized;
inputting the target voice into the target recognition model, wherein the target recognition model is used for recognizing the type or the content of the target voice;
and acquiring a recognition result output by the target recognition model, wherein the recognition result comprises the type or the content of the target voice.
6. A model training apparatus, comprising:
the system comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring an original voice sample, and the original voice sample is used for training an original recognition model;
the adding unit is used for adding target noise into the original voice sample to obtain a target voice sample, wherein the target noise is a plurality of types of noise;
and the training unit is used for training the original recognition model by using the original voice sample and the target voice sample to obtain a target recognition model, wherein the recognition accuracy of the target recognition model is greater than a first threshold value.
7. The apparatus according to claim 6, wherein the adding unit comprises:
the first acquisition module is used for acquiring a plurality of types of original noise;
a first adding module, configured to add, to the original voice samples, the original noise of each type in the multiple types of original noise to obtain multiple target voice samples.
8. The apparatus according to claim 6, wherein the adding unit comprises:
the second acquisition module is used for acquiring target noise, wherein the target noise is noise obtained by combining multiple types of original noise;
the first splitting module is used for splitting the original voice sample into M parts to obtain M first voice samples;
and a second adding module, configured to add a first noise to each of the first voice samples to obtain M target voice samples, where the first noise is a noise that is intercepted from the target noise and has a length that is the same as that of the first voice sample.
9. The apparatus according to claim 6, wherein the adding unit comprises:
the third acquisition module is used for acquiring a plurality of types of original noise;
a dividing module, configured to divide the multiple types of original noise into a first target noise and a second target noise according to a decibel of each original noise, where a decibel of the first target noise is greater than a predetermined decibel, and a decibel of the second target noise is less than or equal to the predetermined decibel;
the second splitting module is used for splitting the original voice sample into M parts to obtain M first voice samples;
a third adding module, configured to add a second noise to each of the first voice samples to obtain M first target voice samples, where the second noise is a noise that is intercepted from one of the first target noises and has a length that is the same as that of the first voice sample;
a fourth adding module, configured to add a third noise to each of the first voice samples to obtain M second target voice samples, where the third noise is a noise that is intercepted from one of the second target noises and has a length that is the same as that of the first voice sample;
a determining module, configured to determine the M first target voice samples and the M second target voice samples as the target voice samples, so as to obtain 2M target voice samples.
10. The apparatus of any one of claims 6 to 9, further comprising:
a second obtaining unit, configured to obtain target voice to be recognized after the original recognition model is trained by using the original voice sample and the target voice sample to obtain the target recognition model;
an input unit, configured to input the target speech into the target recognition model, wherein the target recognition model is used to recognize a type or content of the target speech;
and the third acquisition unit is used for acquiring a recognition result output by the target recognition model, wherein the recognition result comprises the type or the content of the target voice.
CN201911304920.1A 2019-12-17 2019-12-17 Model training method and device Pending CN111210810A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911304920.1A CN111210810A (en) 2019-12-17 2019-12-17 Model training method and device


Publications (1)

Publication Number Publication Date
CN111210810A true CN111210810A (en) 2020-05-29

Family

ID=70787230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911304920.1A Pending CN111210810A (en) 2019-12-17 2019-12-17 Model training method and device

Country Status (1)

Country Link
CN (1) CN111210810A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6876966B1 (en) * 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction
CN103514878A (en) * 2012-06-27 2014-01-15 北京百度网讯科技有限公司 Acoustic modeling method and device, and speech recognition method and device
CN109427340A (en) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 A kind of sound enhancement method, device and electronic equipment
CN109616100A (en) * 2019-01-03 2019-04-12 百度在线网络技术(北京)有限公司 The generation method and its device of speech recognition modeling
US20190287526A1 (en) * 2016-11-10 2019-09-19 Nuance Communications, Inc. Techniques for language independent wake-up word detection
CN110390928A (en) * 2019-08-07 2019-10-29 广州多益网络股份有限公司 It is a kind of to open up the speech synthesis model training method and system for increasing corpus automatically


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631427A (en) * 2023-07-24 2023-08-22 美智纵横科技有限责任公司 Training method of noise reduction model, noise reduction processing method, device and chip
CN116631427B (en) * 2023-07-24 2023-09-29 美智纵横科技有限责任公司 Training method of noise reduction model, noise reduction processing method, device and chip


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200529