CN110379414B - Acoustic model enhancement training method and device, readable storage medium and computing equipment - Google Patents


Info

Publication number
CN110379414B
CN110379414B (application CN201910661856.6A)
Authority
CN
China
Prior art keywords
spectrogram
characteristic
masking
acoustic model
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910661856.6A
Other languages
Chinese (zh)
Other versions
CN110379414A (en)
Inventor
Zhang Binbin (张彬彬)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask Suzhou Information Technology Co ltd
Original Assignee
Go Out And Ask Suzhou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask Suzhou Information Technology Co ltd filed Critical Go Out And Ask Suzhou Information Technology Co ltd
Priority to CN201910661856.6A
Publication of CN110379414A
Application granted
Publication of CN110379414B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present disclosure provide an acoustic model enhancement training method and apparatus, a readable storage medium, and a computing device, which generate augmented training samples and train an acoustic model based on feature masking. The method comprises the following steps: acquiring original speech data; generating a spectrogram from the acoustic features of the original speech data; performing feature masking on the spectrogram at least once in the time domain and/or the frequency domain to obtain at least one feature-masked spectrogram; and performing enhancement training on the acoustic model according to the at least one feature-masked spectrogram.

Description

Acoustic model enhancement training method and device, readable storage medium and computing equipment
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to an acoustic model enhancement training method, an acoustic model enhancement training device, a readable storage medium, and a computing device.
Background
Deep-learning-based speech recognition has become the mainstream technology in the industry and performs well in practical applications. However, such systems still exhibit the following shortcomings in some cases:
1. recognition is accurate in quiet, interference-free conditions, but performance degrades sharply in the presence of noise, reverberation, distortion, and the like;
2. deep-learning-based speech recognition relies on large amounts of training data; when data is scarce, model performance suffers.
To address these problems, the industry mainly relies on data enhancement, augmenting the training data by adding noise, changing speed, adding reverberation, and the like. This can improve model robustness to a certain extent.
The conventional data-enhancement-based acoustic model scheme is shown in fig. 1: a data enhancement module randomly selects samples from a noise/reverberation database and, via a data enhancement algorithm, adds noise and reverberation to the original speech. For a single original utterance, performing this operation N times generates N perturbed variants, so the overall training set grows to N times its original size. Features are then extracted from the enhanced data and fed into the acoustic model training process.
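For illustration only, the additive-noise portion of this conventional scheme might look like the following sketch; the function name, the SNR range, and the numpy-based mixing are assumptions made for exposition, not details taken from the patent.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a recorded noise signal into clean speech at a given SNR (in dB)."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10  # avoid division by zero
    # Scale the noise so the mixture reaches the requested signal-to-noise ratio.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Performing this N times per utterance (with a noise clip and SNR drawn at
# random each time) grows the training set to N times its original size.
```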
However, this approach has the following drawbacks:
1. adding noise and reverberation depends on a noise/reverberation database, and optimizing for a specific scenario requires recording the relevant noise and reverberation of that scenario, which is costly;
2. noise and reverberation are random signals, so not every condition can be recorded;
3. both reverberation and noise act on the original speech signal, so all enhanced samples of the training data must be generated, and their features extracted, before acoustic model training can begin, incurring high time and space costs.
Disclosure of Invention
To this end, the present disclosure provides an acoustic model enhancement training method, apparatus, readable storage medium and computing device in an effort to solve or at least mitigate at least one of the problems identified above.
According to an aspect of the embodiments of the present disclosure, there is provided an acoustic model enhancement training method, including:
acquiring original speech data;
generating a spectrogram from the acoustic features of the original speech data;
performing feature masking on the spectrogram at least once in the time domain and/or the frequency domain to obtain at least one feature-masked spectrogram;
and performing enhancement training on the acoustic model according to the at least one feature-masked spectrogram.
Optionally, performing feature masking on the spectrogram in the time domain comprises:
selecting at least one starting frame, and determining the number of frames to be masked;
starting from the at least one starting frame, setting to zero all energy values within that number of frames.
Optionally, the number of frames to be masked is less than a preset first threshold.
Optionally, performing feature masking on the spectrogram in the frequency domain comprises:
selecting at least one starting frequency sub-band, and determining the number of frequency sub-bands to be masked;
starting from the at least one starting frequency sub-band, setting to zero all energy values within that number of frequency sub-bands.
Optionally, the number of frequency sub-bands to be masked is less than a preset second threshold.
Optionally, the method further comprises:
ending the enhancement training when it is determined that the acoustic model training has converged.
Optionally, the acoustic features comprise:
FBank features.
According to still another aspect of an embodiment of the present disclosure, there is provided an acoustic model enhancement training apparatus including:
a data acquisition unit, configured to acquire original speech data;
a feature extraction unit, configured to generate a spectrogram from the acoustic features of the original speech data;
a feature masking unit, configured to perform feature masking on the spectrogram at least once in the time domain and/or the frequency domain to obtain at least one feature-masked spectrogram;
and an enhancement training unit, configured to perform enhancement training on the acoustic model according to the at least one feature-masked spectrogram.
Optionally, when performing feature masking on the spectrogram in the time domain, the feature masking unit is specifically configured to:
select at least one starting frame, and determine the number of frames to be masked;
starting from the at least one starting frame, set to zero all energy values within that number of frames.
Optionally, the number of frames to be masked is less than a preset first threshold.
Optionally, when performing feature masking on the spectrogram in the frequency domain, the feature masking unit is specifically configured to:
select at least one starting frequency sub-band, and determine the number of frequency sub-bands to be masked;
starting from the at least one starting frequency sub-band, set to zero all energy values within that number of frequency sub-bands.
Optionally, the number of frequency sub-bands to be masked is less than a preset second threshold.
Optionally, the enhancement training unit is further configured to:
end the enhancement training when it is determined that the acoustic model training has converged.
Optionally, the acoustic features comprise:
FBank features.
According to yet another aspect of an embodiment of the present disclosure, there is provided a readable storage medium having executable instructions thereon, which when executed, cause a computer to perform the operations included in the above-mentioned method.
According to yet another aspect of embodiments of the present disclosure, there is provided a computing device comprising: a processor; and a memory storing executable instructions that, when executed, cause the processor to perform operations included in the above-described methods.
According to the technical solution provided by the embodiments of the present disclosure, the acoustic model is enhancement-trained with feature-masked speech data: no noise data needs to be recorded, a large number of training samples can be generated quickly, the time and space costs of acoustic model training are reduced, and the enhancement training improves the robustness of the acoustic model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram of a prior art acoustic model enhancement training method;
FIG. 2 is a block diagram of an exemplary terminal device;
FIG. 3 is a flow diagram of an acoustic model enhancement training method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the effect of feature masking processing according to an embodiment of the present disclosure;
FIG. 5 is yet another flow diagram of an acoustic model enhancement training method according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an acoustic model enhancement training apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 2 is a block diagram of an example computing device 100 arranged to implement an acoustic model enhancement training method according to the present disclosure. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor. The processor 104 may include one or more levels of cache, such as a level-one cache 110 and a level-two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the programs 122 may be arranged to be executed by the one or more processors 104 on the operating system using the program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display terminal or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated wired network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device such as a cellular telephone, a personal digital assistant (PDA), a personal media player, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations.
Wherein the one or more programs 122 of the computing device 100 include instructions for performing an acoustic model enhancement training method according to the present disclosure.
Fig. 3 illustrates a flow diagram of an acoustic model enhancement training method 200 according to one embodiment of the present disclosure, the acoustic model enhancement training method 200 starting at step S210.
Step S210, acquiring original speech data.
According to the embodiment of the present disclosure, original speech data refers to speech data that has not undergone any data enhancement processing, such as speech recorded in a studio or captured by a mobile phone, a voice recorder, or the like.
Subsequently, in step S220, a spectrogram is generated from the acoustic features of the original speech data. According to an embodiment of the present disclosure, the acoustic feature may be the FBank feature, one of the most commonly used features in speech recognition; it processes and extracts audio in a manner similar to the human ear, exploiting the ear's nonlinear response to the sound spectrum.
The abscissa of the spectrogram is time, the ordinate is frequency, and the value at each coordinate point is the speech energy. Since three-dimensional information is rendered on a two-dimensional plane, the magnitude of the energy is represented by color; the darker the color, the stronger the speech energy at that point.
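As a hedged illustration of steps S210-S220, the sketch below computes a log-mel (FBank-style) spectrogram; it assumes the librosa library and common parameter choices (16 kHz audio, 25 ms window, 10 ms hop, 40 mel bands) that the disclosure itself does not mandate.

```python
import numpy as np
import librosa

def fbank_spectrogram(wav_path: str, n_mels: int = 40) -> np.ndarray:
    """Return a (frames x mel-bands) matrix of log-mel energies for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400, hop_length=160,  # 25 ms window, 10 ms hop at 16 kHz
        n_mels=n_mels)
    # Log compression mimics the ear's nonlinear response to spectral energy.
    return librosa.power_to_db(mel).T  # transpose so rows are time frames
```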
Subsequently, in step S230, feature masking is performed on the spectrogram at least once in the time domain and/or the frequency domain, yielding at least one feature-masked spectrogram.
Feature masking means that, after the FBank features have been extracted, one or more small segments of features are randomly selected in the time domain and the frequency domain and set to zero.
By repeatedly selecting one or more small feature segments at random in the time and frequency domains and zeroing them, many different spectrograms can be generated, forming a training sample set.
Subsequently, in step S240, enhancement training is performed on the acoustic model according to the at least one feature-masked spectrogram.
According to the embodiment of the present disclosure, the acoustic model is a deep learning model, and it is enhancement-trained on the training sample set formed in step S230. The enhancement-trained acoustic model achieves a lower character error rate (CER) on speech recognition tasks, with recognition performance comparable to the conventional data-enhancement-based scheme.
Further, performing feature masking on the spectrogram in the time domain comprises:
selecting at least one starting frame, and determining the number of frames to be masked;
starting from the at least one starting frame, setting to zero all energy values within that number of frames.
For example, a starting point T and a length L1 are randomly selected on the time axis, and all features in the time range [T, T+L1] are set to 0. L1 should satisfy a length constraint, e.g., no more than 20 frames, where one frame is 10 ms.
As another example, starting points T1, T2, ... and lengths L11, L12, ... are randomly selected on the time axis, and all features in the time ranges [T1, T1+L11], [T2, T2+L12], ... are set to 0. The lengths L11, L12, ... should satisfy a length constraint that depends on the number of starting points.
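A minimal numpy sketch of the time-domain masking just described follows; the 20-frame cap comes from the example above, while the function name, signature, and defaults are assumptions made for illustration.

```python
import numpy as np

def time_mask(spec: np.ndarray, num_masks: int = 1, max_len: int = 20) -> np.ndarray:
    """Zero out num_masks random spans of frames (rows) in a (frames x bands) spectrogram."""
    out = spec.copy()
    n_frames = out.shape[0]
    for _ in range(num_masks):
        length = np.random.randint(1, max_len + 1)               # L1: span length, at most 20 frames
        start = np.random.randint(0, max(1, n_frames - length))  # T: random starting frame
        out[start:start + length, :] = 0.0                       # zero energies in [T, T+L1)
    return out
```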
Further, performing feature masking on the spectrogram in the frequency domain comprises:
selecting at least one starting frequency sub-band, and determining the number of frequency sub-bands to be masked;
starting from the at least one starting frequency sub-band, setting to zero all energy values within that number of frequency sub-bands.
For example, a starting point F and a length L2 are randomly selected on the frequency axis, and all features in the frequency range [F, F+L2] are set to 0. L2 should satisfy a length constraint, e.g., no more than 10 frequency sub-bands. In speech processing, the 0-8 kHz band can be divided into 40 frequency sub-bands of unequal widths, which meets the requirements of speech processing.
As another example, starting points F1, F2, ... and lengths L21, L22, ... are randomly selected on the frequency axis, and all features in the frequency ranges [F1, F1+L21], [F2, F2+L22], ... are set to 0. The lengths L21, L22, ... should satisfy a length constraint that depends on the number of starting points.
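The frequency-domain counterpart, again only a sketch under the same assumptions, zeroes random spans of sub-band columns, with the 10-sub-band cap taken from the example above.

```python
import numpy as np

def freq_mask(spec: np.ndarray, num_masks: int = 1, max_len: int = 10) -> np.ndarray:
    """Zero out num_masks random spans of frequency sub-bands (columns)."""
    out = spec.copy()
    n_bands = out.shape[1]
    for _ in range(num_masks):
        length = np.random.randint(1, max_len + 1)              # L2: span width, at most 10 sub-bands
        start = np.random.randint(0, max(1, n_bands - length))  # F: random starting sub-band
        out[:, start:start + length] = 0.0                      # zero energies in [F, F+L2)
    return out
```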
Fig. 4 shows the original and masked features of a piece of audio, where lighter colors denote higher energy and darker colors lower energy; the horizontal and vertical bars in fig. 4 mark the ranges in which all features have been set to 0.
The feature-masking-based acoustic model scheme is shown in fig. 5: features are extracted directly from the original speech, and in each training iteration the features of each sample utterance are dynamically masked and then applied directly to model training. This process repeats until training converges. Thus, if the data is iterated over N times, N randomly varied versions of the original data are generated on the fly.
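To make the per-iteration dynamic masking concrete, here is a hedged training-loop skeleton that reuses the hypothetical helpers sketched above; `wav_paths`, `labels`, `model.train_step`, and `converged` are placeholders standing in for whatever data pipeline and acoustic model are actually used.

```python
# Features are extracted once; masks are re-drawn on every pass, so after N
# epochs each utterance has contributed N different randomly masked variants.
features = [fbank_spectrogram(p) for p in wav_paths]  # hypothetical helper above

for epoch in range(max_epochs):
    for spec, label in zip(features, labels):
        masked = freq_mask(time_mask(spec))           # fresh random masks each iteration
        loss = model.train_step(masked, label)        # placeholder acoustic-model API
    if converged(loss):                               # stop once training converges
        break
```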
In summary, the present disclosure replaces the conventional approach of pre-recording noise: enhanced training samples are obtained by determining suitable parameters and, based on them, randomly selecting small feature segments in the time and frequency domains for feature masking.
Referring to fig. 6, an embodiment of the present disclosure provides an acoustic model enhancement training apparatus, including:
a data acquisition unit 310, configured to acquire original speech data;
a feature extraction unit 320, configured to generate a spectrogram from the acoustic features of the original speech data;
a feature masking unit 330, configured to perform feature masking on the spectrogram at least once in the time domain and/or the frequency domain to obtain at least one feature-masked spectrogram;
and an enhancement training unit 340, configured to perform enhancement training on the acoustic model according to the at least one feature-masked spectrogram.
Optionally, when performing feature masking on the spectrogram in the time domain, the feature masking unit 330 is specifically configured to:
select at least one starting frame, and determine the number of frames to be masked;
starting from the at least one starting frame, set to zero all energy values within that number of frames.
Optionally, the number of frames to be masked is less than a preset first threshold.
Optionally, when performing feature masking on the spectrogram in the frequency domain, the feature masking unit 330 is specifically configured to:
select at least one starting frequency sub-band, and determine the number of frequency sub-bands to be masked;
starting from the at least one starting frequency sub-band, set to zero all energy values within that number of frequency sub-bands.
Optionally, the number of frequency sub-bands to be masked is less than a preset second threshold.
Optionally, the enhancement training unit 340 is further configured to:
end the enhancement training when it is determined that the acoustic model training has converged.
Optionally, the acoustic features comprise:
FBank features.
For specific definition of the acoustic model enhancement training apparatus, reference may be made to the above definition of the acoustic model enhancement training method, which is not described herein again.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present disclosure is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.

Claims (7)

1. An acoustic model enhancement training method, comprising:
acquiring original speech data;
generating a spectrogram from the acoustic features of the original speech data;
performing feature masking on the spectrogram at least once in the time domain and/or the frequency domain to obtain at least one feature-masked spectrogram; wherein performing feature masking on the spectrogram in the time domain comprises: selecting at least one starting frame, and determining the number of frames to be masked; starting from the at least one starting frame, setting to zero all energy values within that number of frames; and performing feature masking on the spectrogram in the frequency domain comprises: selecting at least one starting frequency sub-band, and determining the number of frequency sub-bands to be masked; starting from the at least one starting frequency sub-band, setting to zero all energy values within that number of frequency sub-bands;
performing enhancement training on the acoustic model according to the at least one feature-masked spectrogram;
wherein, in each iteration of the enhancement training, feature masking is dynamically applied to the spectrogram until the acoustic model training converges.
2. The method of claim 1, wherein the number of frames to be masked is less than a preset first threshold.
3. The method of claim 1, wherein the number of frequency sub-bands to be masked is less than a preset second threshold.
4. The method of any one of claims 1-3, wherein the acoustic features comprise:
FBank features.
5. An acoustic model enhancement training apparatus, comprising:
a data acquisition unit, configured to acquire original speech data;
a feature extraction unit, configured to generate a spectrogram from the acoustic features of the original speech data;
a feature masking unit, configured to perform feature masking on the spectrogram at least once in the time domain and/or the frequency domain to obtain at least one feature-masked spectrogram; wherein performing feature masking on the spectrogram in the time domain comprises: selecting at least one starting frame, and determining the number of frames to be masked; starting from the at least one starting frame, setting to zero all energy values within that number of frames; and performing feature masking on the spectrogram in the frequency domain comprises: selecting at least one starting frequency sub-band, and determining the number of frequency sub-bands to be masked; starting from the at least one starting frequency sub-band, setting to zero all energy values within that number of frequency sub-bands;
an enhancement training unit, configured to perform enhancement training on the acoustic model according to the at least one feature-masked spectrogram; wherein, in each iteration of the enhancement training, feature masking is dynamically applied to the spectrogram until the acoustic model training converges.
6. A readable storage medium having executable instructions thereon that, when executed, cause a computer to perform the operations included in the method of any one of claims 1-4.
7. A computing device, comprising:
a processor; and
a memory storing executable instructions that, when executed, cause the processor to perform the operations included in the method of any one of claims 1-4.
CN201910661856.6A 2019-07-22 2019-07-22 Acoustic model enhancement training method and device, readable storage medium and computing equipment Active CN110379414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910661856.6A CN110379414B (en) 2019-07-22 2019-07-22 Acoustic model enhancement training method and device, readable storage medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910661856.6A CN110379414B (en) 2019-07-22 2019-07-22 Acoustic model enhancement training method and device, readable storage medium and computing equipment

Publications (2)

Publication Number Publication Date
CN110379414A CN110379414A (en) 2019-10-25
CN110379414B (en) 2021-12-03

Family

ID=68254760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910661856.6A Active CN110379414B (en) 2019-07-22 2019-07-22 Acoustic model enhancement training method and device, readable storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN110379414B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111370002B (en) * 2020-02-14 2022-08-19 平安科技(深圳)有限公司 Method and device for acquiring voice training sample, computer equipment and storage medium
CN111681639B (en) * 2020-05-28 2023-05-30 上海墨百意信息科技有限公司 Multi-speaker voice synthesis method, device and computing equipment
CN111862989B (en) * 2020-06-01 2024-03-08 北京捷通华声科技股份有限公司 Acoustic feature processing method and device
CN112017638A (en) * 2020-09-08 2020-12-01 北京奇艺世纪科技有限公司 Voice semantic recognition model construction method, semantic recognition method, device and equipment
CN113470628B (en) * 2021-07-14 2024-05-31 青岛信芯微电子科技股份有限公司 Voice recognition method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5621858A (en) * 1992-05-26 1997-04-15 Ricoh Corporation Neural network acoustic and visual speech recognition system training method and apparatus
US8189797B1 (en) * 2006-10-20 2012-05-29 Adobe Systems Incorporated Visual representation of audio data
CN108615535A (en) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 Sound enhancement method, device, intelligent sound equipment and computer equipment
CN108960281A (en) * 2018-05-24 2018-12-07 浙江工业大学 A kind of melanoma classification method based on nonrandom obfuscated data enhancement method
CN109346063A (en) * 2018-10-12 2019-02-15 电子科技大学 A kind of voice data Enhancement Method
CN109658354A (en) * 2018-12-20 2019-04-19 上海联影医疗科技有限公司 A kind of image enchancing method and system
CN109887515A (en) * 2019-01-29 2019-06-14 北京市商汤科技开发有限公司 Audio-frequency processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110379414A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110379414B (en) Acoustic model enhancement training method and device, readable storage medium and computing equipment
CN110491407B (en) Voice noise reduction method and device, electronic equipment and storage medium
CN110956957B (en) Training method and system of speech enhancement model
CN110070174B (en) Stable training method for generating confrontation network
CN106486130B (en) Noise elimination and voice recognition method and device
CN110211575B (en) Voice noise adding method and system for data enhancement
US9640194B1 (en) Noise suppression for speech processing based on machine-learning mask estimation
CN110379407B (en) Adaptive speech synthesis method, device, readable storage medium and computing equipment
CN114341979A (en) Method and apparatus for voice source separation based on convolutional neural network
CN110164465B (en) Deep-circulation neural network-based voice enhancement method and device
CN1210608A (en) Noisy speech parameter enhancement method and apparatus
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
CN102938254A (en) Voice signal enhancement system and method
CN109978137B (en) Processing method of convolutional neural network
CN106165015B (en) Apparatus and method for facilitating watermarking-based echo management
CN104505099A (en) Method and equipment for removing known interference in voice signal
WO2024027295A1 (en) Speech enhancement model training method and apparatus, enhancement method, electronic device, storage medium, and program product
CN110808030A (en) Voice awakening method, system, storage medium and electronic equipment
CN112382309A (en) Emotion recognition model training method, device, equipment and storage medium
CN111681639A (en) Multi-speaker voice synthesis method and device and computing equipment
CN111863003A (en) Voice data enhancement method and device
WO2024222373A1 (en) Audio noise reduction method and apparatus, device, storage medium and product
CN110097892B (en) Voice frequency signal processing method and device
CN113870878A (en) Speech enhancement
CN111354367A (en) Voice processing method and device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant