CN110797008A - Far-field speech recognition method, speech recognition model training method and server - Google Patents

Far-field speech recognition method, speech recognition model training method and server Download PDF

Info

Publication number
CN110797008A
Authority
CN
China
Prior art keywords
voice
voice data
band energy
time
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810775407.XA
Other languages
Chinese (zh)
Other versions
CN110797008B (en)
Inventor
薛少飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810775407.XA
Priority to PCT/CN2019/095075
Publication of CN110797008A
Application granted
Publication of CN110797008B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a far-field speech recognition method, a speech recognition model training method, and a server. The far-field speech recognition method comprises: acquiring voice data; determining whether the voice data is far-field voice data; and, in the case where the voice data is determined to be far-field voice data, recognizing the voice data through a speech recognition model, wherein the speech recognition model is trained on voice features obtained by performing band energy normalization on the voice features of the voice data according to time dimension information and frequency dimension information of the voice data. With the technical solution provided by the embodiments of the application, time dimension information and frequency dimension information are introduced into the band energy normalization process, so that the influence of time and frequency on recognition accuracy is weakened; performing far-field speech recognition based on this speech recognition model therefore effectively improves recognition accuracy, achieving the technical effect of effectively improving the recognition accuracy of the speech recognition model.

Description

Far-field speech recognition method, speech recognition model training method and server
Technical Field
The application belongs to the technical field of internet, and particularly relates to a far-field speech recognition method, a speech recognition model training method and a server.
Background
Far-field speech recognition is an important technology in the field of speech interaction, by which distant sounds can be recognized (e.g., speech at a distance of 1 m to 5 m). Far-field speech recognition is mainly applied in the smart home field, for example in devices such as smart speakers and smart televisions, and can also be applied to conference transcription.
However, in a real environment there is typically substantial interference such as noise, multipath reflection, and reverberation, which degrades the quality of the picked-up sound signal. For far-field speech recognition, the main cause of degraded recognition accuracy is the attenuation of speech energy with distance.
For the problem of reduced recognition accuracy of a speech model caused by speech energy attenuation, no effective solution has yet been provided.
Disclosure of Invention
The application aims to provide a far-field speech recognition method, a speech recognition model training method and a server so as to achieve the purpose of improving the recognition accuracy of a speech recognition model.
The application provides a far-field speech recognition method, a speech recognition model training method and a server, which are realized as follows:
a far-field speech recognition method, comprising:
acquiring voice data;
determining whether the voice data is far-field voice data;
and under the condition that the voice data is determined to be far-field voice data, recognizing the voice data through a voice recognition model, wherein the voice recognition model is obtained by training voice features obtained by performing band energy normalization on the voice features of the voice data according to time dimension information and frequency dimension information of the voice data.
A method of speech recognition model training, comprising:
acquiring voice features after filtering processing, wherein the voice features are extracted from voice data;
performing band energy normalization on the voice features through time dimension information and frequency dimension information of the voice data;
and training the speech recognition model according to the voice features obtained after band energy normalization.
A model training server comprising a processor and a memory for storing processor-executable instructions, the instructions when executed by the processor performing the steps of:
acquiring voice features after filtering processing, wherein the voice features are extracted from voice data;
performing band energy normalization on the voice features through time dimension information and frequency dimension information of the voice data;
and training the speech recognition model according to the voice features obtained after band energy normalization.
A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the above-described method.
According to the far-field speech recognition method, the speech recognition model training method, and the server provided by the application, band energy normalization is performed on the filtered voice features through the time dimension information and the frequency dimension information of the voice data, and the speech recognition model is trained according to the voice features obtained after band energy normalization. Because time dimension information and frequency dimension information are introduced in the band energy normalization process, the influence of time and frequency on speech recognition accuracy is weakened; performing far-field speech recognition based on this speech recognition model can therefore effectively improve recognition accuracy, achieving the technical effect of effectively improving the recognition accuracy of the speech recognition model.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present application; for those skilled in the art, other drawings can be obtained from these drawings without any creative effort.
FIG. 1 is a flow chart of a method for extracting Filter-Bank speech features;
FIG. 2 is a flow chart of a method for extracting static PCEN speech features;
FIG. 3 is a flow chart of a method of speech recognition model training provided herein;
FIG. 4 is a schematic diagram of a scenario of speech feature determination provided herein;
FIG. 5 is a schematic diagram of a training model provided herein;
FIG. 6 is an architecture diagram of a model training server provided herein;
fig. 7 is a block diagram of a speech recognition model training apparatus according to the present application.
Detailed Description
To make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
In addition to environmental noise, which reduces the accuracy of far-field speech recognition, speech energy is attenuated as distance increases, which likewise reduces accuracy. Moreover, in an actual sound scene, not only distance but also changes in the speaker's volume over time (between earlier and later utterances) influence recognition accuracy.
For a speech recognition model, speech features are generally extracted first and then input into the training model to train the speech recognition model.
In implementation, the following method can be adopted to extract the features: acquire continuous voice data, pre-emphasize the acquired voice data, frame the pre-emphasized voice data, window the framed voice data, apply an FFT (fast Fourier transform) to the windowed voice data, and filter the result through a mel filter bank to obtain the voice features.
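As a concrete illustration of this front end, the following Python sketch walks through the same steps: pre-emphasis, framing, windowing, FFT, and mel filtering. The sample rate, frame length, frame shift, FFT size, and number of mel filters are illustrative assumptions; the application does not specify them.

```python
import numpy as np

def mel_filterbank_features(signal, sample_rate=16000, frame_len=0.025,
                            frame_shift=0.010, n_fft=512, n_mels=40,
                            pre_emphasis=0.97):
    """Filter-Bank front end: pre-emphasis -> framing -> windowing -> FFT
    -> mel filter bank. Returns filter-bank energies E(t, f); the Log or
    PCEN compression described in the text is applied afterwards."""
    # 1. Pre-emphasis boosts high frequencies.
    x = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # 2. Framing (pad so at least one full frame exists).
    size = int(frame_len * sample_rate)
    step = int(frame_shift * sample_rate)
    x = np.pad(x, (0, max(0, size - len(x))))
    n_frames = 1 + (len(x) - size) // step
    idx = np.arange(size)[None, :] + step * np.arange(n_frames)[:, None]
    frames = x[idx]

    # 3. Windowing with a Hamming window.
    frames = frames * np.hamming(size)

    # 4. FFT -> power spectrum per frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 5. Triangular mel filter bank.
    hz2mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(hz2mel(0), hz2mel(sample_rate / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return power @ fbank.T  # shape: (n_frames, n_mels)
```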
Specifically, to make the speech recognition model trained on the extracted voice features more accurate, the voice features may be compressed after the voice data is filtered, for example, in one of the following two ways:
1) Extracting Filter-Bank speech features. As shown in fig. 1, after the voice data is filtered by the mel filter bank, the features are compressed into a range convenient for processing by a Log operation.
However, a plain Log operation has low resolution for low-energy audio features, which results in a loss of information from the voice data.
2) Extracting PCEN (per-channel energy normalization) speech features; the PCEN feature extraction process may be static or dynamic.
As shown in fig. 2, compared with Filter-Bank feature extraction, static extraction of PCEN speech features replaces the Log operation with the PCEN operation, which can be expressed as:
PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
M(t,f) = (1 - s) · M(t-1,f) + s · E(t,f)
where E(t,f) represents the filter-bank energy per time-frequency block, M(t,f) represents the intermediate smoothed energy, s represents the smoothing coefficient, and α, δ, r, ε are preset parameters that can be determined empirically, for example s = 0.025, α = 0.98, δ = 2, r = 0.5, ε = 0.000001.
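A minimal numpy rendering of these two formulas, using the example parameter values quoted above (initializing the smoother with the first frame is an assumption; the application does not state the initialization):

```python
import numpy as np

def static_pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Static PCEN. E: filter-bank energies of shape (T, F)."""
    M = np.empty_like(E)
    M[0] = E[0]  # assumed initialization
    for t in range(1, len(E)):
        # M(t,f) = (1 - s) * M(t-1,f) + s * E(t,f)
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    # PCEN(t,f) = (E / (eps + M)^alpha + delta)^r - delta^r
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```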
To extract dynamic PCEN speech features, PCEN can be set as one layer in a neural network, and the parameters in the PCEN formula are learned, thereby effectively improving the accuracy of the obtained speech features. In implementation, the static form can be understood as approximating an FIR filter: the parameters in the formula are specified in advance, with no feedback and no adaptation. Specifically, a plurality of smoothing coefficients s_k may be set so as to obtain a plurality of intermediate smoothed energies M_k(t,f), which are then weighted to obtain the final M(t,f). The dynamic PCEN formulas can be expressed as:
PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
M(t,f) = Σ_k z_k(f) · M_k(t,f)
M_k(t,f) = (1 - s_k) · M_k(t-1,f) + s_k · E(t,f)
where s_k may be preset parameter values and z_k(f) may be learned parameters; the other parameters may be preset or learned, which is not limited in the present application.
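The multi-smoother variant can be sketched the same way. The number of smoothers, the s_k values, and the uniform initial weights z_k(f) below are illustrative stand-ins for quantities that would be preset or learned in training:

```python
import numpy as np

def dynamic_pcen(E, s_list=(0.015, 0.08), z=None,
                 alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Dynamic PCEN: K smoothers M_k combined by per-frequency weights z_k(f)."""
    T, F = E.shape
    K = len(s_list)
    if z is None:
        z = np.full((K, F), 1.0 / K)  # assumed pre-training weights
    M_k = np.zeros((K, T, F))
    M_k[:, 0] = E[0]  # assumed initialization
    for k, s in enumerate(s_list):
        for t in range(1, T):
            # M_k(t,f) = (1 - s_k) * M_k(t-1,f) + s_k * E(t,f)
            M_k[k, t] = (1 - s) * M_k[k, t - 1] + s * E[t]
    # M(t,f) = sum_k z_k(f) * M_k(t,f)
    M = np.einsum('kf,ktf->tf', z, M_k)
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```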
However, the dynamic PCEN speech features extracted above consider only the influence of frequency on the intermediate smoothed energy. In the actual sound pickup process, recognition accuracy is affected not only by distance and frequency: if the speaker speaks softly first and then loudly, or loudly first and then softly, that is, if the volume differs between earlier and later utterances, the accuracy of speech recognition is also affected. In other words, time also affects the accuracy of speech recognition.
Therefore, in this example, adding a time influence factor to the dynamic PCEN feature extraction process can effectively improve recognition accuracy. Specifically, time dimension information can be introduced, so that the influence of time on recognition accuracy is reduced to a certain extent.
FIG. 3 is a flow chart of a method of one embodiment of a speech recognition model training method described herein. Although the present application provides method operational steps or apparatus configurations as illustrated in the following examples or figures, more or fewer operational steps or modular units may be included in the methods or apparatus based on conventional or non-inventive efforts. In the case of steps or structures which do not logically have the necessary cause and effect relationship, the execution sequence of the steps or the module structure of the apparatus is not limited to the execution sequence or the module structure described in the embodiments and shown in the drawings of the present application. When the described method or module structure is applied in an actual device or end product, the method or module structure according to the embodiments or shown in the drawings can be executed sequentially or executed in parallel (for example, in a parallel processor or multi-thread processing environment, or even in a distributed processing environment).
Specifically, as shown in fig. 3, a method for training a speech recognition model according to an embodiment of the present application may include the following steps:
step 301: acquiring voice features after filtering processing, wherein the voice features are extracted from voice data;
specifically, the following method can be adopted to extract the speech features: acquiring continuous voice data, pre-emphasizing the acquired voice data, framing the pre-emphasized voice data, windowing the framed voice data, performing FFT (fast Fourier transform) on the windowed voice data, and filtering the voice data through a MEL (melt echo enhancer) filter bank to obtain voice characteristics.
Step 302: performing band energy normalization on the voice features through time dimension information and frequency dimension information of the voice data;
Considering that in the actual sound pickup process recognition accuracy is affected not only by distance and frequency: if the speaker speaks softly first and then loudly, or loudly first and then softly, that is, if the volume differs between earlier and later utterances, the accuracy of speech recognition is also affected; in other words, time also affects the accuracy of speech recognition. Therefore, time dimension information can be introduced to perform band energy normalization on the voice features.
Specifically, the band energy normalization of the voice feature according to the time dimension information and the frequency dimension information of the voice data may include:
s1: determining a time influence parameter;
s2: weighting the intermediate smoothed energy of the previous moment and the energy of the time-frequency block of the current moment through the time influence parameter to obtain the intermediate smoothed energy of the current moment;
s3: performing band energy normalization on the voice features according to the intermediate smoothed energy of the current moment.
Step 303: training the speech recognition model according to the voice features obtained after band energy normalization.
In the above example, determining the time influence parameter may be: obtaining the band energy normalization result of the previous moment and then calculating the time influence parameter from it; or obtaining the band energy normalization result of the previous moment and the energy of the time-frequency block of the current moment, and calculating the time influence parameter from both.
The following description refers to specific formulas for calculating the time influence parameter; it should be noted, however, that the listed formulas are only exemplary, and the present application is not limited thereto.
The time influence parameter, which may also be referred to as an input gate, may be calculated according to one of the following formulas:
1) i_t(t,f) = σ(W_ir * PCEN(t-1,f) + bias)
2) i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * log(E(t,f)) + bias)
3) i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - E_M(f)) + bias)
4) i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - log(M(t-1,f))) + bias)
where i_t(t,f) represents the time influence parameter used to weight E(t,f) and M(t-1,f); W_ir represents the weight coefficients connecting the band energy normalization result PCEN(t-1,f) of the previous moment back to the time influence parameter of the current moment; W_ie represents the weight coefficients connecting the energy of the time-frequency block of the current moment to the time influence parameter of the current moment; bias represents a bias term; σ() represents the sigmoid function, an S-shaped function common in biology, also called an S-shaped growth curve; * denotes matrix multiplication; · denotes element-wise multiplication; t denotes time; f denotes frequency; E(t,f) represents the energy of the time-frequency block at the current moment; and E_M(f) represents the mean of log(E(t,f)) computed over global data, which can be fixed or learned during training.
After the time influence parameter has been calculated according to one of the above formulas, PCEN may be calculated according to the following formulas:
PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
M(t,f) = (1 - i_t(t,f)) · M(t-1,f) + i_t(t,f) · E(t,f)
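Putting the gate and the normalization together, the following numpy sketch implements the recurrence using formula 2) above. The weight matrices, bias, and initialization are illustrative; in the application they are learned together with the network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_recurrent_pcen(E, W_ir, W_ie, bias,
                         alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Gated-Recurrent-PCEN over E of shape (T, F).
    W_ir, W_ie: (F, F) matrices; bias: (F,)."""
    T, F = E.shape
    out = np.empty((T, F))
    M = E[0]  # assumed warm start of the smoother
    out[0] = (E[0] / (eps + M) ** alpha + delta) ** r - delta ** r
    for t in range(1, T):
        # i_t(t,f) = sigmoid(W_ir * PCEN(t-1,f) + W_ie * log E(t,f) + bias)
        i_t = sigmoid(W_ir @ out[t - 1] + W_ie @ np.log(E[t] + eps) + bias)
        # M(t,f) = (1 - i_t) * M(t-1,f) + i_t * E(t,f)
        M = (1 - i_t) * M + i_t * E[t]
        out[t] = (E[t] / (eps + M) ** alpha + delta) ** r - delta ** r
    return out
```

Because the gate depends on the previous PCEN output, the effective smoothing coefficient now varies with time as well as frequency, which is exactly the time dimension information described above.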
In the above example, band energy normalization may be used as one layer in the neural network acoustic model; that is, band energy normalization (which may be referred to as Gated-Recurrent-PCEN) may serve as a band energy normalization layer in the training model of the speech recognition model to train the speech recognition model.
Fig. 4 is a schematic diagram of Gated-Recurrent-PCEN as one layer in the neural network acoustic model, where BLSTM (bidirectional long short-term memory neural network) may represent one or more BLSTM hidden layers, DNN may represent one or more DNN layers, and BLSTM + DNN is a typical speech recognition acoustic model structure. That is, in this example, a Gated-Recurrent-PCEN (band energy normalization) layer is inserted between the input and the BLSTM, and the parameters of the Gated-Recurrent-PCEN layer are adjusted along with the training of the network.
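As a sketch of how such a layer might sit in the acoustic model of fig. 4, the following PyTorch module inserts a trainable Gated-Recurrent-PCEN layer (using formula 2) between the filter-bank input and a BLSTM + DNN stack. Layer counts, hidden sizes, the output dimension, and the parameter initialization are assumptions, not values from the application:

```python
import torch
import torch.nn as nn

class GatedRecurrentPCEN(nn.Module):
    """Trainable band energy normalization layer (illustrative sketch)."""
    def __init__(self, n_mels, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
        super().__init__()
        self.W_ir = nn.Parameter(torch.randn(n_mels, n_mels) * 0.01)
        self.W_ie = nn.Parameter(torch.randn(n_mels, n_mels) * 0.01)
        self.bias = nn.Parameter(torch.zeros(n_mels))
        self.alpha, self.delta, self.r, self.eps = alpha, delta, r, eps

    def _compress(self, E, M):
        return (E / (self.eps + M) ** self.alpha + self.delta) ** self.r \
               - self.delta ** self.r

    def forward(self, E):                       # E: (batch, T, n_mels)
        M = E[:, 0]                             # assumed warm start
        prev = self._compress(E[:, 0], M)
        out = [prev]
        for t in range(1, E.size(1)):
            gate = torch.sigmoid(prev @ self.W_ir.T
                                 + torch.log(E[:, t] + self.eps) @ self.W_ie.T
                                 + self.bias)
            M = (1 - gate) * M + gate * E[:, t]
            prev = self._compress(E[:, t], M)
            out.append(prev)
        return torch.stack(out, dim=1)

class AcousticModel(nn.Module):
    """Gated-Recurrent-PCEN -> BLSTM hidden layers -> DNN, as in fig. 4."""
    def __init__(self, n_mels=40, hidden=512, n_targets=3000):
        super().__init__()
        self.pcen = GatedRecurrentPCEN(n_mels)
        self.blstm = nn.LSTM(n_mels, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_targets))

    def forward(self, feats):                   # feats: filter-bank energies
        x = self.pcen(feats)                    # PCEN parameters train with the net
        x, _ = self.blstm(x)
        return self.dnn(x)
```

In training, gradients flow through the gate recurrence, so the normalization parameters are adjusted along with the BLSTM and DNN weights, as the text describes.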
In the above example, time dimension information is introduced through the feedback-based adjustment of the filter coefficient (the input gate i_t(t,f)); that is, the band energy is normalized in a manner similar to an FIR filter, so that performance loss can be effectively reduced compared with an IIR filter, and performance can be improved especially when the data volume is large.
The effect of the above method is described below with reference to a set of actual experimental results. In this example, recorded real far-field data is used as the test set: 1000 real far-field recordings made at distances of 1 m to 5 m, containing music, interfering human voices, and other environmental noise. Based on this, the results shown in Table 1 below were obtained:
TABLE 1
Speech feature extraction method     Test data (word error rate, %)
Ordinary Log filter-bank             36
Static PCEN                          33.7
Dynamic PCEN                         28.4
Gated-Recurrent-PCEN                 26.5
As can be seen from Table 1, band energy normalization using the method of this example brings a word error rate reduction of about 7% (relative to dynamic PCEN).
The above method can be used in smart home devices such as smart speakers and smart televisions, or in any voice interaction system.
In the above example, it is considered that the accuracy of far-field speech recognition is much lower than that of near-field speech recognition. This is mainly because distance greatly attenuates speech energy, which greatly reduces recognition accuracy; speech with too little energy is usually mismatched with the recognition model to a large extent, further reducing accuracy.
Specifically, this processing of voice features may be applied to the scenario shown in fig. 5. After a user utters voice data, a sound pickup device (e.g., a smart speaker, a smart television, or a conference transcription device) picks up the voice data and transmits it to a voice processing device (e.g., a processor) for processing. After acquiring the continuous voice data, the processor processes it: pre-emphasizing the acquired voice data, framing the pre-emphasized voice data, windowing the framed voice data, applying an FFT to the windowed voice data, and filtering through a mel filter bank to obtain the voice features. After the voice features are obtained, the method provided in the above example can be adopted to perform band energy normalization on the voice features through the time dimension information and the frequency dimension information of the voice data, thereby mitigating the degradation of the picked-up signal caused by the large amount of noise, multipath reflection, and reverberation in a real environment and obtaining the final band-energy-normalized voice features.
After the band-energy-normalized voice features are obtained, a speech recognition model can be called to perform speech recognition on them, or the speech recognition model can be trained on them so that its recognition accuracy is higher. The specific application scenario is not limited in the present application and may be selected according to actual needs.
Based on this, an embodiment of the present application further provides a far-field speech recognition method, which may include the following steps:
step 1: acquiring voice features after filtering processing, wherein the voice features are extracted from voice data;
Specifically, the following method can be adopted to extract the speech features: acquire continuous voice data, pre-emphasize the acquired voice data, frame the pre-emphasized voice data, window the framed voice data, apply an FFT (fast Fourier transform) to the windowed voice data, and filter the result through a mel filter bank to obtain the voice features.
Step 2: performing band energy normalization on the voice features through time dimension information and frequency dimension information of the voice data;
Considering that in the actual sound pickup process recognition accuracy is affected not only by distance and frequency: if the speaker speaks softly first and then loudly, or loudly first and then softly, that is, if the volume differs between earlier and later utterances, the accuracy of speech recognition is also affected; in other words, time also affects the accuracy of speech recognition. Therefore, time dimension information can be introduced to perform band energy normalization on the voice features.
Step 3: inputting the voice features obtained after band energy normalization into the speech recognition model for speech recognition.
In the above example, determining the time influence parameter may be: obtaining the band energy normalization result of the previous moment and then calculating the time influence parameter from it; or obtaining the band energy normalization result of the previous moment and the energy of the time-frequency block of the current moment, and calculating the time influence parameter from both.
The specific data processing steps in this speech recognition method are similar to those in the speech recognition model training method described above and are not repeated here.
Further, in an embodiment of the present application, a far-field speech recognition method is further provided, which may include the following steps:
s1: acquiring voice data;
s2: determining whether the voice data is far-field voice data;
s3: in the case where the voice data is determined to be far-field voice data, recognizing the voice data through the speech recognition model obtained by the above speech recognition model training method.
That is, the speech recognition model can be applied to the recognition of far-field voice data and can effectively improve the recognition accuracy of far-field voice data.
The method provided by the embodiments of the present application can be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking execution on the server side as an example, fig. 6 is a block diagram of the hardware structure of a server for the speech recognition model training method according to an embodiment of the present invention. As shown in fig. 6, the server 10 may include one or more processors 102 (only one is shown; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. It will be understood by those skilled in the art that the structure shown in fig. 6 is only an illustration and does not limit the structure of the electronic device. For example, the server 10 may include more or fewer components than shown in fig. 6, or have a different configuration from that shown in fig. 6.
The memory 104 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the speech recognition model training method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implementing the speech recognition model training method of the application program. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In terms of software, the above speech recognition model training apparatus may be as shown in fig. 7 and includes an acquisition module 701, a normalization module 702, and a training module 703, wherein:
the acquisition module 701 is configured to acquire the filtered voice features, where the voice features are extracted from voice data;
the normalization module 702 is configured to perform band energy normalization on the voice features according to the time dimension information and the frequency dimension information of the voice data;
and the training module 703 is configured to train the speech recognition model according to the voice features obtained after band energy normalization.
In one embodiment, the normalization module 702 may perform band energy normalization on the voice features with the time dimension information and the frequency dimension information of the voice data according to the following steps:
s1: determining a time influence parameter;
s2: weighting the intermediate smoothed energy of the previous moment and the energy of the time-frequency block of the current moment through the time influence parameter to obtain the intermediate smoothed energy of the current moment;
s3: performing band energy normalization on the voice features according to the intermediate smoothed energy of the current moment.
In one embodiment, determining the time influence parameter may include: acquiring the band energy normalization result of the previous moment; and calculating the time influence parameter according to the band energy normalization result of the previous moment.
In one embodiment, determining the time influence parameter according to the band energy normalization result of the previous moment may include: multiplying the weight coefficient matrix by the band energy normalization result of the previous moment to obtain a first result, where the weight coefficient matrix connects the band energy normalization result of the previous moment back to the time influence parameter of the current moment; adding a bias to the first result to obtain a second result; and applying the sigmoid function to the second result to obtain the time influence parameter.
For example, the time influence parameter may be calculated according to the following formula:
i_t(t,f) = σ(W_ir * PCEN(t-1,f) + bias)
where i_t(t,f) represents the time influence parameter; W_ir represents the weight coefficients connecting the band energy normalization result PCEN(t-1,f) of the previous moment back to the time influence parameter of the current moment; bias represents a bias term; σ() represents the sigmoid function; * denotes matrix multiplication; t denotes time; and f denotes frequency.
In one embodiment, determining the time influence parameter may include: acquiring the band energy normalization result of the previous moment and the energy of the time-frequency block of the current moment; and calculating the time influence parameter according to the band energy normalization result of the previous moment and the energy of the time-frequency block of the current moment.
In one embodiment, the time influence parameter may be calculated from the band energy normalization result of the previous moment and the energy of the time-frequency block of the current moment according to one of the following formulas:
i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * log(E(t,f)) + bias)
i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - E_M(f)) + bias)
i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - log(M(t-1,f))) + bias)
where i_t(t,f) represents the time influence parameter; W_ir represents the weight coefficients connecting the band energy normalization result PCEN(t-1,f) of the previous moment back to the time influence parameter of the current moment; W_ie represents the weight coefficients connecting the energy of the time-frequency block of the current moment to the time influence parameter of the current moment; bias represents a bias term; σ() represents the sigmoid function; * denotes matrix multiplication; t denotes time; f denotes frequency; E(t,f) represents the energy of the time-frequency block at the current moment; and E_M(f) represents the mean of log(E(t,f)) computed over global data.
In one embodiment, band energy normalization is used as a band energy normalization layer in the training model of the speech recognition model to train the speech recognition model.
In one embodiment, the band energy normalization layer may be located between the input of the training model and the bidirectional long short-term memory (BLSTM) neural network layer.
According to the speech recognition model training method and the server described above, band energy normalization is performed on the filtered voice features through the time dimension information and the frequency dimension information of the voice data, and the speech recognition model is trained according to the voice features obtained after band energy normalization. Because time dimension information and frequency dimension information are introduced in the band energy normalization process, the influence of time and frequency on speech recognition accuracy is weakened, achieving the technical effect of effectively improving the recognition accuracy of the speech recognition model.
Although the present application provides method steps as described in an embodiment or flowchart, additional or fewer steps may be included based on conventional or non-inventive efforts. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or client product executes, it may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures.
The apparatuses or modules illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. The functionality of the modules may be implemented in the same one or more software and/or hardware implementations of the present application. Of course, a module that implements a certain function may be implemented by a plurality of sub-modules or sub-units in combination.
The methods, apparatus, or modules described herein may be implemented by computer-readable program code in a controller implemented in any suitable manner; for example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be implemented entirely by logically programming the method steps, such that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be considered a hardware component, and the means included therein for performing the various functions may also be considered structures within the hardware component; or even the means for performing the functions may be regarded as both software modules implementing the method and structures within the hardware component.
Some of the modules in the apparatus described herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus the necessary hardware. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product or realized in the course of data migration. The computer software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to perform the methods described in the various embodiments, or parts of the embodiments, of the present application.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. All or portions of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
While the present application has been described with examples, those of ordinary skill in the art will appreciate that there are numerous variations and permutations of the present application without departing from its spirit, and it is intended that the appended claims encompass such variations and permutations.

Claims (12)

1. A far-field speech recognition method, comprising:
acquiring voice data;
determining whether the voice data is far-field voice data;
and under the condition that the voice data is determined to be far-field voice data, recognizing the voice data through a voice recognition model, wherein the voice recognition model is obtained by training voice features obtained by performing band energy normalization on the voice features of the voice data according to time dimension information and frequency dimension information of the voice data.
2. The method of claim 1, further comprising:
acquiring voice features after filtering processing, wherein the voice features are extracted from voice data;
performing band energy normalization on the voice features through time dimension information and frequency dimension information of the voice data;
and training according to the voice features obtained after band energy normalization to obtain the speech recognition model.
3. The method of claim 2, wherein performing band energy normalization on the voice features through time dimension information and frequency dimension information of the voice data comprises:
determining a time influence parameter;
weighting the intermediate smoothed energy of the previous moment and the energy of the time-frequency block of the current moment through the time influence parameter to obtain the intermediate smoothed energy of the current moment;
and performing band energy normalization on the voice features according to the intermediate smoothed energy of the current moment.
4. The method of claim 3, wherein determining the time influence parameter comprises:
acquiring the band energy normalization result of the previous moment;
and calculating the time influence parameter according to the band energy normalization result of the previous moment.
5. The method of claim 4, wherein determining the time influence parameter according to the band energy normalization result of the previous moment comprises:
multiplying the weight coefficient matrix by the band energy normalization result of the previous moment to obtain a first result, wherein the weight coefficient matrix connects the band energy normalization result of the previous moment back to the time influence parameter of the current moment;
adding a bias to the first result to obtain a second result;
and applying the sigmoid function to the second result to obtain the time influence parameter.
6. The method of claim 3, wherein determining the time influence parameter comprises:
acquiring the band energy normalization result of the previous moment and the energy of the time-frequency block of the current moment;
and calculating the time influence parameter according to the band energy normalization result of the previous moment and the energy of the time-frequency block of the current moment.
7. The method according to any one of claims 1 to 6, wherein band energy normalization is used as a band energy normalization layer in the training model of the speech recognition model to train the speech recognition model.
8. The method of claim 7, wherein the band energy normalization layer is located between the input of the training model and the bidirectional long short-term memory (BLSTM) neural network layer.
9. A method for training a speech recognition model, comprising:
acquiring voice features after filtering processing, wherein the voice features are extracted from voice data;
performing band energy normalization on the voice features through time dimension information and frequency dimension information of the voice data;
and training the speech recognition model according to the voice features obtained after band energy normalization.
10. A far-field speech recognition device comprising a processor and a memory for storing processor-executable instructions that, when executed by the processor, implement the method of any of claims 1 to 8.
11. A model training server comprising a processor and a memory for storing processor-executable instructions that when executed by the processor implement the method of claim 9.
12. A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 8.
CN201810775407.XA 2018-07-16 2018-07-16 Far-field voice recognition method, voice recognition model training method and server Active CN110797008B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810775407.XA CN110797008B (en) 2018-07-16 2018-07-16 Far-field voice recognition method, voice recognition model training method and server
PCT/CN2019/095075 WO2020015546A1 (en) 2018-07-16 2019-07-08 Far-field speech recognition method, speech recognition model training method, and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810775407.XA CN110797008B (en) 2018-07-16 2018-07-16 Far-field voice recognition method, voice recognition model training method and server

Publications (2)

Publication Number Publication Date
CN110797008A 2020-02-14
CN110797008B CN110797008B (en) 2024-03-29

Family

ID=69164997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810775407.XA Active CN110797008B (en) 2018-07-16 2018-07-16 Far-field voice recognition method, voice recognition model training method and server

Country Status (2)

Country Link
CN (1) CN110797008B (en)
WO (1) WO2020015546A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331177A (en) * 2020-11-05 2021-02-05 携程计算机技术(上海)有限公司 Rhythm-based speech synthesis method, model training method and related equipment
CN112331186A (en) * 2020-11-19 2021-02-05 苏州思必驰信息科技有限公司 Voice wake-up method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005020212A1 (en) * 2003-08-22 2005-03-03 Sharp Kabushiki Kaisha Signal analysis device, signal processing device, speech recognition device, signal analysis program, signal processing program, speech recognition program, recording medium, and electronic device
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN107346659A (en) * 2017-06-05 2017-11-14 百度在线网络技术(北京)有限公司 Audio recognition method, device and terminal based on artificial intelligence
CN107452372A (en) * 2017-09-22 2017-12-08 百度在线网络技术(北京)有限公司 The training method and device of far field speech recognition modeling
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium
US20180053512A1 (en) * 2016-08-22 2018-02-22 Intel Corporation Reverberation compensation for far-field speaker recognition
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
US20180197533A1 (en) * 2017-01-11 2018-07-12 Google Llc Systems and Methods for Recognizing User Speech

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683677B (en) * 2015-11-06 2021-11-12 阿里巴巴集团控股有限公司 Voice recognition method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005020212A1 (en) * 2003-08-22 2005-03-03 Sharp Kabushiki Kaisha Signal analysis device, signal processing device, speech recognition device, signal analysis program, signal processing program, speech recognition program, recording medium, and electronic device
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
US20180053512A1 (en) * 2016-08-22 2018-02-22 Intel Corporation Reverberation compensation for far-field speaker recognition
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
US20180197533A1 (en) * 2017-01-11 2018-07-12 Google Llc Systems and Methods for Recognizing User Speech
CN107346659A (en) * 2017-06-05 2017-11-14 百度在线网络技术(北京)有限公司 Audio recognition method, device and terminal based on artificial intelligence
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium
CN107452372A (en) * 2017-09-22 2017-12-08 百度在线网络技术(北京)有限公司 The training method and device of far field speech recognition modeling

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331177A (en) * 2020-11-05 2021-02-05 携程计算机技术(上海)有限公司 Rhythm-based speech synthesis method, model training method and related equipment
CN112331186A (en) * 2020-11-19 2021-02-05 苏州思必驰信息科技有限公司 Voice wake-up method and device

Also Published As

Publication number Publication date
CN110797008B (en) 2024-03-29
WO2020015546A1 (en) 2020-01-23

Similar Documents

Publication Publication Date Title
Krueger et al. Model-based feature enhancement for reverberant speech recognition
CN110956957B (en) Training method and system of speech enhancement model
CN110600017A (en) Training method of voice processing model, voice recognition method, system and device
CN108417224B (en) Training and recognition method and system of bidirectional neural network model
CN108899044A (en) Audio signal processing method and device
CN112581973B (en) Voice enhancement method and system
CN104050971A (en) Acoustic echo mitigating apparatus and method, audio processing apparatus, and voice communication terminal
CN102549659A (en) Suppressing noise in an audio signal
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
CN111429932A (en) Voice noise reduction method, device, equipment and medium
CN106033673B (en) A kind of near-end voice signals detection method and device
CN110797008B (en) Far-field voice recognition method, voice recognition model training method and server
CN113299306B (en) Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
CN111681649B (en) Speech recognition method, interaction system and achievement management system comprising system
Yu et al. Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection.
Leutnant et al. Bayesian feature enhancement for reverberation and noise robust speech recognition
Kamarudin et al. Acoustic echo cancellation using adaptive filtering algorithms for Quranic accents (Qiraat) identification
Mallidi et al. Robust speaker recognition using spectro-temporal autoregressive models.
CN110648681B (en) Speech enhancement method, device, electronic equipment and computer readable storage medium
CN103270772A (en) Signal processing device, signal processing method, and signal processing program
CN114373473A (en) Simultaneous noise reduction and dereverberation through low-delay deep learning
Lu et al. Temporal modulation normalization for robust speech feature extraction and recognition
CN111933172A (en) Method and device for separating and extracting human voice, computer equipment and storage medium
CN113299308A (en) Voice enhancement method and device, electronic equipment and storage medium
Seyedin et al. New features using robust MVDR spectrum of filtered autocorrelation sequence for robust speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40023521

Country of ref document: HK

GR01 Patent grant