CN110797008B - Far-field voice recognition method, voice recognition model training method and server - Google Patents

Far-field voice recognition method, voice recognition model training method and server

Info

Publication number
CN110797008B
CN110797008B (application CN201810775407.XA)
Authority
CN
China
Prior art keywords
voice
time
band energy
voice data
dimension information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810775407.XA
Other languages
Chinese (zh)
Other versions
CN110797008A (en)
Inventor
薛少飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810775407.XA priority Critical patent/CN110797008B/en
Priority to PCT/CN2019/095075 priority patent/WO2020015546A1/en
Publication of CN110797008A publication Critical patent/CN110797008A/en
Application granted granted Critical
Publication of CN110797008B publication Critical patent/CN110797008B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems

Abstract

The application provides a far-field speech recognition method, a speech recognition model training method and a server. The far-field speech recognition method comprises: acquiring speech data; determining whether the speech data is far-field speech data; and, when the speech data is far-field speech data, recognizing the speech data through a speech recognition model, wherein the speech recognition model is trained on speech features obtained by performing band energy normalization on the speech features of the speech data according to time dimension information and frequency dimension information of the speech data. With the technical solution provided by the embodiments of the application, because time dimension information and frequency dimension information are introduced into the band energy normalization, the influence of time and frequency on speech recognition accuracy can be weakened; performing far-field speech recognition based on this speech recognition model therefore effectively improves recognition accuracy, achieving the technical effect of improving the recognition accuracy of the speech recognition model.

Description

Far-field voice recognition method, voice recognition model training method and server
Technical Field
The application belongs to the technical field of Internet, and particularly relates to a far-field voice recognition method, a voice recognition model training method and a server.
Background
Far-field speech recognition is an important technology in the field of speech interaction: it recognizes speech picked up at a distance (for example, speech from 1 m to 5 m away). Far-field speech recognition is mainly applied in the smart-home field, for example in devices such as smart speakers and smart televisions, and can also be applied in fields such as conference transcription.
However, because a real environment contains a great deal of noise, multipath reflection, reverberation and other interference, the quality of the picked-up sound signal degrades. For far-field speech recognition, the main cause of degraded recognition accuracy is the attenuation of speech energy with distance.
No effective solution has yet been proposed for the problem of reduced recognition accuracy of speech models caused by speech energy attenuation.
Disclosure of Invention
The application aims to provide a far-field speech recognition method, a speech recognition model training method and a server, so as to improve the recognition accuracy of a speech recognition model.
The application provides a far-field speech recognition method, a speech recognition model training method and a server, which are realized as follows:
a far field speech recognition method, comprising:
acquiring speech data;
determining whether the speech data is far-field speech data;
and, when the speech data is far-field speech data, recognizing the speech data through a speech recognition model, wherein the speech recognition model is trained on speech features obtained by performing band energy normalization on the speech features of the speech data according to time dimension information and frequency dimension information of the speech data.
A speech recognition model training method, comprising:
acquiring filtered speech features, wherein the speech features are extracted from speech data;
performing band energy normalization on the speech features through time dimension information and frequency dimension information of the speech data;
and training a speech recognition model according to the speech features obtained after band energy normalization.
A model training server comprising a processor and a memory for storing processor-executable instructions, the processor implementing the following steps when executing the instructions:
acquiring filtered speech features, wherein the speech features are extracted from speech data;
performing band energy normalization on the speech features through time dimension information and frequency dimension information of the speech data;
and training a speech recognition model according to the speech features obtained after band energy normalization.
A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the above method.
According to the far-field speech recognition method, the speech recognition model training method and the server, band energy normalization is performed on the filtered speech features through the time dimension information and frequency dimension information of the speech data, and the speech recognition model is trained according to the speech features obtained after band energy normalization. Because time dimension information and frequency dimension information are introduced into the band energy normalization, the influence of time and frequency on speech recognition accuracy can be weakened; performing far-field speech recognition based on this model therefore effectively improves recognition accuracy, achieving the technical effect of improving the recognition accuracy of the speech recognition model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of a method for extracting Filter-Bank speech features;
FIG. 2 is a flow chart of a method of extracting static PCEN speech features;
FIG. 3 is a method flow diagram of a speech recognition model training method provided herein;
FIG. 4 is a schematic illustration of a scenario of speech feature determination provided herein;
FIG. 5 is a schematic illustration of a training model provided herein;
FIG. 6 is a schematic diagram of the architecture of a model training server provided herein;
fig. 7 is a block diagram of a speech recognition model training apparatus provided in the present application.
Detailed Description
In order to better understand the technical solutions in the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
Beyond environmental noise, a major factor that reduces far-field speech recognition accuracy is the attenuation of speech energy as the distance changes. In an actual sound scene, however, distance is not the only influence: changes in the speaker's volume between earlier and later moments also affect speech recognition accuracy.
For a speech recognition model, speech features are generally extracted first and then input into a training model to train the speech recognition model.
When implemented, features may be extracted as follows: acquire continuous speech data, pre-emphasize the acquired speech data, frame the pre-emphasized speech data, window the framed speech data, apply an FFT to the windowed speech data, and filter the result through a Mel filter bank to obtain the speech features.
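For concreteness, the following is a minimal NumPy sketch of this extraction pipeline. It is a sketch only: the pre-emphasis coefficient, frame length, hop size, Hamming window, FFT size and filter count are illustrative assumptions, not values fixed by this application.

```python
import numpy as np

def mel_filter_bank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the Mel scale.
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        for k in range(bins[m - 1], bins[m]):          # rising edge
            fb[m - 1, k] = (k - bins[m - 1]) / max(bins[m] - bins[m - 1], 1)
        for k in range(bins[m], bins[m + 1]):          # falling edge
            fb[m - 1, k] = (bins[m + 1] - k) / max(bins[m + 1] - bins[m], 1)
    return fb

def extract_fbank(signal, sr=16000, frame_len=400, hop=160, n_mels=40, n_fft=512):
    # Pre-emphasis on the acquired continuous speech data.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing into overlapping frames, then windowing each frame.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    # FFT: power spectrum of each windowed frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # Mel filtering yields the filter-bank energies E(t, f).
    return power @ mel_filter_bank(sr, n_fft, n_mels).T
```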
Specifically, to make the speech recognition model trained on the extracted speech features more accurate, the speech features may be compressed after the speech data is filtered. For example, the speech features may be obtained in the following two ways:
1) Filter-Bank speech features are extracted. As shown in fig. 1, after the speech data is filtered by the Mel filter bank, the filter-bank output is compressed into a range convenient to process by a Log operation.
However, a plain Log operation has relatively low resolution for low-energy audio features, resulting in a loss of information from the speech data.
2) PCEN (per-channel energy normalization) speech features are extracted. PCEN feature extraction can take two forms: static PCEN feature extraction and dynamic PCEN feature extraction.
As shown in fig. 2, static PCEN feature extraction differs from Filter-Bank feature extraction in that the Log operation is replaced with a PCEN operation, which may be expressed as:
M(t,f) = (1-s)·M(t-1,f) + s·E(t,f)
PCEN(t,f) = (E(t,f)/(ε+M(t,f))^α + δ)^r - δ^r
where E(t,f) denotes the filter-bank energy of each time-frequency block, M(t,f) denotes the intermediate smoothed energy, s denotes the smoothing coefficient, and α, δ, r, ε are predetermined parameters whose values may be set empirically, for example: s=0.025, α=0.98, δ=2, r=0.5, ε=0.000001. It should be noted, however, that these parameter settings are only exemplary; other values may be used in actual implementations.
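As a concrete sketch, the static PCEN operation above can be written in a few lines of NumPy; initializing the smoother with the first frame (M(0,f)=E(0,f)) is an assumption here, since the text does not specify the initial state:

```python
import numpy as np

def static_pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    # E: filter-bank energies, shape (time, frequency).
    M = np.zeros_like(E)
    M[0] = E[0]  # assumed initialization of the smoother
    for t in range(1, E.shape[0]):
        # First-order smoothing over time, per frequency channel.
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    # Gain normalization followed by root compression.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```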
In dynamic PCEN feature extraction, PCEN is set as one layer of the neural network, and the parameters in the PCEN operation formula are learned, which effectively improves the quality of the resulting speech features. In implementation, this can be understood as approximating an FIR filter: the parameters in the formula are fixed, with no feedback. Specifically, several smoothing coefficients s_k may be set to obtain several intermediate smoothed energies M_k(t,f), and these intermediate smoothed energies are then weighted to obtain the final M(t,f). The operation may be expressed as:
M_k(t,f) = (1-s_k)·M_k(t-1,f) + s_k·E(t,f)
M(t,f) = Σ_k z_k(f)·M_k(t,f)
where s_k may be preset parameter values and z_k(f) may be parameters obtained by learning; the other parameters may be preset or obtained by learning, which is not limited in this application.
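A sketch of this multi-smoother variant follows. In an actual network, z_k(f) (and possibly α, δ, r) would be learned; here they are plain NumPy values chosen for illustration:

```python
import numpy as np

def dynamic_pcen(E, s_list=(0.015, 0.08), z=None,
                 alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    # E: filter-bank energies, shape (time, frequency).
    T, F = E.shape
    if z is None:
        z = np.full((len(s_list), F), 1.0 / len(s_list))  # uniform weights z_k(f)
    M = np.zeros((T, F))
    for k, s_k in enumerate(s_list):
        # Intermediate smoothed energy M_k(t, f) for smoothing coefficient s_k.
        M_k = np.zeros((T, F))
        M_k[0] = E[0]  # assumed initialization
        for t in range(1, T):
            M_k[t] = (1.0 - s_k) * M_k[t - 1] + s_k * E[t]
        M += z[k] * M_k  # weighted combination over k
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```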
However, the dynamic PCEN feature extraction above only considers the influence of frequency on the intermediate smoothed energy. In the actual sound-pickup process, not only distance and frequency affect recognition accuracy: if the speaker's volume changes between earlier and later speech (for example, speaking loudly and then softly), this also affects the accuracy of speech recognition; that is, time affects the accuracy of speech recognition.
Therefore, this example considers that adding time as an influencing factor to dynamic PCEN feature extraction can effectively improve recognition accuracy. Specifically, time dimension information may be introduced, so that the impact of time on recognition accuracy is reduced to some extent.
FIG. 3 is a method flow diagram of one embodiment of a speech recognition model training method described herein. Although the present application provides a method operation step or apparatus structure as shown in the following examples or figures, more or fewer operation steps or module units may be included in the method or apparatus based on routine or non-inventive labor. In the steps or structures where there is no necessary causal relationship logically, the execution order of the steps or the module structure of the apparatus is not limited to the execution order or the module structure shown in the drawings and described in the embodiments of the present application. The described methods or module structures may be implemented sequentially or in parallel (e.g., in a parallel processor or multithreaded environment, or even in a distributed processing environment) in accordance with the embodiments or the method or module structure connection illustrated in the figures when implemented in a practical device or end product application.
As shown in fig. 3, the method for training a speech recognition model according to an embodiment of the present application may include the following steps:
step 301: acquiring voice characteristics after filtering processing, wherein the voice characteristics are extracted from voice data;
specifically, the following manner may be adopted to extract the voice features: continuous voice data are acquired, pre-emphasis is carried out on the acquired voice data, framing processing is carried out on the pre-emphasis voice data, windowing processing is carried out on the framed voice data, FFT conversion is carried out on the windowed voice data, and filtering is carried out on the voice data through an MEL filter bank, so that voice characteristics are obtained.
Step 302: performing band energy normalization on the speech features through the time dimension information and frequency dimension information of the speech data;
In the actual sound-pickup process, not only distance and frequency affect recognition accuracy: if the speaker first speaks loudly and then softly, or first softly and then loudly, i.e., the volume differs between earlier and later speech, this also affects the accuracy of speech recognition; that is, time affects the accuracy of speech recognition. Therefore, time dimension information can be introduced to perform band energy normalization on the speech features.
Specifically, performing band energy normalization on the speech features through time dimension information and frequency dimension information of the speech data may include:
S1: determining a time-affecting parameter;
S2: weighting the intermediate smoothed energy of the previous time and the energy of the time-frequency block of the current time through the time-affecting parameter, to obtain the intermediate smoothed energy of the current time;
S3: performing band energy normalization on the speech features according to the intermediate smoothed energy of the current time.
Step 303: training the speech recognition model according to the speech features obtained after band energy normalization.
In the above example, determining the time-affecting parameter may be done by obtaining the band energy normalization result of the previous time and calculating the time-affecting parameter from it; or by obtaining the energy of the time-frequency block at the current time together with the band energy normalization result of the previous time, and calculating the time-affecting parameter from both.
Specific formulas for calculating the time-affecting parameter are described below; it should be noted that these formulas are only illustrative, and the application is not limited to them.
The time-affecting parameter, which may also be referred to as an input gate, may be calculated according to one of the following formulas:
1) i_t(t,f) = σ(W_ir*PCEN(t-1,f) + bias)
2) i_t(t,f) = σ(W_ir*PCEN(t-1,f) + W_ie*log(E(t,f)) + bias)
3) i_t(t,f) = σ(W_ir*PCEN(t-1,f) + W_ie*(log(E(t,f)) - E_M(f)) + bias)
4) i_t(t,f) = σ(W_ir*PCEN(t-1,f) + W_ie*(log(E(t,f)) - log(M(t-1,f))) + bias)
where i_t(t,f) denotes the time-affecting parameter used to weight E(t,f) against M(t-1,f); W_ir denotes the weight matrix connecting the band energy normalization result PCEN(t-1,f) of the previous time back to the time-affecting parameter of the current time; W_ie denotes the weight matrix connecting the energy of the time-frequency block at the current time to the time-affecting parameter of the current time; bias denotes the bias; σ() denotes the sigmoid function, a common S-shaped growth curve; * denotes matrix multiplication; t denotes time; f denotes frequency; · denotes element-wise multiplication; E(t,f) denotes the energy of the time-frequency block at the current time; and E_M(f) denotes the mean of log(E(t,f)) computed over the global data, a parameter that may be fixed or learned during training.
After the time-affecting parameter is calculated according to one of the above formulas, the intermediate smoothed energy used in the PCEN operation may be calculated as:
M(t,f) = (1-i_t(t,f))·M(t-1,f) + i_t(t,f)·E(t,f)
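Putting the gate and the gated update together, a NumPy sketch of one possible forward pass is shown below, using gate formula 1). W_ir and bias would be learned during training; the initial smoother and PCEN states are assumptions, since the text does not specify them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_recurrent_pcen(E, W_ir, bias, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    # E: filter-bank energies, shape (time, frequency).
    T, F = E.shape
    out = np.zeros((T, F))
    M_prev = E[0].copy()        # assumed initialization of M(t-1, f)
    pcen_prev = np.zeros(F)     # assumed zero initial PCEN(t-1, f)
    for t in range(T):
        # Input gate: feedback from the previous PCEN result (formula 1).
        i_t = sigmoid(W_ir @ pcen_prev + bias)
        # Gated weighting of previous smoothed energy vs. current frame energy.
        M = (1.0 - i_t) * M_prev + i_t * E[t]
        out[t] = (E[t] / (eps + M) ** alpha + delta) ** r - delta ** r
        M_prev, pcen_prev = M, out[t]
    return out

# Example usage with placeholder parameters:
# E = np.abs(np.random.randn(100, 40))  # stand-in filter-bank energies
# y = gated_recurrent_pcen(E, W_ir=0.1 * np.random.randn(40, 40), bias=np.zeros(40))
```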
in the above example, the band energy gauge may be used as one layer of the acoustic model of the neural network, that is, the band energy gauge (may be referred to as a measured-current-PCEN) may be used as the band energy gauge in the training model of the speech recognition model to train the speech recognition model.
Fig. 4 is a schematic diagram of using Gated-Recurrent-PCEN as one layer in a neural-network acoustic model, where BLSTM (Bidirectional Long Short-Term Memory) may represent one or more BLSTM hidden layers and DNN may represent one or more DNN layers; BLSTM+DNN is a typical speech recognition acoustic model structure. In this example, the Gated-Recurrent-PCEN (band energy normalization) layer is inserted between the input and the BLSTM, and the parameters of the Gated-Recurrent-PCEN layer are adjusted as the network trains.
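To illustrate the layer placement, the following PyTorch sketch (PyTorch is an assumption here; the application names no framework) inserts a trainable Gated-Recurrent-PCEN layer between the input features and a BLSTM+DNN stack; layer sizes and initial parameter values are arbitrary:

```python
import torch
import torch.nn as nn

class GatedRecurrentPCEN(nn.Module):
    def __init__(self, n_freq, eps=1e-6):
        super().__init__()
        self.W_ir = nn.Parameter(0.1 * torch.randn(n_freq, n_freq))
        self.bias = nn.Parameter(torch.zeros(n_freq))
        # alpha, delta, r are adjusted along with the rest of the network.
        self.alpha = nn.Parameter(torch.full((n_freq,), 0.98))
        self.delta = nn.Parameter(torch.full((n_freq,), 2.0))
        self.r = nn.Parameter(torch.full((n_freq,), 0.5))
        self.eps = eps

    def forward(self, E):  # E: (batch, time, freq) filter-bank energies
        outs, M = [], E[:, 0]
        pcen_prev = torch.zeros_like(E[:, 0])
        for t in range(E.size(1)):
            i_t = torch.sigmoid(pcen_prev @ self.W_ir.T + self.bias)
            M = (1.0 - i_t) * M + i_t * E[:, t]
            out = (E[:, t] / (self.eps + M) ** self.alpha
                   + self.delta) ** self.r - self.delta ** self.r
            outs.append(out)
            pcen_prev = out
        return torch.stack(outs, dim=1)

class AcousticModel(nn.Module):
    def __init__(self, n_freq=40, hidden=320, n_targets=3000):
        super().__init__()
        self.pcen = GatedRecurrentPCEN(n_freq)  # inserted before the BLSTM
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.dnn = nn.Linear(2 * hidden, n_targets)

    def forward(self, feats):  # feats: (batch, time, n_freq)
        x = self.pcen(feats)
        x, _ = self.blstm(x)
        return self.dnn(x)      # per-frame acoustic scores
```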
In the above example, because feedback-based band energy normalization is adopted, the time dimension information (the input gate i_t(t,f)) acts on the filter coefficients. Compared with normalizing band energy by approximating an FIR filter, this can effectively reduce the performance loss relative to an IIR filter, and the performance gain is especially notable when the amount of data is large.
The effect of the above method is illustrated below with a set of actual experimental results. In this example, recorded real far-field test data is used as the test set; it contains 1000 real far-field recordings at distances of 1 m to 5 m, including music, human-voice interference, etc. On this basis, the results shown in table 1 were obtained:
TABLE 1
| Speech feature extraction method | Test data (word error rate, %) |
| Plain Log Filter-Bank | 36 |
| Static PCEN | 33.7 |
| Dynamic PCEN | 28.4 |
| Gated-Recurrent-PCEN | 26.5 |
As can be seen from table 1, the band energy normalization of this example reduces the word error rate by about 7% relative to dynamic PCEN.
The method can be used in any smart-home device, such as a smart speaker or a smart television, or in a voice interaction system.
In the above example, the accuracy of far-field speech recognition is much lower than that of near-field speech recognition, mainly because distance greatly attenuates speech energy: speech whose energy is too small generally mismatches the recognition model to a larger degree, reducing speech recognition accuracy. In practical application scenes, both the distance between the person and the microphone and changes in the person's volume attenuate the speech energy to different degrees. Therefore, in this example, band energy normalization is performed through time dimension information and frequency dimension information to obtain the speech features, thereby improving the recognition accuracy of the finally trained model.
Specifically, this processing of speech features can be applied in the scene shown in fig. 5. After the user utters speech, a sound-receiving device (for example, a smart speaker, a smart television, a conference transcription device, and the like) picks up the speech data and transfers it to a speech processing device (for example, a processor) for processing. After obtaining the continuous speech data, the processor can process it (for example: pre-emphasize the acquired speech data, frame the pre-emphasized data, window the framed data, apply an FFT to the windowed data, and filter through a Mel filter bank) to obtain the speech features. Once the speech features are obtained, band energy normalization can be applied to them through the time dimension information and frequency dimension information of the speech data in the manner provided in the above example, so as to weaken the signal-quality problems caused by the large amount of noise, multipath reflection and reverberation present in a real environment, and obtain the final band-energy-normalized speech features.
After the band-energy-normalized speech features are obtained, a speech recognition model can be invoked to perform speech recognition on them, or a speech recognition model can be trained on them so that its recognition accuracy is higher. The specific application scenario is not limited in this application and may be chosen according to actual needs.
Based on this, an embodiment of the present application further provides a far-field speech recognition method that may include the steps of:
step 1: acquiring voice characteristics after filtering processing, wherein the voice characteristics are extracted from voice data;
specifically, the following manner may be adopted to extract the voice features: continuous voice data are acquired, pre-emphasis is carried out on the acquired voice data, framing processing is carried out on the pre-emphasis voice data, windowing processing is carried out on the framed voice data, FFT conversion is carried out on the windowed voice data, and filtering is carried out on the voice data through an MEL filter bank, so that voice characteristics are obtained.
Step 2: performing band energy regulation on the voice characteristics through the time dimension information and the frequency dimension information of the voice data;
considering that in the actual sound pickup process, not only the distance and frequency have an influence on the recognition accuracy, but also the speaker speaks loudly and then speaks loudly if speaking loudly first or speaking loudly first and speaking loudly, that is, the volume of front and back speaking is different, which has an influence on the accuracy of the voice recognition, that is, the time has an influence on the accuracy of the voice recognition. Therefore, the time dimension information can be introduced to perform band energy normalization on the voice features.
Step 3: and inputting the voice characteristics obtained after the band energy is regulated into a voice recognition model for voice recognition.
In the above example, the determining the time-affecting parameter may be obtaining a band energy gauge result of a previous time, and then calculating to obtain the time-affecting parameter according to the band energy gauge result of the previous time; or, acquiring the energy of the time-frequency block at the current time and the frequency band energy gauge result at the previous time, and calculating to obtain the time influence parameter according to the frequency band energy gauge result at the previous time and the energy of the time-frequency block at the current time.
The specific data processing steps in the speech recognition method are similar to those in the speech recognition model training method described above, and the description thereof will not be repeated.
Further, in the embodiment of the present application, there is also provided a far-field speech recognition method, which may include the following steps:
s1: acquiring voice data;
s2: determining whether the speech data is far-field speech data;
s3: and under the condition that the voice data is far-field voice data, recognizing the voice data by using the voice recognition model obtained by training by the voice recognition model training method.
That is, the voice recognition model can be applied to voice recognition of remote voice data, and recognition accuracy of far-field voice data can be effectively improved.
The method embodiments provided by the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or similar computing device. Taking the operation on the server side as an example, fig. 6 is a block diagram of the hardware structure of a server of a speech recognition model training method according to an embodiment of the present invention. As shown in fig. 6, the server 10 may include one or more (only one is shown in the figure) processors 102 (the processors 102 may include, but are not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 6 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the server 10 may also include more or fewer components than shown in FIG. 6, or have a different configuration than shown in FIG. 6.
The memory 104 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the speech recognition model training method in the embodiment of the present invention, and the processor 102 executes the software programs and modules stored in the memory 104 to perform various functional applications and data processing, that is, implement the speech recognition model training method of the application program. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is used to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
At the software level, the speech recognition model training apparatus may, as shown in fig. 7, include: an acquisition module 701, a normalization module 702 and a training module 703. Wherein:
an acquisition module 701, configured to acquire filtered speech features, where the speech features are extracted from speech data;
a normalization module 702, configured to perform band energy normalization on the speech features through the time dimension information and frequency dimension information of the speech data;
a training module 703, configured to train the speech recognition model according to the speech features obtained after band energy normalization.
In one embodiment, the normalization module 702 may perform band energy normalization on the speech features through the time dimension information and frequency dimension information of the speech data according to the following steps:
S1: determining a time-affecting parameter;
S2: weighting the intermediate smoothed energy of the previous time and the energy of the time-frequency block of the current time through the time-affecting parameter, to obtain the intermediate smoothed energy of the current time;
S3: performing band energy normalization on the speech features according to the intermediate smoothed energy of the current time.
In one embodiment, determining the time-affecting parameter may include: acquiring a band energy gauge adjustment result at the previous moment; and calculating to obtain a time influence parameter according to the frequency band energy regulation result at the previous moment.
In one embodiment, determining the time-affecting parameter according to the band energy adjustment result of the previous time may include: multiplying the weight coefficient matrix by a frequency band energy gauge adjustment result at the previous moment to obtain a first result, wherein the weight coefficient is a weight coefficient of a time influence parameter connected with the current moment by the frequency band energy gauge adjustment result at the previous moment; adding bias to the first result to obtain a second result; and solving a sigmoid for the second result to obtain the time influence parameter.
For example, the time-affecting parameter may be calculated according to the following formula:
i_t(t,f) = σ(W_ir*PCEN(t-1,f) + bias)
where i_t(t,f) denotes the time-affecting parameter, W_ir denotes the weight matrix connecting the band energy normalization result PCEN(t-1,f) of the previous time back to the time-affecting parameter of the current time, bias denotes the bias, σ() denotes the sigmoid function, * denotes matrix multiplication, t denotes time, and f denotes frequency.
In one embodiment, determining the time-affecting parameter may include: obtaining the band energy normalization result of the previous time and the energy of the time-frequency block of the current time; and calculating the time-affecting parameter from the band energy normalization result of the previous time and the energy of the time-frequency block of the current time.
In one embodiment, the time-affecting parameter may be calculated from the band energy normalization result of the previous time and the energy of the time-frequency block of the current time according to one of the following formulas:
i_t(t,f) = σ(W_ir*PCEN(t-1,f) + W_ie*log(E(t,f)) + bias)
i_t(t,f) = σ(W_ir*PCEN(t-1,f) + W_ie*(log(E(t,f)) - E_M(f)) + bias)
i_t(t,f) = σ(W_ir*PCEN(t-1,f) + W_ie*(log(E(t,f)) - log(M(t-1,f))) + bias)
where i_t(t,f) denotes the time-affecting parameter, W_ir denotes the weight matrix connecting the band energy normalization result PCEN(t-1,f) of the previous time back to the time-affecting parameter of the current time, W_ie denotes the weight matrix connecting the energy of the time-frequency block at the current time to the time-affecting parameter of the current time, bias denotes the bias, σ() denotes the sigmoid function, * denotes matrix multiplication, t denotes time, f denotes frequency, E(t,f) denotes the energy of the time-frequency block at the current time, and E_M(f) denotes the mean of log(E(t,f)) computed over the global data.
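As a small sketch of the variants that use the current time-frequency energy, the third formula above (log E(t,f) minus log M(t-1,f)) can be computed as follows; W_ir, W_ie and bias are placeholders that would be learned during training, and the eps guard inside the logarithms is an added assumption for numerical safety:

```python
import numpy as np

def input_gate_log_ratio(pcen_prev, E_t, M_prev, W_ir, W_ie, bias, eps=1e-6):
    # i_t(t,f) = sigmoid(W_ir*PCEN(t-1,f) + W_ie*(log E(t,f) - log M(t-1,f)) + bias)
    pre = (W_ir @ pcen_prev
           + W_ie @ (np.log(E_t + eps) - np.log(M_prev + eps))
           + bias)
    return 1.0 / (1.0 + np.exp(-pre))
```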
In one embodiment, the speech recognition model is trained with the band energy normalization serving as a band energy normalization layer in the training model of the speech recognition model.
In one embodiment, the band energy normalization layer may be located between the input of the training model and the bidirectional long short-term memory (BLSTM) neural network layer.
According to the speech recognition model training method and the server, band energy normalization is performed on the filtered speech features through the time dimension information and frequency dimension information of the speech data, and the speech recognition model is trained according to the speech features obtained after band energy normalization. Because time dimension information and frequency dimension information are introduced into the band energy normalization, the influence of time and frequency on speech recognition accuracy can be weakened, achieving the technical effect of effectively improving the recognition accuracy of the speech recognition model.
Although the present application provides method operational steps as described in the examples or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented by an actual device or client product, the instructions may be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment) as shown in the embodiments or figures.
The apparatus or module set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. For convenience of description, the above devices are described as being functionally divided into various modules, respectively. The functions of the various modules may be implemented in the same piece or pieces of software and/or hardware when implementing the present application. Of course, a module that implements a certain function may be implemented by a plurality of sub-modules or a combination of sub-units.
The methods, apparatus or modules described herein may be implemented in computer readable program code in any suitable manner. For example, a controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in pure computer readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the means included within it for implementing various functions can also be regarded as structures within the hardware component, or even as both software modules implementing the methods and structures within the hardware component.
Some of the modules of the apparatus described herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the description of the embodiments above, it will be apparent to those skilled in the art that the present application may be implemented in software plus necessary hardware. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, or may be embodied in the implementation of data migration. The computer software product may be stored on a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., comprising instructions for causing a computer device (which may be a personal computer, mobile terminal, server, or network device, etc.) to perform the methods described in various embodiments or portions of embodiments herein.
Various embodiments in this specification are described in a progressive manner; identical or similar parts can be referred to across embodiments, and each embodiment focuses on its differences from the others. All or portions of the present application can be used in numerous general purpose or special purpose computer system environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Although the present application has been described by way of example, those of ordinary skill in the art will recognize that there are many variations and modifications of the present application without departing from the spirit of the present application, and it is intended that the appended claims encompass such variations and modifications without departing from the spirit of the present application.

Claims (8)

1. A far-field speech recognition method, comprising:
acquiring filtered speech features, wherein the speech features are extracted from acquired speech data;
determining whether the speech data is far-field speech data;
when the speech data is far-field speech data, recognizing the speech data through a speech recognition model, wherein the speech recognition model is trained on speech features obtained by performing band energy normalization on the speech features of the speech data according to time dimension information and frequency dimension information of the speech data;
wherein performing band energy normalization on the speech features through the time dimension information and frequency dimension information of the speech data comprises: determining a time-affecting parameter; weighting the intermediate smoothed energy of the previous time and the energy of the time-frequency block of the current time through the time-affecting parameter, to obtain the intermediate smoothed energy of the current time; and performing band energy normalization on the speech features according to the intermediate smoothed energy of the current time;
wherein determining the time-affecting parameter comprises: obtaining the band energy normalization result of the previous time; matrix-multiplying the weight coefficient by the band energy normalization result of the previous time to obtain a first result, where the weight coefficient connects the band energy normalization result of the previous time to the time-affecting parameter of the current time; adding a bias to the first result to obtain a second result; and applying a sigmoid to the second result to obtain the time-affecting parameter.
2. The method as recited in claim 1, further comprising:
acquiring filtered speech features, wherein the speech features are extracted from speech data;
performing band energy normalization on the speech features through the time dimension information and frequency dimension information of the speech data;
and training according to the speech features obtained after band energy normalization to obtain the speech recognition model.
3. A method according to claim 1 or 2, wherein the speech recognition model is trained with the band energy normalization serving as a band energy normalization layer in the training model of the speech recognition model.
4. A method according to claim 3, wherein the band energy normalization layer is located between the input of the training model and the bidirectional long short-term memory (BLSTM) neural network layer.
5. A method for training a speech recognition model, comprising:
acquiring filtered speech features, wherein the speech features are extracted from speech data;
performing band energy normalization on the speech features through time dimension information and frequency dimension information of the speech data;
training a speech recognition model according to the speech features obtained after band energy normalization;
wherein performing band energy normalization on the speech features through the time dimension information and frequency dimension information of the speech data comprises: determining a time-affecting parameter; weighting the intermediate smoothed energy of the previous time and the energy of the time-frequency block of the current time through the time-affecting parameter, to obtain the intermediate smoothed energy of the current time; and performing band energy normalization on the speech features according to the intermediate smoothed energy of the current time;
wherein determining the time-affecting parameter comprises: obtaining the band energy normalization result of the previous time; matrix-multiplying the weight coefficient by the band energy normalization result of the previous time to obtain a first result, where the weight coefficient connects the band energy normalization result of the previous time to the time-affecting parameter of the current time; adding a bias to the first result to obtain a second result; and applying a sigmoid to the second result to obtain the time-affecting parameter.
6. A far field speech recognition device comprising a processor and a memory for storing processor executable instructions, which when executed by the processor implement the method of any one of claims 1 to 4.
7. A model training server comprising a processor and a memory for storing processor-executable instructions, the processor implementing the method of claim 5 when executing the instructions.
8. A computer readable storage medium having stored thereon computer instructions which when executed implement the steps of the method of any of claims 1 to 4.
CN201810775407.XA 2018-07-16 2018-07-16 Far-field voice recognition method, voice recognition model training method and server Active CN110797008B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810775407.XA CN110797008B (en) 2018-07-16 2018-07-16 Far-field voice recognition method, voice recognition model training method and server
PCT/CN2019/095075 WO2020015546A1 (en) 2018-07-16 2019-07-08 Far-field speech recognition method, speech recognition model training method, and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810775407.XA CN110797008B (en) 2018-07-16 2018-07-16 Far-field voice recognition method, voice recognition model training method and server

Publications (2)

Publication Number Publication Date
CN110797008A CN110797008A (en) 2020-02-14
CN110797008B true CN110797008B (en) 2024-03-29

Family

ID=69164997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810775407.XA Active CN110797008B (en) 2018-07-16 2018-07-16 Far-field voice recognition method, voice recognition model training method and server

Country Status (2)

Country Link
CN (1) CN110797008B (en)
WO (1) WO2020015546A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331177A (en) * 2020-11-05 2021-02-05 携程计算机技术(上海)有限公司 Rhythm-based speech synthesis method, model training method and related equipment
CN112331186B (en) * 2020-11-19 2022-03-25 思必驰科技股份有限公司 Voice wake-up method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005020212A1 (en) * 2003-08-22 2005-03-03 Sharp Kabushiki Kaisha Signal analysis device, signal processing device, speech recognition device, signal analysis program, signal processing program, speech recognition program, recording medium, and electronic device
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN107346659A (en) * 2017-06-05 2017-11-14 百度在线网络技术(北京)有限公司 Audio recognition method, device and terminal based on artificial intelligence
CN107452372A (en) * 2017-09-22 2017-12-08 百度在线网络技术(北京)有限公司 The training method and device of far field speech recognition modeling
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683677B (en) * 2015-11-06 2021-11-12 阿里巴巴集团控股有限公司 Voice recognition method and device
US10096321B2 (en) * 2016-08-22 2018-10-09 Intel Corporation Reverberation compensation for far-field speaker recognition
US10672387B2 (en) * 2017-01-11 2020-06-02 Google Llc Systems and methods for recognizing user speech

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005020212A1 (en) * 2003-08-22 2005-03-03 Sharp Kabushiki Kaisha Signal analysis device, signal processing device, speech recognition device, signal analysis program, signal processing program, speech recognition program, recording medium, and electronic device
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN107346659A (en) * 2017-06-05 2017-11-14 百度在线网络技术(北京)有限公司 Audio recognition method, device and terminal based on artificial intelligence
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium
CN107452372A (en) * 2017-09-22 2017-12-08 百度在线网络技术(北京)有限公司 The training method and device of far field speech recognition modeling

Also Published As

Publication number Publication date
CN110797008A (en) 2020-02-14
WO2020015546A1 (en) 2020-01-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40023521)
GR01 Patent grant