CN112530453B - Voice recognition method and device suitable for noise environment - Google Patents

Voice recognition method and device suitable for noise environment

Info

Publication number
CN112530453B
Authority
CN
China
Prior art keywords
voice
noisy
speech
dialogue
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011355810.0A
Other languages
Chinese (zh)
Other versions
CN112530453A (en)
Inventor
余翠琳
周文略
陈家聪
梁艳阳
王天雷
冯伟霞
秦传波
翟懿奎
朱翠娥
刘始匡
黎繁胜
蒋润锦
张俊亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhixiang Technology Jiangmen Co ltd
Wuyi University
Original Assignee
Zhixiang Technology Jiangmen Co ltd
Wuyi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhixiang Technology Jiangmen Co ltd, Wuyi University filed Critical Zhixiang Technology Jiangmen Co ltd
Priority to CN202011355810.0A priority Critical patent/CN112530453B/en
Publication of CN112530453A publication Critical patent/CN112530453A/en
Application granted granted Critical
Publication of CN112530453B publication Critical patent/CN112530453B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides a voice recognition method suitable for a noise environment, which comprises the following steps: receiving to-be-processed noisy conversational speech uploaded by a voice acquisition device; performing voice enhancement processing on the to-be-processed noisy conversational speech to extract its voice features; searching, according to the voice features, the voice recognition parameter set for the target voice recognition parameter setting value of the to-be-processed noisy conversational speech; and sending the target voice recognition parameter setting value to the voice acquisition device, so that the voice acquisition device performs voice recognition on received voice data according to the target voice recognition parameter setting value. Implementing the invention can improve voice-signal input through a microphone in various noisy environments and realize high-precision automatic voice recognition.

Description

Voice recognition method and device suitable for noise environment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus suitable for use in a noise environment.
Background
Since the beginning of the 21st century, speech recognition research in China has developed rapidly: a considerable number of excellent enterprises have emerged, some fields have reached a world-leading level, and products with high market share have been produced; for example, handheld translation devices are now widely used by people traveling abroad. Currently, acoustic models based on deep neural networks have significantly improved speech recognition performance, especially under near-field conditions. In practical applications, however, far-field and reverberant speech recognition remains a challenging problem.
Robust speech recognition in practical application environments is a common concern of the signal processing and speech recognition fields, and has been one of the most challenging tasks of recent decades. One major reason is that the target speech is contaminated by various background noises. At present, consumer electronics equipped with microphone arrays (e.g., car navigation devices and headsets) typically employ gradient-method-based speech enhancement techniques to cope with additive noise. However, although these techniques were originally developed for voice communication and can maximize the signal-to-distortion ratio (SDR), they do not always maximize automatic speech recognition (ASR) accuracy.
Therefore, it is desirable to provide a method that maximizes the accuracy of a given automatic speech recognizer by automatically adjusting the front-end speech enhancement function. With such a method, even as the environmental noise changes, voice can be input through the helmet's microphone, high-precision automatic speech recognition can be achieved algorithmically, and voice commands for the helmet's various functions can be issued. A genetic algorithm (GA) is used to generate front-end speech enhancement parameter values for each particular environment. By clustering the environments in advance based on their noise characteristics, the generated values can be dynamically assigned to the input speech signal.
Disclosure of Invention
In view of the foregoing problems in the prior art, an object of the present invention is to provide a speech recognition method suitable for use in a noise environment, including:
receiving to-be-processed noisy conversation voice uploaded by voice acquisition equipment;
carrying out voice enhancement processing on the to-be-processed noisy dialogue voice so as to extract voice characteristics in the to-be-processed noisy dialogue voice;
searching a target voice recognition parameter set value of the to-be-processed noisy dialogue voice from a voice recognition parameter set according to the voice characteristics;
and sending the target voice recognition parameter set value to the voice acquisition equipment so that the voice acquisition equipment performs voice recognition on the received voice data according to the target voice recognition parameter set value.
Further, the method for constructing the speech recognition parameter value set comprises the following steps:
acquiring a plurality of groups of noisy conversational speech under different noise environments, a speech recognition parameter value set and sentence texts corresponding to the noisy conversational speech;
clustering the multiple groups of noisy dialogue voices in different noise environments, and distributing an initial voice recognition parameter value for each type of noisy dialogue voice from the voice recognition parameter value set;
performing voice recognition on each type of the noisy dialogue voice by using the initial voice recognition parameter value, comparing a recognition result with sentence texts corresponding to the noisy dialogue voice, and adjusting each initial parameter candidate value according to the comparison result;
and obtaining the voice recognition parameter value set according to the adjusted initial parameter candidate value.
Further, the performing voice enhancement processing on the to-be-processed noisy conversational voice includes:
and carrying out voice enhancement processing on the to-be-processed noisy dialogue voice by utilizing a multi-channel wiener filter.
Further, the noisy conversational speech includes: coherent and incoherent sound sources;
the voice enhancement processing is carried out on the to-be-processed noisy dialogue voice by utilizing the multi-channel wiener filter, and the method comprises the following steps:
determining a transfer function of the coherent sound source and a transfer function of the incoherent sound source according to the noisy conversational speech;
processing the transfer function of the coherent sound source and the transfer function of the incoherent sound source by using a beam former to obtain the power spectral density of the coherent sound source and the power spectral density of the residual noise;
processing the power spectral density of the coherent sound source and the power spectral density of the residual noise by using a wiener post-filter to obtain the power spectral density of background noise;
and acquiring corresponding voice characteristics according to the power spectral density of the background noise.
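The pipeline above ends with a Wiener post-filter that combines the estimated power spectral densities. As a minimal sketch of that last step (not the patent's exact filter), the per-bin gain G = S / (S + N) can be computed from a coherent-source PSD estimate and a residual-noise PSD estimate; the gain floor and the toy PSD estimates below are illustrative assumptions:

```python
import numpy as np

def wiener_postfilter_gain(psd_coherent, psd_residual_noise, floor=1e-3):
    """Single-channel Wiener post-filter gain per frequency bin.

    psd_coherent: estimated PSD of the coherent (target) source
    psd_residual_noise: estimated PSD of the residual/background noise
    Both are non-negative arrays of shape (n_bins,). The gain
    G = S / (S + N) attenuates bins dominated by noise; `floor`
    limits attenuation to reduce musical-noise artifacts.
    """
    psd_s = np.maximum(psd_coherent, 0.0)
    psd_n = np.maximum(psd_residual_noise, 0.0)
    gain = psd_s / (psd_s + psd_n + 1e-12)  # small constant avoids 0/0
    return np.maximum(gain, floor)

# Apply the gain to one STFT frame of the beamformer output (toy data)
rng = np.random.default_rng(0)
frame = rng.standard_normal(257) + 1j * rng.standard_normal(257)
psd_s = np.abs(frame) ** 2 * 0.8      # hypothetical target-PSD estimate
psd_n = np.full(257, 0.5)             # hypothetical noise-PSD estimate
enhanced = wiener_postfilter_gain(psd_s, psd_n) * frame
```

In practice the two PSD estimates would come from the beamformer output and the residual-noise estimate described above, not from the toy values used here.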
In another aspect, the present invention provides a speech recognition method suitable for use in a noise environment, including:
collecting a to-be-processed noisy conversation voice;
uploading the to-be-processed noisy conversation voice to a terminal server so that the terminal server feeds back a target voice recognition parameter set value based on the noisy conversation voice;
and carrying out voice recognition on the noisy dialogue voice based on the received target voice recognition parameter set value.
In another aspect, the present invention provides a speech recognition apparatus suitable for use in a noisy environment, comprising:
the receiving module is configured to execute receiving of the to-be-processed noisy conversation voice uploaded by the voice acquisition equipment;
a voice feature extraction module configured to perform voice enhancement processing on the to-be-processed noisy speech so as to extract voice features in the to-be-processed noisy speech;
a searching module configured to perform searching for a target speech recognition parameter setting value of the to-be-processed noisy conversational speech from a speech recognition parameter value set according to the speech feature;
and the sending module is configured to execute sending the target voice recognition parameter set value to the voice acquisition equipment so that the voice acquisition equipment performs voice recognition on the received voice data according to the target voice recognition parameter set value.
Further, the search module comprises a parameter set setting module; the parameter set setting module includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire noisy dialogue voices, a voice recognition parameter value set and sentence texts corresponding to the noisy dialogue voices in multiple groups of different noise environments;
a clustering unit configured to perform clustering on the multiple groups of noisy conversational speech in different noise environments, and assign an initial speech recognition parameter value to each type of noisy conversational speech from the speech recognition parameter value set;
the adjusting unit is configured to perform voice recognition on each type of the noisy dialogue voice by using the initial voice recognition parameter value, compare a recognition result with a sentence text corresponding to the noisy dialogue voice, and adjust each initial parameter candidate value according to the comparison result;
and the parameter set acquisition unit is configured to acquire the voice recognition parameter value set according to the adjusted initial parameter candidate value.
In another aspect, the present invention provides a speech recognition apparatus suitable for use in a noisy environment, comprising:
the acquisition module is configured to acquire the to-be-processed noisy conversational speech;
an uploading module configured to upload the to-be-processed noisy conversation voice to a terminal server so that the terminal server feeds back a target voice recognition parameter setting value based on the noisy conversation voice;
a speech recognition module configured to perform speech recognition on the noisy conversational speech based on the received target speech recognition parameter setting value.
In another aspect, the present invention provides a computer-readable storage medium, in which at least one instruction or at least one program is stored, and the at least one instruction or the at least one program is loaded and executed by a processor to implement a speech recognition method suitable for use in a noise environment as described in any of the above.
In another aspect, the present invention provides a speech recognition device adapted for use in noisy environments, comprising at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements a method of speech recognition suitable for use in noisy environments as described above by executing the instructions stored by the memory.
The voice recognition method and the voice recognition device which are suitable for the noise environment have the following beneficial effects:
the embodiments of the present description provide high performance automatic speech recognition for the dispenser by applying automatic parameter setting of front-end speech enhancement functions to the helmet's microphone capture signal. Since the delivery environment of the delivery person may have various noises, adjusting the parameter set according to the noise condition will improve the automatic speech recognition accuracy. Parameter setting value search and automatic speech recognition are run on the terminal server computer. If the accuracy of automatic speech recognition is low, their values can be searched again and updated to improve the quality of service. After the optimal parameter setting value is found, the optimal parameter setting value can be transmitted to an automatic voice recognition system of the helmet through the added port to realize real-time intelligent recognition. The parameter set selection is combined with the front-end voice enhancement function and is operated simultaneously with the voice enhancement function on the helmet, so that a distributor can realize voice recognition command control through a microphone.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description of the embodiment or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a multi-channel wiener filter according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a first speech recognition method suitable for use in a noise environment according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating a fourth speech recognition method suitable for use in a noise environment according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of a fifth speech recognition method suitable for use in a noise environment according to an embodiment of the present invention;
fig. 6 is a flowchart illustrating a sixth speech recognition method suitable for use in a noise environment according to an embodiment of the present invention;
fig. 7 is a flowchart illustrating a seventh speech recognition method suitable for use in a noise environment according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a speech recognition apparatus suitable for use in a noise environment according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of another speech recognition apparatus suitable for use in a noise environment according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a speech recognition apparatus suitable for use in a noise environment according to an embodiment of the present invention.
The voice recognition system comprises a receiving module 610, a voice feature extraction module 620, a searching module 630, a sending module 640, a collecting module 810, an uploading module 820 and a voice recognition module 830.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or device. In order to facilitate understanding of the technical solutions and the technical effects thereof described in the embodiments of the present specification, the embodiments of the present specification first explain related terms:
the frequency bin is the number following the decimal point, such as 810 in "72.810". This is achieved by using a dedicated "crystal oscillator" associated with the remote control device. The purpose of this function is to make many people use the remote control equipment simultaneously to distinguish and use different frequency points, and not to interfere with each other.
Power spectral density: in physics, signals are typically in the form of waves, such as electromagnetic waves, random vibrations, or acoustic waves. The power carried by a wave per unit frequency is obtained when the spectral density of the wave is multiplied by an appropriate coefficient, which is called the Power Spectral Density (PSD) or Spectral Power Distribution (SPD) of the signal. The unit of power spectral density is usually expressed in watts per hertz (W/Hz), or in wavelengths rather than frequencies, i.e. watts per nanometer (W/nm).
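As a small illustration of the definition above, the one-sided PSD of a sampled signal can be estimated with a periodogram: the squared FFT magnitude scaled to power per unit frequency (W/Hz). The 1 kHz tone and noise level below are arbitrary example values:

```python
import numpy as np

fs = 16000                 # sample rate (Hz), typical for speech
n = fs                     # one second of signal
t = np.arange(n) / fs
rng = np.random.default_rng(1)
# a 1 kHz tone buried in white noise
x = np.sin(2 * np.pi * 1000 * t) + 0.5 * rng.standard_normal(n)

# One-sided periodogram: power per unit frequency (W/Hz when x is a
# voltage across 1 ohm)
spectrum = np.fft.rfft(x)
psd = (np.abs(spectrum) ** 2) / (fs * n)
psd[1:-1] *= 2  # fold negative frequencies into the one-sided estimate
freqs = np.fft.rfftfreq(n, d=1 / fs)

peak_hz = freqs[np.argmax(psd)]  # dominated by the 1 kHz tone
```

The doubling of all bins except DC and Nyquist is the usual one-sided convention; integrating `psd` over `freqs` approximates the signal's total power.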
Referring to fig. 1 in the specification, a schematic diagram of an implementation environment provided by an embodiment of the present invention is shown, and as shown in fig. 1, the implementation environment may include at least a terminal server 110 and a voice collecting device 120.
It is understood that the voice capturing device 120 may communicate with the terminal server 110 in real time.
The terminal server 110 may be one or more smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. The voice capture device 120 may have a client installed therein, where the client may be an application program provided by the service provider to the user, or may be a web page provided by the service provider to the user. The speech acquisition device 120 may include a network communication unit, a processor, and memory, among others. The voice collecting device 120 can establish a communication connection with the terminal server 110 through a wireless or wired network. The wired connection mode can be a USB (universal serial bus), a serial port connection mode or a 422 interface connection mode, and the wireless connection mode can be a wireless local area network, a Bluetooth and/or a near field communication mode, and the like.
In the embodiment of the present invention, the client may be any client providing a service for a user. For example, the client may be a star observation type client, a payment type application client, a recruitment type client, a shopping type client, and the like.
An embodiment of the present specification provides a speech recognition method applicable to a noise environment, fig. 3 is a flowchart illustrating a first speech recognition method applicable to a noise environment according to an embodiment of the present invention, and as shown in fig. 3, an execution subject of the method may be a terminal server, and the method includes:
s102, receiving the to-be-processed noisy dialogue voice uploaded by the voice acquisition equipment.
In a specific implementation process, the voice acquisition device may be a device that collects the user's voice; in one application scenario of the embodiments of this specification, it may be a voice-capturing microphone worn by a takeaway delivery rider.
And S104, performing voice enhancement processing on the to-be-processed noisy conversational voice to extract voice features in the to-be-processed noisy conversational voice.
In a specific implementation process, the speech enhancement processing of the to-be-processed noisy conversational speech may take the form of noise-reduction processing: the noise in the to-be-processed noisy conversational speech is filtered out to obtain the user's coherent sound source.
And S106, searching the target voice recognition parameter set value of the to-be-processed noisy dialogue voice from the voice recognition parameter set according to the voice characteristics.
In a specific implementation process, fig. 4 is a schematic flow chart of a fourth speech recognition method applicable to a noise environment according to an embodiment of the present invention, and as shown in fig. 4, the method for constructing the speech recognition parameter value set includes:
s202, acquiring a plurality of groups of noisy dialogue voices, a voice recognition parameter value set and sentence texts corresponding to the noisy dialogue voices in different noise environments;
s204, clustering the multiple groups of noisy conversational voices in different noise environments, and distributing an initial voice recognition parameter value for each type of noisy conversational voice from the voice recognition parameter value set;
s206, performing voice recognition on each type of noisy dialogue voice by using the initial voice recognition parameter value, comparing a recognition result with sentence texts corresponding to the noisy dialogue voices, and adjusting each initial parameter candidate value according to the comparison result;
and S208, obtaining the voice recognition parameter value set according to the adjusted initial parameter candidate value.
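Steps S202 to S208 can be sketched as a small routine that clusters the noisy clips, scores each candidate parameter value by comparing recognition output against the reference sentence text, and keeps the best candidate per cluster. The interfaces `cluster_of` and `recognize` are hypothetical stand-ins for the clustering step and the ASR engine, and the exhaustive scoring loop simplifies the patent's iterative adjustment:

```python
def build_parameter_value_set(noisy_clips, reference_texts, candidates,
                              cluster_of, recognize):
    """Sketch of S202-S208: group clips by noise cluster, then keep the
    candidate parameter value whose recognition results best match the
    reference texts for that cluster."""
    clusters = {}
    for clip, text in zip(noisy_clips, reference_texts):
        clusters.setdefault(cluster_of(clip), []).append((clip, text))

    best = {}
    for cid, items in clusters.items():
        def accuracy(param):
            # fraction of clips whose recognition matches the text
            hits = sum(recognize(clip, param) == text for clip, text in items)
            return hits / len(items)
        best[cid] = max(candidates, key=accuracy)
    return best
```

Sentence-level exact match is used here for brevity; the patent compares recognition results at the character level (see the CER-based fitness later in the description).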
In a specific implementation, the speech recognition parameter value set may be constructed as follows. In (1), the parameter set is denoted by ξ_i, where i is an index distinguishing different candidate values, ξ represents an element of the parameter set, and N_ele is the number of such elements:

ξ_i = [ξ_i,1, ξ_i,2, …, ξ_i,N_ele]^T (1)
Assuming that the noise conditions are classified into N classes, the initial speech recognition parameter values are adjusted as follows:
step 1) generating candidate initial speech recognition parameter values.
And 2) clustering the voice recognition parameter value set into subsets, and distributing initial voice recognition parameter values for the clusters.
The speech recognition parameter value set (data set) is composed of a speech signal captured by a microphone and correct sentence text. When an automatic speech recognition system of a speech acquisition device (helmet) requests to perform speech recognition, the best parameter setting value among the candidates will be automatically selected by measuring the distance between the noise and the cluster.
The speech recognition parameter value set is adjusted by clustering the groups of noisy conversational speech recorded in different noise environments and assigning an initial speech recognition parameter value to each cluster. The groups of noisy conversational speech are clustered in a feature space: a filter bank G_FB is applied to the observed spectrum and the time average in (2) is taken to obtain the clustering feature, where T, Ω_DFT and Ω_FB denote the number of frames, the frame length, and the number of filter-bank channels, respectively.

[Equation (2), rendered as an image in the original document: the time-averaged G_FB filter-bank feature of the observed signal]
The centroid is determined by a local search to maximize the accuracy of the automatic speech recognition computed by all clusters.
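The clustering feature and the centroid-based assignment can be sketched as follows. The filter-bank matrix, log compression, and Euclidean nearest-centroid rule are illustrative assumptions; the document only specifies a filter-bank time average and a distance to the cluster:

```python
import numpy as np

def clustering_feature(frames_power, fbank):
    """Time-averaged filter-bank feature for clustering noise
    environments: `fbank` (Omega_FB x Omega_DFT) maps each frame's power
    spectrum to Omega_FB channels; the result is log-compressed and
    averaged over the T frames, loosely following (2)."""
    return np.log(fbank @ frames_power.T + 1e-10).mean(axis=1)

def nearest_cluster(feature, centroids):
    """Pick the cluster whose centroid is closest in the feature space."""
    dists = np.linalg.norm(centroids - feature, axis=1)
    return int(np.argmin(dists))

# Toy example: identity "filter bank" over 2 bins, T=2 frames
fbank = np.eye(2)
frames = np.array([[1.0, 4.0], [1.0, 4.0]])
feat = clustering_feature(frames, fbank)
cid = nearest_cluster(feat, np.array([[0.0, 1.4], [5.0, 5.0]]))  # -> 0
```

At recognition time, `nearest_cluster` is what selects which pre-generated parameter setting value is dispatched to the input speech signal.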
Searching the speech recognition parameter set for the target parameter setting value of the to-be-processed noisy conversational speech according to the speech features may be performed by a genetic algorithm. Specifically, an arrangement for generating speech recognition parameter value set candidates (data sets) is shown in fig. 1. For the noisy conversational speech in the different noise environments, a corresponding number L of candidate parameter setting values {ξ_il}, il ∈ L, are prepared, where L represents an index set of size L. The search for the l-th parameter candidate may be represented as (3):

ξ̂_l = argmax_{ξ_il, il ∈ L} J_l(ξ_il) (3)

Here, J_l is an objective function corresponding to the l-th subset of the data set, and it should be positively correlated with automatic speech recognition accuracy. Since the entire data set, covering the various environments, is clustered, each subset contains the sound signals observed in one class of environments; the parameter setting value found by the search in (3) is therefore specialized to each environment.
The genetic algorithm is then used to derive a near-optimal value ξ for each of the L sets of noise conditions. The genetic algorithm is one of the most commonly used meta-heuristics and can solve multimodal optimization problems, since it combines global and local search. A real-coded genetic algorithm is used here because the elements ξ are continuous real numbers.
In genetic algorithms, the objective function is called the fitness. Since we search for parameter setting values that improve automatic speech recognition accuracy, the fitness is a function of the character error rate of automatic speech recognition, denoted J_CER,l in (4).

[Equation (4), rendered as an image in the original document: the CER-based fitness J_CER,l]

Here, power-law scaling is employed, where i_gen denotes a real-valued exponent. In the first stage of the generations, the signal-to-distortion ratio is used as the fitness instead; in this case the fitness is set as in (5), and the data set must then also include an interference-free signal, besides the speech signal and the correct sentence text, as a reference for computing the signal-to-distortion ratio.

[Equation (5), rendered as an image in the original document: the SDR-based fitness J_SDR,l]
At the beginning of the search, the maximum (ξ_max) and minimum (ξ_min) of the possible parameter setting values are given, together with M initial parameter settings. To generate M offspring in one generation, M "pairs" of individuals are selected from the U existing individuals as parents. The number of possible pairs equals the number of 2-combinations of the U elements, C(U, 2), and since pairs are selected with repetition, the number of possible sets of M parent pairs equals C(U, 2)^M. Random selection is used so that individuals with higher fitness are chosen with higher probability; the probability of selecting parameter setting value i is given by (6):

P_i = J_i / Σ_{j=1}^{U} J_j (6)
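The fitness-proportional selection described above can be sketched as a roulette wheel: each individual is drawn with probability proportional to its fitness. The cumulative-sum implementation below is a standard realization of this rule, not the patent's exact code:

```python
import random

def select_parent_index(fitness, rng):
    """Fitness-proportional (roulette-wheel) selection: index i is drawn
    with probability fitness[i] / sum(fitness), mirroring (6)."""
    r = rng.uniform(0.0, sum(fitness))
    acc = 0.0
    for i, f in enumerate(fitness):
        acc += f
        if r <= acc:
            return i
    return len(fitness) - 1  # guard against floating-point round-off

# Empirically, the individual with 3x the fitness is picked ~3x as often
rng = random.Random(0)
picks = [select_parent_index([1.0, 3.0], rng) for _ in range(1000)]
share_of_fitter = picks.count(1) / len(picks)  # roughly 0.75
```

Selecting M parent pairs for a generation amounts to calling this twice per pair, re-drawing when both calls return the same individual.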
When the generation number transitions from the stage using (5) to the stage using (4), the selection differs from the random selection described above: the individual that produced the highest signal-to-distortion ratio in each generation is selected. Selecting individuals generated earlier in this way preserves diversity better than random selection, because parameter settings that yield a high signal-to-distortion ratio do not necessarily guarantee high automatic speech recognition accuracy.
BLX-α, which incorporates both crossover and mutation, is used to generate offspring. Two parent individuals are selected by random selection; their n-th elements are denoted ξ_i,n and ξ_j,n respectively, and the offspring element lies in an interval [a, b], where a and b are given by (7) and (8), and α is a coefficient defined in BLX-α that extends the interval.

a = min(ξ_i,n, ξ_j,n) − α{max(ξ_i,n, ξ_j,n) − min(ξ_i,n, ξ_j,n)} (7)
b = max(ξ_i,n, ξ_j,n) + α{max(ξ_i,n, ξ_j,n) − min(ξ_i,n, ξ_j,n)} (8)

Each offspring element ξ_k,n is sampled from the truncated normal distribution in (9), not from the uniform distribution in (10) used in the original BLX-α.

ξ_k,n ~ N(ξ_i,n, σ², a, b) (9)
ξ_k,n ~ U(a, b) (10)

The uniform and truncated normal distributions on the interval [a, b] are written U(a, b) and N(μ, σ², a, b) respectively, where μ denotes the mean and σ² the variance, given by (11).

σ² = β{max(ξ_i,n, ξ_j,n) − min(ξ_i,n, ξ_j,n)} (11)
β in (11) is set so as to decrease as the number of generations increases. Thus, the search process tends to transition from global to local.
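The BLX-α offspring generation with truncated-normal sampling can be sketched as follows. Rejection sampling stands in for a true truncated-normal draw, and the default α and β values are illustrative assumptions (β would shrink over generations per the text above):

```python
import random

def blx_alpha_truncnorm(xi_i, xi_j, alpha=0.5, beta=0.35, rng=None):
    """BLX-alpha crossover producing one offspring element per gene,
    sampled from a normal distribution truncated to [a, b] as in
    (7)-(9), with variance sigma^2 = beta * (max - min) per (11)."""
    rng = rng or random.Random()
    child = []
    for xi_in, xi_jn in zip(xi_i, xi_j):
        lo, hi = min(xi_in, xi_jn), max(xi_in, xi_jn)
        span = hi - lo
        a = lo - alpha * span          # (7)
        b = hi + alpha * span          # (8)
        sigma = (beta * span) ** 0.5   # from (11)
        while True:                    # truncate to [a, b] by rejection
            sample = rng.gauss(xi_in, sigma)  # mean is the parent gene (9)
            if a <= sample <= b:
                child.append(sample)
                break
    return child
```

Because the mean ξ_i,n always lies inside [a, b], the rejection loop terminates quickly; shrinking `beta` with the generation count concentrates offspring near the parent, shifting the search from global to local as the text describes.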
And S108, sending the target voice recognition parameter set value to the voice acquisition equipment so that the voice acquisition equipment performs voice recognition on the received voice data according to the target voice recognition parameter set value.
In a specific implementation process, after a target voice recognition parameter set value is determined at a server end, the target voice recognition parameter set value can be sent to corresponding voice acquisition equipment, so that the voice acquisition equipment performs voice recognition on received voice data according to the target voice recognition parameter set value.
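The server-side flow of S102 through S108 can be tied together in a short orchestration sketch. All five injected interfaces (`enhance`, `extract_feature`, `centroids`, `param_table`, `send`) are hypothetical stand-ins for the components described above:

```python
def handle_noisy_speech(raw_audio, enhance, extract_feature,
                        centroids, param_table, send):
    """Sketch of S102-S108 on the terminal server: enhance the uploaded
    noisy speech (S104), extract its feature, look up the parameter
    setting value for the nearest noise cluster (S106), and send it
    back to the voice acquisition device (S108)."""
    enhanced = enhance(raw_audio)                  # S104
    feature = extract_feature(enhanced)
    cluster = min(param_table,                     # S106: nearest centroid
                  key=lambda cid: abs(centroids[cid] - feature))
    send(param_table[cluster])                     # S108
    return param_table[cluster]

# Toy scalar-feature example
sent = []
result = handle_noisy_speech(
    [9.0, 11.0],
    enhance=lambda x: x,
    extract_feature=lambda x: sum(x) / len(x),
    centroids={0: 0.0, 1: 10.0},
    param_table={0: "quiet-profile", 1: "loud-profile"},
    send=sent.append,
)
```

A scalar feature keeps the example short; the actual feature is the filter-bank vector described earlier, with a vector distance in place of `abs`.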
The invention provides a novel automatic speech recognition system that aims to achieve high-accuracy recognition of speech input through the microphone of a takeaway smart helmet in a variety of noisy environments. In this system, the parameter settings of the front-end speech enhancement are tuned by algorithmic optimization rather than empirically, as in prior conventional methods. Appropriate parameter setting values are generated in advance for each noise environment, and the optimal value is selected automatically according to the current noise environment. A real-coded genetic algorithm searches for the parameter settings that maximize automatic speech recognition accuracy. The following design choices improve the efficiency of the search process: 1) in earlier generations, fitness is defined as a function of the signal-to-distortion ratio (SDR); 2) offspring are generated using a truncated normal distribution.
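The two-stage fitness schedule in point 1) can be sketched as follows, under the assumption that an SDR scorer and an ASR accuracy scorer are available as callables; all names here are illustrative, not the patent's API.

```python
def make_fitness(switch_generation, sdr_fn, asr_accuracy_fn):
    """Build a generation-dependent fitness function.

    Early generations are scored with the cheap signal-to-distortion
    ratio; later generations with actual ASR accuracy, which is the
    true objective but is expensive to evaluate.
    """
    def fitness(individual, generation):
        if generation < switch_generation:
            return sdr_fn(individual)        # global, cheap search phase
        return asr_accuracy_fn(individual)   # local, accurate refinement
    return fitness
```

The switch point corresponds to the transition from (5) to (4) described earlier, at which the highest-SDR individual is carried over.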
On the basis of the foregoing embodiment, in an embodiment of this specification, fig. 5 is a flowchart illustrating a fifth speech recognition method applied in a noise environment according to an embodiment of the present invention, and as shown in fig. 5, the performing speech enhancement processing on the to-be-processed noisy speech includes:
S302, carrying out voice enhancement processing on the to-be-processed noisy dialogue voice by utilizing a multi-channel wiener filter.
In a specific implementation process, a multi-channel wiener filter can be used for performing voice enhancement processing on the to-be-processed noisy conversational voice, and filtering an incoherent sound source in the noisy conversational voice to obtain a coherent sound source of a user.
On the basis of the foregoing embodiment, in an embodiment of this specification, fig. 6 is a flowchart illustrating a sixth speech recognition method applicable to a noisy environment according to an embodiment of the present invention, and as shown in fig. 6, the noisy speech includes: coherent and incoherent sound sources;
the voice enhancement processing is carried out on the to-be-processed noisy dialogue voice by utilizing the multi-channel wiener filter, and the method comprises the following steps:
S402, determining a transfer function of the coherent sound source and a transfer function of the incoherent sound source according to the noisy conversational speech;
S404, processing the transfer function of the coherent sound source and the transfer function of the incoherent sound source by using a beamformer to obtain the power spectral density of the coherent sound source and the power spectral density of the residual noise;
S406, processing the power spectral density of the coherent sound source and the power spectral density of the residual noise by using a wiener post-filter to obtain the power spectral density of the background noise;
S408, acquiring corresponding voice characteristics according to the power spectral density of the background noise.
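The steps S402-S408 above can be sketched as a four-stage composition. The stage functions and their signatures below are assumptions for illustration, not the patent's API; the caller supplies each stage.

```python
def make_enhancement_pipeline(estimate_transfer_fns, beamform,
                              wiener_post_filter, extract_features):
    """Compose the S402-S408 chain into a single enhancement function."""
    def enhance(noisy_speech):
        h_coherent, h_incoherent = estimate_transfer_fns(noisy_speech)   # S402
        psd_source, psd_residual = beamform(h_coherent, h_incoherent,
                                            noisy_speech)                # S404
        psd_background = wiener_post_filter(psd_source, psd_residual)    # S406
        return extract_features(psd_background)                          # S408
    return enhance
```

Structuring the chain this way lets each stage (transfer-function estimation, beamforming, post-filtering, feature extraction) be swapped or re-parameterized independently, which is what the parameter-set search operates on.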
In a specific implementation process, the multi-channel wiener filter is the most important block, and the configuration of the multi-channel wiener filter is shown in fig. 2, where fig. 2 is a schematic diagram of a multi-channel wiener filter according to an embodiment of the present invention.
Let us assume that the delivery rider's voice arrives at the microphones from known directions, where the coherent sound sources are defined as (12), the first element of which is the user's voice. (In the following, Y1 and Y denote Y1(ω, τ) and Y(ω, τ), respectively.)
s(ω,τ)=[S1(ω,τ)S2(ω,τ)...SQ(ω,τ)]T (12)
The superscript T denotes transposition; ω and τ denote the frequency bin and time frame, respectively.
The signals received by the microphones are denoted by (13) and (14), where hq(ω) is the transfer function from the q-th coherent sound source to the microphones and w(ω, τ) is the incoherent background noise.
x(ω,τ)=H(ω)s(ω,τ)+w(ω,τ) (13)
H(ω)=[h1(ω)h2(ω)…hQ(ω)] (14)
A minimum variance distortionless response (MVDR) beamformer GBF(ω) is used to generate the direction signals, as shown in (15)-(17), where w̃(ω, τ) is the background noise in the beamformer output.
Y(ω,τ) = GBF^H(ω) x(ω,τ) (15)
GBF(ω)=[gBF1(ω) gBF2(ω) … gBFR(ω)] (16)
w̃(ω,τ) = GBF^H(ω) w(ω,τ) (17)
The superscript H denotes the Hermitian transpose. Y1(ω, τ), the first element of Y(ω, τ), i.e. the signal in the target direction, is regarded as the sum of the target component S1(ω, τ) and a residual noise component V(ω, τ), as in (18) and (19), under the distortionless constraint (20) of the MVDR beamformer.
Y1(ω,τ)=S1(ω,τ)+V(ω,τ) (18)
V(ω,τ) = Σq=2…Q gBF1^H(ω) hq(ω) Sq(ω,τ) + gBF1^H(ω) w(ω,τ) (19)
gBF1^H(ω) h1(ω) = 1 (20)
The power spectral densities of the coherent sound sources s(ω, τ) are collected in φS(ω, τ), defined by (21), while the power spectral density of the residual noise V(ω, τ) is denoted φV(ω, τ).
φS(ω,τ) = [φS1(ω,τ) φS2(ω,τ) … φSQ(ω,τ)]^T, where φSq(ω,τ) = E[|Sq(ω,τ)|²] (21)
In order to reduce the residual noise component V(ω, τ), a Wiener post-filter GWiener(ω, τ) is applied to Y1(ω, τ), as shown in (22) and (23).
Z(ω,τ) = GWiener(ω,τ) Y1(ω,τ) (22)
GWiener(ω,τ) = φS1(ω,τ) / (φS1(ω,τ) + φV(ω,τ)) (23)
The power spectral density vector φY(ω, τ) of the beamformer outputs is given by (24).
φY(ω,τ) = [E[|Y1(ω,τ)|²] E[|Y2(ω,τ)|²] … E[|YR(ω,τ)|²]]^T (24)
The relationship between the power spectral densities of the sound sources and those of the beamformer outputs is established by the linear model (25), where D(ω) is the directivity gain matrix of the beamformers, calculated from the beamformer gains GBF(ω) and the transfer functions H(ω), and φS+W(ω, τ) stacks the source power spectral densities and the background-noise power spectral density.
φY(ω,τ) ≈ D(ω) φS+W(ω,τ) (25)
The power spectral density of the background noise w(ω, τ) is approximated by (26), and by using (27), (25) is rearranged into (28), where the superscript "+" denotes the pseudo-inverse. [Equations (26) and (27) are not reproduced here.]
φS+W(ω,τ)=D+(ω)φY(ω,τ) (28)
By utilizing (28), φS1(ω, τ) and φV(ω, τ) in (23) can be estimated by (29) and (30), respectively, where γ is a weighting coefficient. [Equations (29) and (30) are not reproduced here.]
The power spectral density of the background noise is estimated from minimum statistics, as in (31).
φW(ω,τ) ≈ min{ φ̄Y1(ω,τ′) : τ − τint ≤ τ′ ≤ τ } (31)
Here, τint is a time interval, and the overbar denotes an exponential moving average used for smoothing. In practical applications, the Wiener post-filter is likewise reshaped by a smoothing process.
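The post-filter gain (23) and the minimum-statistics noise estimate described around (31) can be sketched numerically as follows. This is an illustrative reconstruction in Python; the function names and the `window`/`smooth` parameters are assumptions, not the patent's API.

```python
import numpy as np

def wiener_post_filter_gain(psd_target, psd_residual, floor=1e-12):
    """Wiener post-filter gain in the form of (23): ratio of the
    target-speech PSD to the total PSD, evaluated per frequency bin."""
    psd_target = np.asarray(psd_target, float)
    psd_residual = np.asarray(psd_residual, float)
    return psd_target / np.maximum(psd_target + psd_residual, floor)

def minimum_statistics_noise_psd(psd_frames, window, smooth=0.9):
    """Minimum-statistics background-noise estimate sketched from (31):
    exponentially smooth the per-frame PSD (the overbar in the text),
    then take, for each bin, the minimum over the last `window` frames
    (standing in for the interval τint)."""
    psd_frames = np.asarray(psd_frames, float)
    smoothed = np.empty_like(psd_frames)
    smoothed[0] = psd_frames[0]
    for t in range(1, len(psd_frames)):    # exponential moving average
        smoothed[t] = smooth * smoothed[t - 1] + (1.0 - smooth) * psd_frames[t]
    noise = np.empty_like(psd_frames)
    for t in range(len(psd_frames)):       # running minimum over the window
        noise[t] = smoothed[max(0, t - window + 1):t + 1].min(axis=0)
    return noise
```

Because noise power varies more slowly than speech, the per-bin minimum of the smoothed PSD tracks the noise floor even while the user is talking, which is the rationale behind the minimum-statistics approach.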
On the other hand, an embodiment of the present specification provides a speech recognition method suitable for a noise environment, fig. 7 is a flowchart illustrating a seventh speech recognition method suitable for a noise environment according to an embodiment of the present invention, as shown in fig. 7, an execution subject of an embodiment of the present specification is a speech acquisition device, including:
S502, collecting the to-be-processed noisy dialogue voice.
In a specific implementation process, the voice collecting device may collect the to-be-processed noisy conversational voice, which may include all sound sources corresponding to the collecting time, wherein the to-be-processed noisy conversational voice may include: coherent sound sources and incoherent sound sources. Coherent sound sources may be characterized as speech input by a user and incoherent sound sources may be characterized as background noise.
S504, uploading the to-be-processed noisy conversation voice to a terminal server, so that the terminal server feeds back a target voice recognition parameter set value based on the noisy conversation voice.
In a specific implementation process, the voice collecting device may be configured with a communication device for sending the collected to-be-processed noisy conversational voice to the terminal server.
S506, performing voice recognition on the noisy dialogue voice based on the received target voice recognition parameter set value.
In a specific implementation process, speech recognition is a form of pattern recognition comprising three basic units: feature extraction, pattern matching, and a reference pattern library.
On the other hand, an embodiment of the present disclosure provides a speech recognition apparatus suitable for use in a noise environment, and fig. 8 is a schematic structural diagram of the speech recognition apparatus suitable for use in a noise environment according to an embodiment of the present disclosure, as shown in fig. 8, including:
the receiving module 610 is configured to perform receiving of the to-be-processed noisy conversational speech uploaded by the speech acquisition device;
a speech feature extraction module 620 configured to perform speech enhancement processing on the to-be-processed noisy conversational speech to extract speech features in the to-be-processed noisy conversational speech;
a searching module 630 configured to perform searching for a target speech recognition parameter setting value of the to-be-processed noisy conversational speech from a speech recognition parameter value set according to the speech feature;
a sending module 640 configured to execute sending the target voice recognition parameter setting value to the voice collecting device, so that the voice collecting device performs voice recognition on the received voice data according to the target voice recognition parameter setting value.
On the basis of the above embodiments, in an embodiment of the present specification, the search module includes a parameter set setting module; the parameter set setting module includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire noisy dialogue voices, a voice recognition parameter value set and sentence texts corresponding to the noisy dialogue voices in multiple groups of different noise environments;
a clustering unit configured to perform clustering on the multiple groups of noisy conversational speech in different noise environments, and assign an initial speech recognition parameter value to each type of noisy conversational speech from the speech recognition parameter value set;
the adjusting unit is configured to perform voice recognition on each type of the noisy dialogue voice by using the initial voice recognition parameter value, compare a recognition result with a sentence text corresponding to the noisy dialogue voice, and adjust each initial parameter candidate value according to the comparison result;
and the parameter set acquisition unit is configured to acquire the voice recognition parameter value set according to the adjusted initial parameter candidate value.
On the other hand, an embodiment of the present disclosure provides a speech recognition apparatus suitable for use in a noise environment, and fig. 9 is a schematic structural diagram of another speech recognition apparatus suitable for use in a noise environment according to an embodiment of the present disclosure, as shown in fig. 9, including:
an acquisition module 810 configured to perform acquisition of a to-be-processed noisy conversational speech;
an uploading module 820 configured to perform uploading the to-be-processed noisy conversation voice to a terminal server, so that the terminal server feeds back a target voice recognition parameter setting value based on the noisy conversation voice;
a speech recognition module 830 configured to perform speech recognition on the noisy conversational speech based on the received target speech recognition parameter setting value.
In another aspect, the present specification provides a computer-readable storage medium, in which at least one instruction or at least one program is stored, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the method for speech recognition in a noise environment.
On the other hand, an embodiment of the present disclosure provides a speech recognition device suitable for use in a noisy environment, fig. 10 is a schematic structural diagram of a speech recognition device suitable for use in a noisy environment according to an embodiment of the present disclosure, as shown in fig. 10, including at least one processor and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements a method of speech recognition suitable for use in noisy environments as described above by executing the instructions stored by the memory.
Since the speech recognition apparatus, the computer-readable storage medium, and the speech recognition device suitable for use in a noise environment have the same technical effects as the speech recognition method suitable for use in a noise environment, they are not described in detail herein.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The implementation principle and the generated technical effect of the testing method provided by the embodiment of the invention are the same as those of the system embodiment, and for the sake of brief description, the corresponding contents in the system embodiment can be referred to where the method embodiment is not mentioned.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above functions, if implemented in the form of software functional units and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the above claims.

Claims (8)

1. A method for speech recognition in a noisy environment, comprising:
receiving to-be-processed noisy conversation voice uploaded by voice acquisition equipment;
carrying out voice enhancement processing on the to-be-processed noisy dialogue voice so as to extract voice characteristics in the to-be-processed noisy dialogue voice;
searching a target voice recognition parameter set value of the to-be-processed noisy dialogue voice from a voice recognition parameter set according to the voice characteristics;
sending the target voice recognition parameter set value to the voice acquisition equipment so that the voice acquisition equipment performs voice recognition on the received voice data according to the target voice recognition parameter set value;
the method for constructing the speech recognition parameter value set comprises the following steps:
acquiring a plurality of groups of noisy conversational speech under different noise environments, a speech recognition parameter value set and sentence texts corresponding to the noisy conversational speech;
clustering the multiple groups of noisy dialogue voices in different noise environments, and distributing an initial voice recognition parameter value for each type of noisy dialogue voice from the voice recognition parameter value set;
performing voice recognition on each type of the noisy dialogue voice by using the initial voice recognition parameter value, comparing a recognition result with sentence texts corresponding to the noisy dialogue voice, and adjusting each initial parameter candidate value according to the comparison result;
and obtaining the voice recognition parameter value set according to the adjusted initial parameter candidate value.
2. The method of claim 1, wherein the performing speech enhancement processing on the noisy conversational speech to be processed comprises:
and carrying out voice enhancement processing on the to-be-processed noisy dialogue voice by utilizing a multi-channel wiener filter.
3. The method of claim 2, wherein the noisy conversational speech comprises: coherent and incoherent sound sources;
the voice enhancement processing is carried out on the to-be-processed noisy dialogue voice by utilizing the multi-channel wiener filter, and the method comprises the following steps:
determining a transfer function of the coherent sound source and a transfer function of the incoherent sound source according to the noisy conversational speech;
processing the transfer function of the coherent sound source and the transfer function of the incoherent sound source by using a beam former to obtain the power spectral density of the coherent sound source and the power spectral density of the residual noise;
processing the power spectral density of the coherent sound source and the power spectral density of the residual noise by using a wiener post-filter to obtain the power spectral density of background noise;
and acquiring corresponding voice characteristics according to the power spectral density of the background noise.
4. A method for speech recognition in a noisy environment, comprising:
collecting a to-be-processed noisy conversation voice;
uploading the to-be-processed noisy conversation voice to a terminal server so that the terminal server feeds back a target voice recognition parameter set value based on the noisy conversation voice, wherein the feeding back the target voice recognition parameter set value based on the noisy conversation voice comprises: carrying out voice enhancement processing on the to-be-processed noisy dialogue voice so as to extract voice characteristics in the to-be-processed noisy dialogue voice; searching a target voice recognition parameter set value corresponding to the to-be-processed noisy dialogue voice from a voice recognition parameter set according to the voice characteristics;
performing voice recognition on the noisy dialogue voice based on the received target voice recognition parameter set value;
the method for constructing the speech recognition parameter value set comprises the following steps:
acquiring a plurality of groups of noisy conversational speech under different noise environments, a speech recognition parameter value set and sentence texts corresponding to the noisy conversational speech;
clustering the multiple groups of noisy dialogue voices in different noise environments, and distributing an initial voice recognition parameter value for each type of noisy dialogue voice from the voice recognition parameter value set;
performing voice recognition on each type of the noisy dialogue voice by using the initial voice recognition parameter value, comparing a recognition result with sentence texts corresponding to the noisy dialogue voice, and adjusting each initial parameter candidate value according to the comparison result;
and obtaining the voice recognition parameter value set according to the adjusted initial parameter candidate value.
5. A speech recognition apparatus adapted for use in noisy environments, comprising:
the receiving module is configured to execute receiving of the to-be-processed noisy conversation voice uploaded by the voice acquisition equipment;
a voice feature extraction module configured to perform voice enhancement processing on the to-be-processed noisy speech so as to extract voice features in the to-be-processed noisy speech;
a searching module configured to perform searching for a target speech recognition parameter setting value of the to-be-processed noisy conversational speech from a speech recognition parameter value set according to the speech feature;
the sending module is configured to send the target voice recognition parameter set value to the voice acquisition equipment so that the voice acquisition equipment performs voice recognition on the received voice data according to the target voice recognition parameter set value;
wherein the search module comprises a parameter set setting module; the parameter set setting module includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire noisy dialogue voices, a voice recognition parameter value set and sentence texts corresponding to the noisy dialogue voices in multiple groups of different noise environments;
a clustering unit configured to perform clustering on the multiple groups of noisy conversational speech in different noise environments, and assign an initial speech recognition parameter value to each type of noisy conversational speech from the speech recognition parameter value set;
the adjusting unit is configured to perform voice recognition on each type of the noisy dialogue voice by using the initial voice recognition parameter value, compare a recognition result with a sentence text corresponding to the noisy dialogue voice, and adjust each initial parameter candidate value according to the comparison result;
and the parameter set acquisition unit is configured to acquire the voice recognition parameter value set according to the adjusted initial parameter candidate value.
6. A speech recognition apparatus adapted for use in noisy environments, comprising:
the acquisition module is configured to acquire the to-be-processed noisy conversational speech;
an uploading module configured to upload the to-be-processed noisy conversation voice to a terminal server so that the terminal server feeds back a target voice recognition parameter setting value based on the noisy conversation voice, wherein the feeding back the target voice recognition parameter setting value based on the noisy conversation voice comprises: carrying out voice enhancement processing on the to-be-processed noisy dialogue voice so as to extract voice characteristics in the to-be-processed noisy dialogue voice; searching a target voice recognition parameter set value corresponding to the to-be-processed noisy dialogue voice from a voice recognition parameter set according to the voice characteristics;
a speech recognition module configured to perform speech recognition on the noisy conversational speech based on the received target speech recognition parameter setting value;
the method for constructing the speech recognition parameter value set comprises the following steps:
acquiring a plurality of groups of noisy conversational speech under different noise environments, a speech recognition parameter value set and sentence texts corresponding to the noisy conversational speech;
clustering the multiple groups of noisy dialogue voices in different noise environments, and distributing an initial voice recognition parameter value for each type of noisy dialogue voice from the voice recognition parameter value set;
performing voice recognition on each type of the noisy dialogue voice by using the initial voice recognition parameter value, comparing a recognition result with sentence texts corresponding to the noisy dialogue voice, and adjusting each initial parameter candidate value according to the comparison result;
and obtaining the voice recognition parameter value set according to the adjusted initial parameter candidate value.
7. A computer readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement a method of speech recognition adapted for use in noisy environments according to any of claims 1-3 or 4.
8. A speech recognition device adapted for use in noisy environments, comprising at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements a method of speech recognition adapted for use in noisy environments as claimed in any one of claims 1-3 or 4 by executing the instructions stored by the memory.
CN202011355810.0A 2020-11-27 2020-11-27 Voice recognition method and device suitable for noise environment Active CN112530453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011355810.0A CN112530453B (en) 2020-11-27 2020-11-27 Voice recognition method and device suitable for noise environment


Publications (2)

Publication Number Publication Date
CN112530453A CN112530453A (en) 2021-03-19
CN112530453B true CN112530453B (en) 2022-04-05

Family

ID=74994065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011355810.0A Active CN112530453B (en) 2020-11-27 2020-11-27 Voice recognition method and device suitable for noise environment

Country Status (1)

Country Link
CN (1) CN112530453B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023085749A1 (en) * 2021-11-09 2023-05-19 삼성전자주식회사 Electronic device for controlling beamforming and operation method thereof

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6505057B1 (en) * 1998-01-23 2003-01-07 Digisonix Llc Integrated vehicle voice enhancement system and hands-free cellular telephone system
CN104715758A (en) * 2015-02-06 2015-06-17 哈尔滨工业大学深圳研究生院 Branched processing array type speech positioning and enhancement method
CN108615535A (en) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 Sound enhancement method, device, intelligent sound equipment and computer equipment
CN108831495A (en) * 2018-06-04 2018-11-16 桂林电子科技大学 A kind of sound enhancement method applied to speech recognition under noise circumstance
CN108877807A (en) * 2018-07-04 2018-11-23 广东猪兼强互联网科技有限公司 A kind of intelligent robot for telemarketing
CN111344778A (en) * 2017-11-23 2020-06-26 哈曼国际工业有限公司 Method and system for speech enhancement
CN111681649A (en) * 2020-05-25 2020-09-18 重庆邮电大学 Speech recognition method, interactive system and score management system comprising system
CN111755010A (en) * 2020-07-07 2020-10-09 出门问问信息科技有限公司 Signal processing method and device combining voice enhancement and keyword recognition


Also Published As

Publication number Publication date
CN112530453A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
EP3479377B1 (en) Speech recognition
US10127922B2 (en) Sound source identification apparatus and sound source identification method
US9524730B2 (en) Monaural speech filter
US11798574B2 (en) Voice separation device, voice separation method, voice separation program, and voice separation system
KR20050115857A (en) System and method for speech processing using independent component analysis under stability constraints
Ganapathy et al. 3-D CNN models for far-field multi-channel speech recognition
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN102723082A (en) System and method for monaural audio processing based preserving speech information
CN112735460B (en) Beam forming method and system based on time-frequency masking value estimation
CN105580074B (en) Signal processing system and method
CN110349593A (en) The method and system of semanteme based on waveform Time-Frequency Analysis and the dual identification of vocal print
CN111798860A (en) Audio signal processing method, device, equipment and storage medium
JP2006510060A (en) Method and system for separating a plurality of acoustic signals generated by a plurality of acoustic sources
CN114041185A (en) Method and apparatus for determining a depth filter
CN112530453B (en) Voice recognition method and device suitable for noise environment
JP4703648B2 (en) Vector codebook generation method, data compression method and apparatus, and distributed speech recognition system
CN111681649B (en) Speech recognition method, interaction system and achievement management system comprising system
Girin et al. Audio source separation into the wild
WO2005029463A9 (en) A method for recovering target speech based on speech segment detection under a stationary noise
CN110310658B (en) Voice separation method based on voice signal processing
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
JP5070591B2 (en) Noise suppression device, computer program, and speech recognition system
JP5705190B2 (en) Acoustic signal enhancement apparatus, acoustic signal enhancement method, and program
CN115910037A (en) Voice signal extraction method and device, readable storage medium and electronic equipment
JP6285855B2 (en) Filter coefficient calculation apparatus, audio reproduction apparatus, filter coefficient calculation method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant