CN114242106A - Voice processing method and device - Google Patents

Voice processing method and device

Info

Publication number
CN114242106A
Authority
CN
China
Prior art keywords
signal
voice
echo
speech
sound sources
Prior art date
Legal status
Granted
Application number
CN202010942560.4A
Other languages
Chinese (zh)
Other versions
CN114242106B (en)
Inventor
褚伟
胡云卿
刘悦
林军
罗潇
Current Assignee
CRRC Zhuzhou Institute Co Ltd
Original Assignee
CRRC Zhuzhou Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by CRRC Zhuzhou Institute Co Ltd filed Critical CRRC Zhuzhou Institute Co Ltd
Priority to CN202010942560.4A
Publication of CN114242106A
Application granted
Publication of CN114242106B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 2021/02082 - Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention provides a voice processing method and a voice processing device. The voice processing method comprises the following steps: acquiring a voice signal acquired by a microphone; eliminating echo in the voice signal by using an echo elimination model to obtain an intermediate voice signal; and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.

Description

Voice processing method and device
Technical Field
The present invention relates to the field of voice processing, and in particular, to a voice processing method and apparatus for a voice interactive system.
Background
The electric trolley car is a common type of public passenger transport vehicle and includes rail trolleys, light rail vehicles, trams and the like. Existing rail trolleys, light rail vehicles and trams all require dedicated tracks to operate, so infrastructure construction and vehicle acquisition costs are high.
To address this problem, CRRC Zhuzhou Institute has proposed an electric vehicle that follows a virtual track on the ground. This new type of vehicle dispenses with steel rails: it is carried on rubber tires, steered by a steering wheel, and runs along a virtual track on the ground. The ground virtual track can be laid out flexibly, since only a virtual track similar to a lane line needs to be marked on the road surface. Because the vehicle does not have to travel along a fixed rail, infrastructure cost is greatly reduced, which gives it a substantial operating advantage over trams. At the same time, this new vehicle shares right of way and runs in mixed traffic, so the transport system is flexible to organise in terms of ground lane arrangement and the like.
The cab of this new vehicle is equipped with a voice announcement system and a large-screen display system. The two systems run independently and do not interfere with each other. The voice announcement system broadcasts dispatch instructions and prompt messages. The large-screen display system shows information such as traction lockout status, vehicle information, air-conditioning state, tire pressure, battery level and fault records. The large-screen display system has a built-in microphone and loudspeaker, used for sound pickup and voice output respectively, and the displayed status information can be switched through a voice interaction system.
To ensure driving safety, the driver's attention should stay on the road, so the status information shown on the large screen is switched by voice interaction. However, because of acoustic interference from the voice announcement system and the large-screen display system, the audio picked up by the microphone contains not only the voice interaction command but also the echo of the announcement system's audio and of the large screen's audio, and may further contain air-conditioner noise in the cab.
The present invention aims to provide a voice processing method and apparatus that remove the echo and noise from the voice signal collected by the microphone.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of the present invention, there is provided a speech processing method including: acquiring a voice signal acquired by a microphone; eliminating echo in the voice signal by using an echo elimination model to obtain an intermediate voice signal; and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.
Further, cancelling the echo in the speech signal by using the echo cancellation model based on the far-end signals to obtain an intermediate speech signal includes: performing echo estimation with the echo cancellation model, based on the sound sources that produce the echo, to obtain an echo estimate of the speech signal; and subtracting the echo estimate from the speech signal to obtain the intermediate speech signal.
Further, the echo in the speech signal includes echoes of a plurality of sound sources, the echo cancellation model includes a plurality of adaptive filters respectively corresponding to the plurality of sound sources, and performing echo estimation with the echo cancellation model based on the sound sources that produce the echo to obtain an echo estimate includes: performing echo estimation on the plurality of sound sources with the plurality of adaptive filters respectively, to obtain echo estimates of the plurality of sound sources; and taking the sum of the echo estimates of the plurality of sound sources as the echo estimate of the speech signal.
Still further, the speech processing method further comprises: judging whether the speech signal includes a voice instruction signal; and performing echo estimation with the echo cancellation model based on the sound sources that produce the echo to obtain an echo estimate further comprises: updating the plurality of adaptive filters using the plurality of sound sources in response to the speech signal not including a voice instruction signal; and performing echo estimation on the plurality of sound sources with the most recently updated adaptive filters in response to the speech signal including a voice instruction signal.
Further, judging whether the speech signal includes a voice instruction signal includes: calculating a detection function, using the plurality of sound sources and the speech signal collected by the microphone,

$$\xi = \sqrt{\frac{r_{xd}^{T} R_{xx}^{-1} r_{xd}}{\sigma_d^{2}}}$$

where $r_{xd} = E[x(n)d(n)] = R_{xx}h$, x(n) is the sum of the sound sources, d(n) is the speech signal, $R_{xx} = E[x(n)x^{T}(n)]$ is the autocorrelation matrix of x(n), h is the echo path, $\sigma_d^{2}$ is the variance of the speech signal d(n), $\sigma_y^{2}$ is the variance of the echo y(n), $\sigma_s^{2}$ is the variance of the noise signal s(n), and $\sigma_v^{2}$ is the variance of the voice instruction signal v(n); judging, in response to the detection function value being greater than or equal to a preset threshold, that the speech signal does not include a voice instruction signal; and judging, in response to the detection function value being smaller than the preset threshold, that the speech signal includes a voice instruction signal.
Further, assuming that the plurality of sound sources are m sound sources and the plurality of filters are m filters corresponding to the m sound sources, with m > 1, updating the plurality of adaptive filters using the plurality of sound sources includes: updating the i-th adaptive filter of the plurality of adaptive filters with the parameter update formula

$$\omega_i(n+1) = \omega_i(n) + \frac{\mu\, e(n)\, x_i(n)}{x^{T}(n)x(n) + \alpha}$$

where $e(n) = d(n) - \sum_{j=1}^{m}\omega_j^{T}(n)x_j(n)$, d(n) is the speech signal, $x(n) = \sum_{i=1}^{m} x_i(n)$ is the sum of the sound source signals of the m sound sources, $x_i(n)$ is the sound source signal of the i-th of the m sound sources, L is the filter length, μ is a step factor with 0 < μ < 2, and α is a protection coefficient.
Further, the deep neural network model comprises an input layer, a hidden layer and an output layer, and the removing noise in the intermediate speech signal by using the deep neural network model to obtain the speech instruction signal in the speech signal comprises: and inputting the intermediate voice signal as input voice to an input layer of the deep neural network model to obtain an output signal of the output layer as the voice instruction signal.
Still further, the speech processing method further comprises: constructing the deep neural network model, in which the output function of the i-th neuron of any layer l among the hidden layer and the output layer is

$$z_i^{l} = \sum_{j=1}^{M_{l-1}} w_{ij}^{l}\, a_j^{l-1} + b_i^{l}$$

where $w_{ij}^{l}$ is the weight parameter connecting the j-th neuron of layer l-1 with the i-th neuron of layer l, $a_j^{l-1} = f(z_j^{l-1})$ is the activation function value of the j-th neuron of layer l-1, f(x) is a Sigmoid function, $b_i^{l}$ is the bias parameter of the i-th neuron of layer l, and $M_{l-1}$ is the number of neurons in layer l-1; the output function value of the i-th neuron of the input layer is the i-th input speech value of the deep neural network model, and the activation function value of the i-th neuron of the input layer equals its output function value; and training the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model.
Still further, the training the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model comprises: collecting a pure voice command signal and a noise signal of an actual application environment; mixing the pure voice instruction signal with the noise signal to obtain a voice instruction signal with noise, wherein the pure voice instruction is a label value of the voice instruction signal with noise; inputting the voice instruction signal with noise as input voice to an input layer of the deep neural network model to obtain a predicted voice instruction signal which is output by the output layer and corresponds to the voice instruction signal with noise; and comparing the label value of the noisy speech instruction signal with the corresponding predicted speech instruction signal to update each weight parameter and each bias parameter of the deep neural network model.
Further, the comparing the tag value of the noisy speech instruction signal with the corresponding predicted speech instruction signal to update each weight parameter and each bias parameter of the deep neural network model comprises: determining a cost function value of a predicted voice instruction signal corresponding to the voice instruction signal with noise relative to a tag value thereof by adopting a mean square error algorithm; and continuously updating each weight parameter and each bias parameter of the deep neural network model by using a back propagation process based on the cost function value and adopting a stochastic gradient descent algorithm.
According to another aspect of the present invention, there is also provided a speech processing apparatus including: a memory for storing a computer program; and a processor coupled to the memory for executing the computer program on the memory, the processor configured to: acquiring a voice signal acquired by a microphone; eliminating echo in the voice signal by using an echo elimination model of a far-end signal to obtain an intermediate voice signal; and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.
Still further, the processor is further configured to: performing echo estimation on the sound source based on the echo by using the echo cancellation model to obtain an echo estimation value of the voice signal; and subtracting the echo estimate from the speech signal to obtain the intermediate speech signal.
Still further, the echo in the speech signal comprises echoes of a plurality of sound sources, the echo cancellation model comprises a plurality of adaptive filters corresponding to the plurality of sound sources, respectively, and the processor is further configured to: respectively carrying out echo estimation on the plurality of sound sources by adopting the plurality of adaptive filters to respectively obtain echo estimation values of the plurality of sound sources; and finding a sum of the echo estimation values of the plurality of sound sources as an echo estimation value of the voice signal.
Still further, the processor is further configured to: judging whether the voice signal comprises a voice instruction signal or not; updating the plurality of adaptive filters with the plurality of sound sources in response to not including a voice instruction signal in the voice signal; and performing echo estimation on the plurality of sound sources by adopting a plurality of adaptive filters which are updated recently in response to the voice command signals included in the voice signals.
Still further, the processor is further configured to: calculate a detection function using the plurality of sound sources and the speech signal collected by the microphone,

$$\xi = \sqrt{\frac{r_{xd}^{T} R_{xx}^{-1} r_{xd}}{\sigma_d^{2}}}$$

where $r_{xd} = E[x(n)d(n)] = R_{xx}h$, x(n) is the sum of the sound sources, d(n) is the speech signal, $R_{xx} = E[x(n)x^{T}(n)]$ is the autocorrelation matrix of x(n), h is the echo path, $\sigma_d^{2}$ is the variance of the speech signal d(n), $\sigma_y^{2}$ is the variance of the echo y(n), $\sigma_s^{2}$ is the variance of the noise signal s(n), and $\sigma_v^{2}$ is the variance of the voice command signal v(n); determine, in response to the detection function value being greater than or equal to a preset threshold, that the speech signal does not include a voice command signal; and determine, in response to the detection function value being smaller than the preset threshold, that the speech signal includes a voice command signal.
Further, assuming that the plurality of sound sources are m sound sources and the plurality of filters are m filters corresponding to the m sound sources, with m > 1, the processor is further configured to: update the i-th adaptive filter of the plurality of adaptive filters with the parameter update formula

$$\omega_i(n+1) = \omega_i(n) + \frac{\mu\, e(n)\, x_i(n)}{x^{T}(n)x(n) + \alpha}$$

where $e(n) = d(n) - \sum_{j=1}^{m}\omega_j^{T}(n)x_j(n)$, d(n) is the speech signal, $x(n) = \sum_{i=1}^{m} x_i(n)$ is the sum of the sound source signals of the m sound sources, $x_i(n)$ is the sound source signal of the i-th of the m sound sources, L is the filter length, μ is a step factor with 0 < μ < 2, and α is a protection coefficient.
Still further, the deep neural network model includes an input layer, a hidden layer, and an output layer, the processor further configured to: and inputting the intermediate voice signal as input voice to an input layer of the deep neural network model to obtain an output signal of the output layer as the voice instruction signal.
Still further, the processor is further configured to: construct the deep neural network model, in which the output function of the i-th neuron of any layer l among the hidden layer and the output layer is

$$z_i^{l} = \sum_{j=1}^{M_{l-1}} w_{ij}^{l}\, a_j^{l-1} + b_i^{l}$$

where $w_{ij}^{l}$ is the weight parameter connecting the j-th neuron of layer l-1 with the i-th neuron of layer l, $a_j^{l-1} = f(z_j^{l-1})$ is the activation function value of the j-th neuron of layer l-1, f(x) is a Sigmoid function, $b_i^{l}$ is the bias parameter of the i-th neuron of layer l, and $M_{l-1}$ is the number of neurons in layer l-1; the output function value of the i-th neuron of the input layer is the i-th input speech value of the deep neural network model, and the activation function value of the i-th neuron of the input layer equals its output function value; and train the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model.
Still further, the processor is further configured to: collecting a pure voice command signal and a noise signal of an actual application environment; mixing the pure voice instruction signal with the noise signal to obtain a voice instruction signal with noise, wherein the pure voice instruction is a label value of the voice instruction signal with noise; inputting the voice instruction signal with noise as input voice to an input layer of the deep neural network model to obtain a predicted voice instruction signal which is output by the output layer and corresponds to the voice instruction signal with noise; and comparing the label value of the noisy speech instruction signal with the corresponding predicted speech instruction signal to update each weight parameter and each bias parameter of the deep neural network model.
Still further, the processor is further configured to: determining a cost function value of a predicted voice instruction signal corresponding to the voice instruction signal with noise relative to a tag value thereof by adopting a mean square error algorithm; and continuously updating each weight parameter and each bias parameter of the deep neural network model by using a back propagation process based on the cost function value and adopting a stochastic gradient descent algorithm.
According to yet another aspect of the present invention, there is also provided a computer storage medium having a computer program stored thereon, the computer program when executed implementing the steps of the speech processing method of any of the above.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings.
FIG. 1 is a flow diagram illustrating a method of speech processing in one embodiment according to one aspect of the present invention;
FIG. 2 is a schematic diagram of voice interaction of a cab of a rail transit system, depicted in accordance with an aspect of the present invention;
FIG. 3 is a partial flow diagram of a speech processing method in one embodiment according to one aspect of the present invention;
FIG. 4 is a partial flow diagram of a speech processing method in one embodiment according to one aspect of the present invention;
FIG. 5 is a partial flow diagram of a speech processing method in one embodiment according to an aspect of the present invention;
FIG. 6 is a partial flow diagram of a method of speech processing in one embodiment according to one aspect of the present invention;
FIG. 7 is a partial flow diagram of a method of speech processing in one embodiment according to an aspect of the present invention;
FIG. 8 is a partial flow diagram of a method of speech processing in one embodiment according to an aspect of the present invention;
FIG. 9 is a partial flow diagram of a speech processing method in one embodiment according to an aspect of the present invention;
FIG. 10 is a block diagram of a speech processing apparatus according to another aspect of the present invention.
Detailed Description
The following description is presented to enable any person skilled in the art to make and use the invention and is incorporated in the context of a particular application. Various modifications, as well as various uses in different applications will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the practice of the invention may not necessarily be limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
It is noted that, where used, the terms "further," "preferably," "still further" and "more preferably" briefly introduce an alternative embodiment built on the preceding embodiment, and the content following "further," "preferably," "still further" or "more preferably" is combined with the preceding embodiment to form a complete alternative embodiment. Several "further," "preferred," "still further" or "more preferred" statements following the same embodiment may be combined in any combination to form a further embodiment.
The invention is described in detail below with reference to the figures and specific embodiments. It is noted that the aspects described below in connection with the figures and the specific embodiments are only exemplary and should not be construed as imposing any limitation on the scope of the present invention.
According to one aspect of the invention, a voice processing method is provided, which can be used for processing input voice instructions of a voice interaction system.
A voice interaction system is a system that acquires a voice instruction entered by a user and generates a corresponding interaction, for example "Siri" on an Apple phone, an intelligent robot, or a smart home device. An ordinary voice interaction system does not face a complex acoustic environment: the background sound may contain only ambient noise, so the voice instruction issued by the user can be obtained after the ambient noise is removed. In the cab of the rail transit system, however, the large-screen display system and the voice announcement system frequently emit voice messages, which propagate to the microphone and are collected by the voice interaction system together with the voice instructions spoken by the user, and ambient noise such as air-conditioner noise is mixed in as well. Processing the voice signal collected by the cab's voice interaction system is therefore more complex than for an ordinary voice interaction system.
In one embodiment, as shown in FIG. 1, the speech processing method 100 includes steps S110 to S130.
Wherein, step S110 is: and acquiring a voice signal collected by a microphone.
The voice signal refers to a mixed sound collected by a microphone, and specifically, the meaning of the voice signal is described by taking a sound propagation process of a cab of the rail transit system shown in fig. 2 as an example.
As shown in FIG. 2, the large-screen display system plays its sound source $x_1(n)$ in the cab through one loudspeaker, and the voice announcement system plays its sound source $x_2(n)$ in the cab through another loudspeaker. The sounds played by the two loudspeakers travel along the echo propagation paths $h_1$ and $h_2$ respectively and form a mixed echo y(n) when they reach the near end of the microphone; ambient noise such as air-conditioner noise forms a noise signal s(n) at the near end of the microphone; and the voice command actually spoken by the user forms a voice command signal v(n) at the near end of the microphone. It will be understood that the three sound signals, the echo y(n), the noise signal s(n) and the voice command signal v(n), do not necessarily exist at the same time: the echo y(n) and the noise signal s(n) occur somewhat randomly, and the voice command signal v(n) exists only when the user speaks. Therefore, when the microphone picks up sound, the collected speech signal d(n) may be any combination of the echo y(n), the noise signal s(n) and the voice command signal v(n).
Since the goal of the speech processing method 100 is to extract an accurate voice command signal v(n), whenever a voice command signal v(n) is present in the collected speech signal d(n) it is assumed by default that the other two components, the echo y(n) and the noise signal s(n), are also present in d(n); the collected speech signal d(n) is then processed to remove the echo y(n) and the noise signal s(n) regardless of whether they are actually present, leaving the voice command signal v(n). Those skilled in the art will understand that, although the speech processing assumes by default that the microphone signal is a mixture of the three sounds, the actually collected speech signal d(n) is not necessarily such a mixture; even when it is not, neither the speech processing procedure nor its final result is affected.
Although the present invention has been described with reference to the cab of fig. 2 as an example for the purpose of voice signal processing, it can be understood by those skilled in the art that the application environment to which the voice processing method 100 is applied is not limited to the existence of echoes of two sound sources, and the echoes collected by the microphone may include echoes of multiple sound sources and multiple environmental noises.
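For illustration only, the signal model described above (two loudspeaker sources, echo paths, air-conditioner noise and an optional voice command mixing at the microphone) can be written as a short NumPy sketch; the sample rate, echo-path length and the synthetic signals are assumptions introduced for this example and are not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                      # assumed sample rate
N = 5 * fs                      # five seconds of signal

# Loudspeaker source signals: large-screen system x1 and announcement system x2
x1 = rng.standard_normal(N)
x2 = rng.standard_normal(N)

# Echo propagation paths h1, h2 (unknown impulse responses, modelled here as decaying random FIRs)
L = 256                         # assumed echo-path length
h1 = rng.standard_normal(L) * np.exp(-np.arange(L) / 50.0)
h2 = rng.standard_normal(L) * np.exp(-np.arange(L) / 50.0)

# Mixed echo y(n) at the near end of the microphone
y = np.convolve(x1, h1, mode="full")[:N] + np.convolve(x2, h2, mode="full")[:N]

s = 0.05 * rng.standard_normal(N)   # air-conditioner-like noise s(n)
v = np.zeros(N)                     # voice command v(n); zero while the user is silent

# Speech signal actually collected by the microphone
d = y + s + v
```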
Step S120 is: cancelling the echo in the speech signal using an echo cancellation model to obtain an intermediate speech signal.
The echo cancellation model is a model that uses the sound sources that generate the echo to estimate the echo picked up by the microphone after those sources are played through the loudspeakers. The echo cancellation model can therefore be used to obtain an estimate $\hat{y}(n)$ of the echo y(n), and $\hat{y}(n)$ is then used to remove the echo y(n) from the speech signal d(n).
Further, as shown in FIG. 3, step S120 can be embodied as steps S121-S122.
Step S121 is: performing echo estimation with the echo cancellation model, based on the sound sources that produce the echo, to obtain an echo estimate $\hat{y}(n)$ of the speech signal.
Preferably, the echo cancellation model may be formed by an adaptive filter. An adaptive filter refers to a filter that updates parameters and structure of the filter using an adaptive algorithm according to a change in environment. The echo cancellation model may then be constructed using a filter that does not change structure but that is updated with filter coefficients by an adaptive algorithm.
Suppose there are multiple sound sources $x_1(n)$ to $x_m(n)$ (m is an integer greater than 1); the echo cancellation model then includes corresponding adaptive filters $\omega_1$ to $\omega_m$. Correspondingly, as shown in FIG. 4, step S121 can be embodied as steps S1211 to S1212.
Step S1211 is: performing echo estimation on the plurality of sound sources $x_1(n)$ to $x_m(n)$ with the plurality of adaptive filters $\omega_1$ to $\omega_m$ respectively, to obtain the echo estimates $\hat{y}_1(n), \ldots, \hat{y}_m(n)$ of the plurality of sound sources, where the echo estimate of the i-th sound source (1 ≤ i ≤ m) is $\hat{y}_i(n) = \omega_i^{T}(n)\,x_i(n)$.
Step S1212 is: taking the sum of the echo estimates corresponding to the plurality of sound sources $x_1(n)$ to $x_m(n)$ as the echo estimate of the speech signal d(n), i.e.

$$\hat{y}(n) = \sum_{i=1}^{m} \hat{y}_i(n) = \sum_{i=1}^{m} \omega_i^{T}(n)\,x_i(n).$$

It will be appreciated that in the example of the cab of the rail transit system shown in FIG. 2 there are two sound sources $x_1(n)$ and $x_2(n)$, and hence two corresponding adaptive filters $\omega_1$ and $\omega_2$. The two adaptive filters perform echo estimation on the sound sources $x_1(n)$ and $x_2(n)$ respectively, finally yielding the echo estimate of the cab $\hat{y}(n) = \hat{y}_1(n) + \hat{y}_2(n)$.
Further, step S122 is: subtracting the echo estimate $\hat{y}(n)$ from the speech signal d(n) to obtain the intermediate speech signal d'(n), i.e. $d'(n) = d(n) - \hat{y}(n)$.
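A minimal Python sketch of steps S1211, S1212 and S122 (per-source filtering, summation of the echo estimates and subtraction from the microphone sample) is given below, assuming NumPy arrays for the filter weights and source frames; the helper names are illustrative and not part of the patent:

```python
import numpy as np

def echo_estimate(filters, source_frames):
    """Sum of per-source echo estimates: y_hat(n) = sum_i w_i^T x_i(n).

    filters       -- list of m weight vectors, each of length L
    source_frames -- list of m vectors holding the latest L samples of each source
    """
    return sum(float(w @ x) for w, x in zip(filters, source_frames))

def intermediate_sample(d_n, filters, source_frames):
    """Step S122: subtract the echo estimate from the microphone sample d(n)."""
    return d_n - echo_estimate(filters, source_frames)

# Example (illustrative): two sources, filter length 256
# filters = [np.zeros(256), np.zeros(256)]
# frames  = [x1_frame, x2_frame]          # latest 256 samples of each source
# d_prime = intermediate_sample(d_n, filters, frames)
```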
Further, in a more preferred embodiment, the filter parameters of the adaptive filters $\omega_1$ to $\omega_m$ can be continuously updated using the speech signal collected by the microphone whenever no voice command signal is present. Specifically, as shown in FIG. 5, the speech processing method 100 further includes steps S140 to S150.
Wherein, step S140 is: judging whether the voice signal d (n) collected by the microphone includes a voice command signal v (n).
In particular, a detection function ξ can be constructed from the plurality of sound sources $x_1(n)$ to $x_m(n)$ and the speech signal d(n) collected by the microphone, and the value of the detection function can be used to judge whether the speech signal d(n) includes a voice command signal v(n).
In one embodiment, the constructed detection function is as follows:

$$\xi = \sqrt{\frac{r_{xd}^{T} R_{xx}^{-1} r_{xd}}{\sigma_d^{2}}} \qquad (1)$$

where $r_{xd} = E[x(n)d(n)] = R_{xx}h$, x(n) is the sum of the plurality of sound sources $x_1(n)$ to $x_m(n)$, i.e. $x(n) = \sum_{i=1}^{m} x_i(n)$, d(n) is the speech signal collected by the microphone, $R_{xx} = E[x(n)x^{T}(n)]$ is the autocorrelation matrix of x(n), h is the echo path, $\sigma_d^{2}$ is the variance of the speech signal d(n), $\sigma_y^{2}$ is the variance of the echo y(n) and can be expressed in terms of the sound sources as $\sigma_y^{2} = h^{T}R_{xx}h$, $\sigma_s^{2}$ is the variance of the noise signal s(n), and $\sigma_v^{2}$ is the variance of the voice command signal v(n).
Substituting $r_{xd} = R_{xx}h$, $\sigma_y^{2} = h^{T}R_{xx}h$ and $\sigma_d^{2} = \sigma_y^{2} + \sigma_s^{2} + \sigma_v^{2}$ into the detection function (1), equation (1) can be transformed into:

$$\xi = \sqrt{\frac{\sigma_y^{2}}{\sigma_y^{2} + \sigma_s^{2} + \sigma_v^{2}}} \qquad (2)$$
as can be seen from equation (2), when the speech signal d (n) includes only the echo y (n), the detection function value is equal to 1, and when the speech signal d (n) includes the echo y (n), the noise signal s (n) and the speech command signal v (n), the calculated detection function value is obviously smaller than 1. Therefore, the above-constructed detection function can be used to determine whether the voice signal d (n) includes the voice command signal v (n).
Further, as shown in fig. 6, the step S140 may include steps S141 to S143.
Step S141 is: calculating the detection function ξ using the plurality of sound sources $x_1(n)$ to $x_m(n)$ and the speech signal d(n) collected by the microphone. That is, x(n) and d(n) are substituted into formula (1) or formula (2) to calculate the corresponding detection function value.
Step S142 is: and responding to the detection function value being larger than or equal to a preset threshold value, and judging that the voice signal does not comprise a voice instruction signal.
Step S143 is: and responding to the condition that the detection function value is smaller than the preset threshold value, and judging that the voice signal comprises a voice instruction signal.
The preset threshold value can be set to be slightly less than 1, and when the calculated detection function value is less than the preset threshold value, the voice signal d (n) can be judged to comprise a voice command signal v (n); when the calculated detection function value is greater than or equal to the preset threshold value, it can be determined that the voice signal d (n) does not include the voice command signal v (n).
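A minimal sketch of how the detection function of equation (1) and the threshold test of steps S141 to S143 might be computed from sample estimates is given below; the frame construction, the sample-average estimators and the threshold value 0.95 are assumptions made only for this illustration:

```python
import numpy as np

def detection_function(x, d, L=256):
    """Sample-based estimate of the detection function xi of equation (1).

    x -- summed loudspeaker source signal x(n) (1-D array)
    d -- microphone speech signal d(n) (1-D array, same length)
    L -- assumed echo-path / filter length
    """
    N = len(d)
    # Build length-L regressor vectors [x(n), x(n-1), ..., x(n-L+1)]
    X = np.stack([x[n - L + 1:n + 1][::-1] for n in range(L - 1, N)])
    dn = d[L - 1:N]
    r_xd = X.T @ dn / len(dn)              # cross-correlation vector r_xd
    R_xx = X.T @ X / len(dn)               # autocorrelation matrix R_xx
    var_d = float(np.mean(dn ** 2))        # variance of d(n), zero mean assumed
    num = float(r_xd @ np.linalg.solve(R_xx, r_xd))
    return np.sqrt(max(num, 0.0) / var_d)

# Usage (illustrative): treat xi >= 0.95 as "echo only", xi < 0.95 as "command present"
# threshold = 0.95
# has_command = detection_function(x, d) < threshold
```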
Further, step S150 is: in response to the speech signal d(n) collected by the microphone not including a voice command signal v(n), updating the plurality of adaptive filters $\omega_1$ to $\omega_m$ using the plurality of sound sources $x_1(n)$ to $x_m(n)$.
It will be appreciated that, in order to make the echo estimate approach the true echo more closely, the filter parameters of the adaptive filters $\omega_1$ to $\omega_m$ can be continuously updated based on the speech signal d(n) and the residual left after the previous filtering. Specifically, the update formula of the i-th adaptive filter $\omega_i$ can be as follows:

$$\omega_i(n+1) = \omega_i(n) + \frac{\mu\, e(n)\, x_i(n)}{x^{T}(n)x(n) + \alpha} \qquad (3)$$

where $e(n) = d(n) - \sum_{j=1}^{m}\omega_j^{T}(n)\,x_j(n)$ is the filtering residual, $x(n) = \sum_{i=1}^{m} x_i(n)$ is the sum of the sound source signals of the m sound sources, $x_i(n)$ is the sound source signal of the i-th of the m sound sources, L is the filter length, μ is a step factor with 0 < μ < 2, and α is a protection coefficient. The protection coefficient α prevents the filter from becoming unstable when the inner product $x^{T}(n)x(n)$ of the sound source signal x(n) is too small, and may be set to a small value such as 0.0001.
Step S1211 is then preferably configured to: in response to the speech signal d(n) collected by the microphone including a voice command signal v(n), performing echo estimation on the plurality of sound sources $x_1(n)$ to $x_m(n)$ with the most recently updated adaptive filters to obtain the echo estimates of the plurality of sound sources. That is, when it is detected that the speech signal includes a voice command signal, the filter parameters of the adaptive filters are not updated; instead, the adaptive filters with the parameters determined in the most recent update (the adaptive filters $\omega_1$ to $\omega_m$ produced by the most recently executed step S150) are used to perform the echo estimation.
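The gated update of step S150 together with the filter update formula (3) can be sketched as follows; the function signature, the step size and the in-place update of the weight vectors are illustrative assumptions rather than part of the patent:

```python
import numpy as np

def nlms_update(filters, source_frames, d_n, mu=0.5, alpha=1e-4):
    """One update step of formula (3) for all m adaptive filters (sketch).

    filters       -- list of m weight vectors w_i, each of length L (updated in place)
    source_frames -- list of m vectors with the latest L samples of each source x_i(n)
    d_n           -- current microphone sample d(n)
    mu            -- step factor, 0 < mu < 2
    alpha         -- protection coefficient against a too-small input power
    """
    x_sum = np.sum(source_frames, axis=0)          # x(n) = sum_i x_i(n)
    power = float(x_sum @ x_sum) + alpha           # x^T(n) x(n) + alpha
    e_n = d_n - sum(float(w @ x) for w, x in zip(filters, source_frames))
    for w, x in zip(filters, source_frames):
        w += (mu * e_n / power) * x
    return e_n

# In the method: call nlms_update only when the detection function indicates that
# d(n) contains no voice command; otherwise keep the most recently updated filters
# and only compute the echo estimate (steps S140/S150 and S1211).
```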
Further, after removing the echo y (n) in the speech signal d (n), it is necessary to remove the noise s (n) in the speech signal d (n). Correspondingly, step S130 is: and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.
The deep neural network model is a deep learning-based neural network model and comprises an input layer, a hidden layer and an output layer, wherein the hidden layer can comprise a plurality of layers. The neurons of each layer may be constructed separately first and then trained using a deep learning algorithm to obtain the weights and biases for each neuron of each layer.
Further, as shown in fig. 7, the step S130 may include steps S131 to S132.
Step S131 is: and constructing a deep neural network model.
Assuming that the deep neural network model has L layers in total, of which the hidden layers comprise L-2 layers (L > 2) and the input layer and the output layer are one layer each, and that the number of neurons in any layer l (1 < l ≤ L) is $M_l$, the output function of the i-th neuron (1 ≤ i ≤ $M_l$) of layer l is

$$z_i^{l} = \sum_{j=1}^{M_{l-1}} w_{ij}^{l}\, a_j^{l-1} + b_i^{l}$$

where $w_{ij}^{l}$ is the weight parameter connecting the j-th neuron of layer l-1 with the i-th neuron of layer l, $a_j^{l-1}$ is the activation function value of the j-th neuron of layer l-1, i.e. $a_j^{l-1} = f(z_j^{l-1})$, and $b_i^{l}$ is the bias parameter of the i-th neuron of layer l. In addition, the output function value $z_i^{1}$ of the i-th neuron (1 ≤ i ≤ $M_1$) of the first layer is the i-th input speech value of the input layer of the deep neural network model, and the activation function value of the i-th neuron (1 ≤ i ≤ $M_1$) of the first layer is $a_i^{1} = z_i^{1}$.
It will be appreciated that the activation function f(x) may be a Sigmoid function. The Sigmoid function and its derivative are as follows:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (4)$$

$$f'(x) = f(x)\left(1 - f(x)\right) \qquad (5)$$
the input speech of the input layer is an amplitude spectrum obtained by converting actual speech by fourier transform. Correspondingly, the final amplitude spectrum is obtained after the output functions of all layers are denoised, and then the denoised actual voice can be obtained by performing inverse Fourier transform on the final amplitude spectrum.
Further, step S132 is: training the deep neural network model to obtain a weight parameter and a bias parameter of the neural network model.
The specific training process may be as shown in fig. 8, and step S132 may include steps S1321 to S1324.
Wherein, step S1321 is: and acquiring a pure voice command signal and a noise signal of an actual application environment.
Taking a cab of a rail transit system as an example of a practical application environment, noise is sound collected when echo and voice command signals do not exist in the cab. The clean voice command signal is the voice of the voice command collected in the environment without noise and echo. It is understood that the clean speech instruction signal used to train the deep neural network model may be any speech, and is not required to be the control command speech in the actual application.
Step S1322 is: and mixing the pure voice instruction signal with the noise signal to obtain a noisy voice instruction signal, wherein the pure voice instruction is a label value of the noisy voice instruction signal.
Step S1323 is: and inputting the voice command signal with noise as input voice to an input layer of the deep neural network model to obtain a predicted voice command signal which is output by the output layer and corresponds to the voice command signal with noise.
It can be understood that the tag value is a pure voice command actually corresponding to the noisy voice command signal, and therefore, the matching degree of the predicted voice command signal corresponding to the noisy voice command signal obtained by using the deep neural network model and the corresponding tag value can be used as a measurement index of the accuracy degree of the deep neural network model.
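For illustration, steps S1321 and S1322 can be sketched as a small helper that mixes a clean command with recorded cab noise to build a (noisy input, clean label) training pair; the SNR-based mixing and the function name are assumptions introduced only for this example:

```python
import numpy as np

def make_training_pair(clean_cmd, cab_noise, snr_db=10.0):
    """Mix a clean command with cab noise at a given SNR (step S1322).

    Returns (noisy_signal, clean_signal); the clean signal is the label value.
    """
    n = min(len(clean_cmd), len(cab_noise))
    clean, noise = clean_cmd[:n], cab_noise[:n]
    # Scale the noise so the mixture has the requested signal-to-noise ratio
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + noise, clean
```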
Step S1324 is: and comparing the label value of the noisy speech instruction signal with the corresponding predicted speech instruction signal to update the weight parameter and the bias parameter of the deep neural network model.
It will be appreciated that a cost function may be constructed to measure how well the tag value of the noisy speech command signal matches its corresponding predicted speech command signal and to update the weight parameters and bias parameters based on the match.
In one embodiment, as shown in FIG. 9, step S1324 may include steps S910-S920.
Step S910 is: and determining a cost function value of the predicted voice command signal corresponding to the noisy voice command signal relative to the label value thereof by using Mean-Square Error (MSE).
Then, the cost function is as follows:
$$C = \frac{1}{M_L}\sum_{k=1}^{M_L}\left(y_k - \hat{y}_k\right)^{2} \qquad (6)$$

where $M_L$ is the number of neurons in the output layer of the deep neural network model and can be understood as the dimension of the output data, $y_k$ is the label value of the noisy voice command signal corresponding to the k-th neuron, and $\hat{y}_k$ is the predicted voice command signal, corresponding to the k-th neuron, of the noisy voice command signal.
It can be understood that a smaller cost function of the noisy speech command signal indicates a higher accuracy of the deep neural network model.
Step S920 is: and continuously updating the weight parameters and the bias parameters of the deep neural network model by using a back propagation process and a Stochastic Gradient Descent (SGD) algorithm based on the cost function values.
It can be appreciated that the training process of the deep neural network model is repeated until the accuracy requirement is met. The weight parameter $w_{ij}^{l}$ connecting the j-th neuron of layer l-1 with the i-th neuron (1 ≤ i ≤ $M_l$) of layer l (1 < l ≤ L), and the bias parameter $b_i^{l}$ of the i-th neuron of layer l, are then updated by the following back propagation procedure:

$$w_{ij}^{l} \leftarrow w_{ij}^{l} - \eta\,\frac{\partial C}{\partial w_{ij}^{l}} = w_{ij}^{l} - \eta\,\delta_i^{l}\, a_j^{l-1} \qquad (7)$$

$$b_i^{l} \leftarrow b_i^{l} - \eta\,\frac{\partial C}{\partial b_i^{l}} = b_i^{l} - \eta\,\delta_i^{l} \qquad (8)$$

where, for the output layer, $\delta_i^{L} = \frac{\partial C}{\partial a_i^{L}}\, f'(z_i^{L})$, and for a hidden layer, $\delta_i^{l} = \left(\sum_{k=1}^{M_{l+1}} w_{ki}^{l+1}\,\delta_k^{l+1}\right) f'(z_i^{l})$. In addition, η is a proportionality coefficient that represents the learning rate of the deep neural network model.
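As an illustration only, the combination of the mean-square-error cost of equation (6) with the stochastic-gradient-descent back propagation of equations (7) and (8) can be sketched in NumPy as follows; the function name, the learning rate and the per-frame update scheme are assumptions rather than part of the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(weights, biases, spectrum_frame, label_frame, eta=0.01):
    """One stochastic-gradient-descent update of equations (6)-(8) (sketch).

    weights / biases are lists of per-layer parameters as in the earlier
    forward-pass sketch; eta is the learning rate (the proportionality coefficient).
    """
    # Forward pass, keeping every activation a^1 ... a^L
    acts = [np.asarray(spectrum_frame, dtype=float)]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))

    M_L = len(label_frame)
    # Output-layer delta: dC/da * f'(z), with C the mean square error of equation (6)
    delta = (2.0 / M_L) * (acts[-1] - label_frame) * acts[-1] * (1.0 - acts[-1])
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, acts[l])        # dC/dw_ij^l = delta_i^l * a_j^(l-1)
        grad_b = delta
        if l > 0:
            # Back-propagate the delta to the previous layer
            delta = (weights[l].T @ delta) * acts[l] * (1.0 - acts[l])
        weights[l] -= eta * grad_W               # equation (7)
        biases[l] -= eta * grad_b                # equation (8)
```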
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
According to another aspect of the invention, a voice processing device is also provided, which is suitable for voice processing of the input voice command of the voice interaction system.
In one embodiment, as shown in FIG. 10, the speech processing apparatus 1000 includes a memory 1010 and a processor 1020.
The memory 1010 is used for storing computer programs.
The processor 1020 is connected to the memory 1010 for executing the computer program on the memory 1010, and the steps of the speech processing method 100 in any of the above embodiments are implemented when the processor 1020 executes the computer program on the memory 1010.
According to yet another aspect of the present invention, there is also provided a computer storage medium having a computer program stored thereon, wherein the computer program is configured to implement the steps of the speech processing method 100 in any of the above embodiments when executed.
Those of skill in the art would understand that information, signals, and data may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits (bits), symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. It is to be understood that the scope of the invention is to be defined by the appended claims and not by the specific constructions and components of the embodiments illustrated above. Those skilled in the art can make various changes and modifications to the embodiments within the spirit and scope of the present invention, and these changes and modifications also fall within the scope of the present invention.

Claims (21)

1. A method of speech processing comprising:
acquiring a voice signal acquired by a microphone;
eliminating echo in the voice signal by using an echo elimination model to obtain an intermediate voice signal; and
and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.
2. The speech processing method of claim 1 wherein said canceling the echo in the speech signal using the echo cancellation model for the far-end signal to obtain an intermediate speech signal comprises:
performing echo estimation on the sound source based on the echo by using the echo cancellation model to obtain an echo estimation value of the voice signal; and
subtracting the echo estimation value from the voice signal to obtain the intermediate voice signal.
3. The speech processing method of claim 2 wherein the echoes in the speech signal comprise echoes from a plurality of sound sources, wherein the echo cancellation model comprises a plurality of adaptive filters corresponding to the plurality of sound sources, respectively, and wherein the performing the echo estimation from the echo-based sound source using the echo cancellation model to obtain the echo estimation comprises:
respectively carrying out echo estimation on the plurality of sound sources by adopting the plurality of adaptive filters to respectively obtain echo estimation values of the plurality of sound sources; and
and calculating the sum of the echo estimation values of the sound sources to be used as the echo estimation value of the voice signal.
4. The speech processing method of claim 3, further comprising:
judging whether the voice signal comprises a voice instruction signal or not; and
the step of performing echo estimation on the sound source based on the echo by using the echo cancellation model to obtain an echo estimation value further comprises:
updating the plurality of adaptive filters with the plurality of sound sources in response to not including a voice instruction signal in the voice signal; and
and performing echo estimation on the plurality of sound sources by adopting a plurality of adaptive filters which are updated recently in response to the voice command signals included in the voice signals.
5. The speech processing method of claim 4, wherein the determining whether the speech signal includes a speech instruction signal comprises:
calculating a detection function using the plurality of sound sources and the speech signal collected by the microphone,

$$\xi = \sqrt{\frac{r_{xd}^{T} R_{xx}^{-1} r_{xd}}{\sigma_d^{2}}}$$

wherein $r_{xd} = E[x(n)d(n)] = R_{xx}h$, x(n) is the sum of the sound sources, d(n) is the speech signal, $R_{xx} = E[x(n)x^{T}(n)]$ is the autocorrelation matrix of x(n), h is the echo path, $\sigma_d^{2}$ is the variance of the speech signal d(n), $\sigma_y^{2}$ is the variance of the echo y(n), $\sigma_s^{2}$ is the variance of the noise signal s(n), and $\sigma_v^{2}$ is the variance of the voice command signal v(n);
responding to the detection function value being larger than or equal to a preset threshold value, and judging that the voice signal does not comprise a voice instruction signal; and
and responding to the condition that the detection function value is smaller than the preset threshold value, and judging that the voice signal comprises a voice instruction signal.
6. The speech processing method of claim 4, wherein assuming that the plurality of sound sources are m sound sources, the plurality of filters are m filters corresponding to the m sound sources, m > 1, and updating the plurality of adaptive filters using the plurality of sound sources comprises:
updating the i-th adaptive filter of the plurality of adaptive filters with the parameter update formula

$$\omega_i(n+1) = \omega_i(n) + \frac{\mu\, e(n)\, x_i(n)}{x^{T}(n)x(n) + \alpha}$$

wherein $e(n) = d(n) - \sum_{j=1}^{m}\omega_j^{T}(n)x_j(n)$, d(n) is the speech signal, $x(n) = \sum_{i=1}^{m} x_i(n)$ is the sum of the sound source signals of the m sound sources, $x_i(n)$ is the sound source signal of the i-th of the m sound sources, L is the filter length, μ is a step factor with 0 < μ < 2, and α is a protection coefficient.
7. The speech processing method of claim 1, wherein the deep neural network model comprises an input layer, a hidden layer, and an output layer, and removing noise from the intermediate speech signal using the deep neural network model to obtain the voice instruction signal in the speech signal comprises:
inputting the intermediate speech signal as input speech to the input layer of the deep neural network model, and taking the output signal of the output layer as the voice instruction signal.
8. The speech processing method of claim 7, further comprising:
constructing the deep neural network model according to a layer-wise formula [the formula and its symbols are given as images in the original publication] that expresses, for any one of the hidden layer and the output layer, the output function of the i-th neuron of the l-th layer in terms of: the weight parameters connecting the j-th neurons of the (l-1)-th layer to the i-th neuron of the l-th layer, the activation function values of the j-th neurons of the (l-1)-th layer, a Sigmoid activation function f(x), and the bias parameter of the i-th neuron of the l-th layer, where M_{l-1} is the number of neurons in the (l-1)-th layer, the output function value of the i-th neuron of the input layer is the i-th input speech value of the deep neural network model, and the activation function value of the i-th neuron of the input layer is equal to its output function value; and
training the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model.
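The layer formula in claim 8 is published as an image; the definitions given there (weighted sum of previous-layer activations plus a bias, passed through a Sigmoid) suggest the standard fully-connected form sketched below. The function names and the list-of-matrices representation are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(inputs, weights, biases):
    """Feed-forward pass of the fully connected network described in claim 8,
    under the assumed standard form: each hidden/output layer applies a weighted
    sum of the previous layer's activations plus a bias, followed by the Sigmoid.

    inputs:  1D array fed to the input layer (its activation equals its value).
    weights: list of matrices, one per hidden/output layer.
    biases:  list of vectors matching the weight matrices.
    """
    a = np.asarray(inputs, dtype=float)
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a   # output-layer signal, taken as the voice instruction estimate
```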
9. The speech processing method of claim 8, wherein training the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model comprises:
collecting a clean voice instruction signal and a noise signal of the actual application environment;
mixing the clean voice instruction signal with the noise signal to obtain a noisy voice instruction signal, wherein the clean voice instruction signal serves as the label value of the noisy voice instruction signal;
inputting the noisy voice instruction signal as input speech to the input layer of the deep neural network model to obtain, from the output layer, a predicted voice instruction signal corresponding to the noisy voice instruction signal; and
comparing the label value of the noisy voice instruction signal with its corresponding predicted voice instruction signal to update each weight parameter and each bias parameter of the deep neural network model.
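A sketch of the data-preparation step in claim 9: mix a clean voice instruction with recorded environment noise and keep the clean signal as the label. The SNR scaling and function name are illustrative additions; the claim only requires mixing the two signals.

```python
import numpy as np

def make_training_pair(clean_command, cab_noise, snr_db=5.0):
    """Form one (noisy input, clean label) training pair from a clean voice
    instruction and environment noise (assumed to be at least as long)."""
    noise = cab_noise[:len(clean_command)]
    p_clean = np.mean(clean_command ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mixture reaches the requested SNR (added assumption).
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    noisy = clean_command + gain * noise
    return noisy, clean_command   # the clean signal is the label value
```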
10. The speech processing method of claim 9, wherein comparing the label value of the noisy voice instruction signal with its corresponding predicted voice instruction signal to update each weight parameter and each bias parameter of the deep neural network model comprises:
determining, with a mean-squared-error algorithm, a cost function value of the predicted voice instruction signal corresponding to the noisy voice instruction signal relative to its label value; and
iteratively updating each weight parameter and each bias parameter of the deep neural network model through back propagation and stochastic gradient descent based on the cost function value.
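A compact sketch of the training step in claim 10: mean-squared-error cost, back propagation, and a stochastic-gradient-descent update. The single hidden layer, variable names, and learning rate are illustrative assumptions; the gradients follow from the Sigmoid layers of the claim-8 sketch.

```python
import numpy as np

def train_step(x, label, W1, b1, W2, b2, lr=0.01):
    """One SGD step on a single (noisy, clean) pair with a mean-squared-error cost."""
    # Forward pass through two Sigmoid layers.
    a1 = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    a2 = 1.0 / (1.0 + np.exp(-(W2 @ a1 + b2)))        # predicted voice instruction signal
    cost = np.mean((a2 - label) ** 2)

    # Backward pass: chain rule through the Sigmoid activations.
    delta2 = (2.0 / a2.size) * (a2 - label) * a2 * (1.0 - a2)
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)

    # Stochastic-gradient-descent updates of weights and biases.
    W2 -= lr * np.outer(delta2, a1)
    b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x)
    b1 -= lr * delta1
    return cost
```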
11. A speech processing apparatus, comprising:
a memory for storing a computer program; and
a processor coupled to the memory for executing the computer program on the memory, the processor being configured to:
acquire a voice signal collected by a microphone;
cancel echo in the voice signal using an echo cancellation model based on a far-end signal to obtain an intermediate voice signal; and
remove the noise signal in the intermediate voice signal using a deep neural network model to obtain a voice instruction signal in the voice signal.
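The apparatus claims mirror the method claims. For illustration only, the two processing stages of claim 11 could be chained as below, reusing the `forward` function from the claim-8 sketch; the frame-wise interface is an assumption.

```python
import numpy as np

def process_frame(mic_frame, echo_estimate_frame, weights, biases):
    """End-to-end flow of claim 11 (illustrative): subtract the echo estimate
    produced by the adaptive filters, then remove residual noise with the DNN
    forward pass to recover the voice instruction signal."""
    intermediate = np.asarray(mic_frame, dtype=float) - np.asarray(echo_estimate_frame, dtype=float)
    return forward(intermediate, weights, biases)
```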
12. The speech processing apparatus of claim 11, wherein the processor is further configured to:
perform echo estimation based on the sound source of the echo using the echo cancellation model to obtain an echo estimation value of the voice signal; and
subtract the echo estimation value from the voice signal to obtain the intermediate voice signal.
13. The speech processing apparatus of claim 12, wherein the echo in the speech signal comprises echoes from a plurality of sound sources, the echo cancellation model comprises a plurality of adaptive filters corresponding to the plurality of sound sources respectively, and the processor is further configured to:
perform echo estimation on the plurality of sound sources with the plurality of adaptive filters, respectively, to obtain an echo estimation value for each sound source; and
take the sum of the echo estimation values of the sound sources as the echo estimation value of the voice signal.
14. The speech processing apparatus of claim 13, wherein the processor is further configured to:
determine whether the voice signal includes a voice instruction signal;
update the plurality of adaptive filters with the plurality of sound sources in response to the voice signal not including a voice instruction signal; and
perform echo estimation on the plurality of sound sources with the most recently updated adaptive filters in response to the voice signal including a voice instruction signal.
15. The speech processing apparatus of claim 14, wherein the processor is further configured to:
calculate a detection function from the plurality of sound sources and the voice signal collected by the microphone [the detection function and auxiliary formulas are given as images in the original publication], where r_xd = E[x(n)d(n)] = R_xx·h, R_xx = E[x(n)x^T(n)], x(n) is the sum of the sound sources, d(n) is the speech signal, R_xx is the autocorrelation matrix of x(n), h is the echo path, σ_d² is the variance of the speech signal d(n), σ_y² is the variance of the echo y(n), σ_s² is the variance of the noise signal s(n), and σ_v² is the variance of the voice instruction signal v(n);
determine that the voice signal does not include a voice instruction signal in response to the detection function value being greater than or equal to a preset threshold; and
determine that the voice signal includes a voice instruction signal in response to the detection function value being less than the preset threshold.
16. The speech processing apparatus of claim 14, wherein, assuming the plurality of sound sources are m sound sources and the plurality of adaptive filters are m filters corresponding to the m sound sources, m > 1, the processor is further configured to:
update the i-th adaptive filter of the plurality of adaptive filters with a parameter update formula [given as an image in the original publication], where y(n) is the speech signal, x(n) is the sum of the sound source signals of the m sound sources, x_i is the sound source signal of the i-th sound source among the m sound sources, L is a parameter of the update formula, μ is a step factor with 0 < μ < 2, and α is a protection coefficient.
17. The speech processing apparatus of claim 11, wherein the deep neural network model comprises an input layer, a hidden layer, and an output layer, and the processor is further configured to:
input the intermediate voice signal as input speech to the input layer of the deep neural network model, and take the output signal of the output layer as the voice instruction signal.
18. The speech processing apparatus of claim 17, wherein the processor is further configured to:
construct the deep neural network model according to a layer-wise formula [the formula and its symbols are given as images in the original publication] that expresses, for any one of the hidden layer and the output layer, the output function of the i-th neuron of the l-th layer in terms of: the weight parameters connecting the j-th neurons of the (l-1)-th layer to the i-th neuron of the l-th layer, the activation function values of the j-th neurons of the (l-1)-th layer, a Sigmoid activation function f(x), and the bias parameter of the i-th neuron of the l-th layer, where M_{l-1} is the number of neurons in the (l-1)-th layer, the output function value of the i-th neuron of the input layer is the i-th input speech value of the deep neural network model, and the activation function value of the i-th neuron of the input layer is equal to its output function value; and
train the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model.
19. The speech processing apparatus of claim 18, wherein the speech processing apparatus is adapted to a cab of a rail transit vehicle, and the processor is further configured to:
collect a clean voice instruction signal and a noise signal of the actual application environment;
mix the clean voice instruction signal with the noise signal to obtain a noisy voice instruction signal, wherein the clean voice instruction signal serves as the label value of the noisy voice instruction signal;
input the noisy voice instruction signal as input speech to the input layer of the deep neural network model to obtain, from the output layer, a predicted voice instruction signal corresponding to the noisy voice instruction signal; and
compare the label value of the noisy voice instruction signal with its corresponding predicted voice instruction signal to update each weight parameter and each bias parameter of the deep neural network model.
20. The speech processing apparatus of claim 19, wherein the processor is further configured to:
determine, with a mean-squared-error algorithm, a cost function value of the predicted voice instruction signal corresponding to the noisy voice instruction signal relative to its label value; and
iteratively update each weight parameter and each bias parameter of the deep neural network model through back propagation and stochastic gradient descent based on the cost function value.
21. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed, implements the steps of the speech processing method according to any one of claims 1 to 10.
CN202010942560.4A 2020-09-09 Voice processing method and device Active CN114242106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010942560.4A CN114242106B (en) 2020-09-09 Voice processing method and device

Publications (2)

Publication Number Publication Date
CN114242106A (en) 2022-03-25
CN114242106B (en) 2024-10-29

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101262530A (en) * 2008-04-29 2008-09-10 中兴通讯股份有限公司 A device for eliminating echo of mobile terminal
KR20120022101A (en) * 2010-09-01 2012-03-12 (주)제이유디지탈 Noise reduction method and device in voice communication of iptv
US20160019909A1 (en) * 2013-03-15 2016-01-21 Dolby Laboratories Licensing Corporation Acoustic echo mitigation apparatus and method, audio processing apparatus and voice communication terminal
US9286883B1 (en) * 2013-09-26 2016-03-15 Amazon Technologies, Inc. Acoustic echo cancellation and automatic speech recognition with random noise
CN104751842A (en) * 2013-12-31 2015-07-01 安徽科大讯飞信息科技股份有限公司 Method and system for optimizing deep neural network
KR101592425B1 (en) * 2014-09-24 2016-02-05 현대자동차주식회사 Speech preprocessing apparatus, apparatus and method for speech recognition
CN105957520A (en) * 2016-07-04 2016-09-21 北京邮电大学 Voice state detection method suitable for echo cancellation system
US20180040333A1 (en) * 2016-08-03 2018-02-08 Apple Inc. System and method for performing speech enhancement using a deep neural network-based signal
US20200105287A1 (en) * 2017-04-14 2020-04-02 Industry-University Cooperation Foundation Hanyang University Deep neural network-based method and apparatus for combining noise and echo removal
CN107017004A (en) * 2017-05-24 2017-08-04 建荣半导体(深圳)有限公司 Noise suppressing method, audio processing chip, processing module and bluetooth equipment
CN111161752A (en) * 2019-12-31 2020-05-15 歌尔股份有限公司 Echo cancellation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant