CN114242106A - Voice processing method and device - Google Patents

Voice processing method and device

Info

Publication number
CN114242106A
Authority
CN
China
Prior art keywords
signal
voice
echo
speech
sound sources
Prior art date
Legal status
Granted
Application number
CN202010942560.4A
Other languages
Chinese (zh)
Other versions
CN114242106B (en)
Inventor
褚伟
胡云卿
刘悦
林军
罗潇
Current Assignee
CRRC Zhuzhou Institute Co Ltd
Original Assignee
CRRC Zhuzhou Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by CRRC Zhuzhou Institute Co Ltd filed Critical CRRC Zhuzhou Institute Co Ltd
Priority to CN202010942560.4A
Publication of CN114242106A
Application granted
Publication of CN114242106B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 2021/02082 - Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention provides a voice processing method and a voice processing device. The voice processing method comprises the following steps: acquiring a voice signal acquired by a microphone; eliminating echo in the voice signal by using an echo elimination model to obtain an intermediate voice signal; and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.

Description

Voice processing method and device
Technical Field
The present invention relates to the field of voice processing, and in particular, to a voice processing method and apparatus for a voice interactive system.
Background
The electric trolley car is a common type of public passenger transport vehicle and includes rail trolleys, light rail vehicles, trams and the like. Existing rail trolleys, light rail vehicles and trams all require dedicated tracks to operate, so infrastructure construction and vehicle acquisition costs are high.
To address this problem, CRRC Zhuzhou Institute has proposed an electric vehicle that follows a virtual track on the ground. This new type of vehicle dispenses with steel rails: it is carried on rubber tires, steered by a steering wheel, and runs along a virtual track on the ground. The ground virtual track can be laid out flexibly, since only a virtual track similar to a lane line needs to be marked on the road surface. Because the vehicle does not have to travel along a fixed rail, infrastructure cost is greatly reduced, which gives it a substantial operating advantage over trams. At the same time, this new vehicle shares right of way and runs in mixed traffic, so the transport system is flexible to organise in terms of ground lane arrangement and the like.
The cab of this new vehicle is equipped with a voice announcement system and a large-screen display system. The two systems run independently and do not interfere with each other. The voice announcement system broadcasts dispatch instructions and prompt messages. The large-screen display system shows information such as traction lockout status, vehicle information, air-conditioning state, tire pressure, battery level and fault records. The large-screen display system has a built-in microphone and loudspeaker, used for sound pickup and voice output respectively, and the displayed status information can be switched through a voice interaction system.
To ensure driving safety, the driver's attention should stay on the road, so the status information shown on the large screen is switched by voice interaction. However, because of acoustic interference from the voice announcement system and the large-screen display system, the audio picked up by the microphone contains not only the voice interaction command but also the echo of the announcement system's audio and of the large screen's audio, and may further contain air-conditioner noise in the cab.
The present invention aims to provide a voice processing method and apparatus that remove the echo and noise from the voice signal collected by the microphone.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of the present invention, there is provided a speech processing method including: acquiring a voice signal acquired by a microphone; eliminating echo in the voice signal by using an echo elimination model to obtain an intermediate voice signal; and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.
Further, cancelling the echo in the speech signal by using the echo cancellation model based on the far-end signals to obtain an intermediate speech signal includes: performing echo estimation with the echo cancellation model, based on the sound sources that produce the echo, to obtain an echo estimate of the speech signal; and subtracting the echo estimate from the speech signal to obtain the intermediate speech signal.
Further, the echo in the speech signal includes echoes of a plurality of sound sources, the echo cancellation model includes a plurality of adaptive filters respectively corresponding to the plurality of sound sources, and performing echo estimation with the echo cancellation model based on the sound sources that produce the echo to obtain an echo estimate includes: performing echo estimation on the plurality of sound sources with the plurality of adaptive filters respectively, to obtain echo estimates of the plurality of sound sources; and taking the sum of the echo estimates of the plurality of sound sources as the echo estimate of the speech signal.
Still further, the speech processing method further comprises: judging whether the speech signal includes a voice instruction signal; and performing echo estimation with the echo cancellation model based on the sound sources that produce the echo to obtain an echo estimate further comprises: updating the plurality of adaptive filters using the plurality of sound sources in response to the speech signal not including a voice instruction signal; and performing echo estimation on the plurality of sound sources with the most recently updated adaptive filters in response to the speech signal including a voice instruction signal.
Further, judging whether the speech signal includes a voice instruction signal includes: calculating a detection function, using the plurality of sound sources and the speech signal collected by the microphone,

$$\xi = \sqrt{\frac{r_{xd}^{T} R_{xx}^{-1} r_{xd}}{\sigma_d^{2}}}$$

where $r_{xd} = E[x(n)d(n)] = R_{xx}h$, x(n) is the sum of the sound sources, d(n) is the speech signal, $R_{xx} = E[x(n)x^{T}(n)]$ is the autocorrelation matrix of x(n), h is the echo path, $\sigma_d^{2}$ is the variance of the speech signal d(n), $\sigma_y^{2}$ is the variance of the echo y(n), $\sigma_s^{2}$ is the variance of the noise signal s(n), and $\sigma_v^{2}$ is the variance of the voice instruction signal v(n); judging, in response to the detection function value being greater than or equal to a preset threshold, that the speech signal does not include a voice instruction signal; and judging, in response to the detection function value being smaller than the preset threshold, that the speech signal includes a voice instruction signal.
Further, assuming that the plurality of sound sources are m sound sources and the plurality of filters are m filters corresponding to the m sound sources, with m > 1, updating the plurality of adaptive filters using the plurality of sound sources includes: updating the i-th adaptive filter of the plurality of adaptive filters with the parameter update formula

$$\omega_i(n+1) = \omega_i(n) + \frac{\mu\, e(n)\, x_i(n)}{x^{T}(n)x(n) + \alpha}$$

where $e(n) = d(n) - \sum_{j=1}^{m}\omega_j^{T}(n)x_j(n)$, d(n) is the speech signal, $x(n) = \sum_{i=1}^{m} x_i(n)$ is the sum of the sound source signals of the m sound sources, $x_i(n)$ is the sound source signal of the i-th of the m sound sources, L is the filter length, μ is a step factor with 0 < μ < 2, and α is a protection coefficient.
Further, the deep neural network model comprises an input layer, a hidden layer and an output layer, and the removing noise in the intermediate speech signal by using the deep neural network model to obtain the speech instruction signal in the speech signal comprises: and inputting the intermediate voice signal as input voice to an input layer of the deep neural network model to obtain an output signal of the output layer as the voice instruction signal.
Still further, the speech processing method further comprises: constructing the deep neural network model, in which the output function of the i-th neuron of any layer l among the hidden layer and the output layer is

$$z_i^{l} = \sum_{j=1}^{M_{l-1}} w_{ij}^{l}\, a_j^{l-1} + b_i^{l}$$

where $w_{ij}^{l}$ is the weight parameter connecting the j-th neuron of layer l-1 with the i-th neuron of layer l, $a_j^{l-1} = f(z_j^{l-1})$ is the activation function value of the j-th neuron of layer l-1, f(x) is a Sigmoid function, $b_i^{l}$ is the bias parameter of the i-th neuron of layer l, and $M_{l-1}$ is the number of neurons in layer l-1; the output function value of the i-th neuron of the input layer is the i-th input speech value of the deep neural network model, and the activation function value of the i-th neuron of the input layer equals its output function value; and training the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model.
Still further, the training the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model comprises: collecting a pure voice command signal and a noise signal of an actual application environment; mixing the pure voice instruction signal with the noise signal to obtain a voice instruction signal with noise, wherein the pure voice instruction is a label value of the voice instruction signal with noise; inputting the voice instruction signal with noise as input voice to an input layer of the deep neural network model to obtain a predicted voice instruction signal which is output by the output layer and corresponds to the voice instruction signal with noise; and comparing the label value of the noisy speech instruction signal with the corresponding predicted speech instruction signal to update each weight parameter and each bias parameter of the deep neural network model.
Further, the comparing the tag value of the noisy speech instruction signal with the corresponding predicted speech instruction signal to update each weight parameter and each bias parameter of the deep neural network model comprises: determining a cost function value of a predicted voice instruction signal corresponding to the voice instruction signal with noise relative to a tag value thereof by adopting a mean square error algorithm; and continuously updating each weight parameter and each bias parameter of the deep neural network model by using a back propagation process based on the cost function value and adopting a stochastic gradient descent algorithm.
According to another aspect of the present invention, there is also provided a speech processing apparatus including: a memory for storing a computer program; and a processor coupled to the memory for executing the computer program on the memory, the processor configured to: acquiring a voice signal acquired by a microphone; eliminating echo in the voice signal by using an echo elimination model of a far-end signal to obtain an intermediate voice signal; and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.
Still further, the processor is further configured to: performing echo estimation on the sound source based on the echo by using the echo cancellation model to obtain an echo estimation value of the voice signal; and subtracting the echo estimate from the speech signal to obtain the intermediate speech signal.
Still further, the echo in the speech signal comprises echoes of a plurality of sound sources, the echo cancellation model comprises a plurality of adaptive filters corresponding to the plurality of sound sources, respectively, and the processor is further configured to: respectively carrying out echo estimation on the plurality of sound sources by adopting the plurality of adaptive filters to respectively obtain echo estimation values of the plurality of sound sources; and finding a sum of the echo estimation values of the plurality of sound sources as an echo estimation value of the voice signal.
Still further, the processor is further configured to: judging whether the voice signal comprises a voice instruction signal or not; updating the plurality of adaptive filters with the plurality of sound sources in response to not including a voice instruction signal in the voice signal; and performing echo estimation on the plurality of sound sources by adopting a plurality of adaptive filters which are updated recently in response to the voice command signals included in the voice signals.
Still further, the processor is further configured to: calculate a detection function using the plurality of sound sources and the speech signal collected by the microphone,

$$\xi = \sqrt{\frac{r_{xd}^{T} R_{xx}^{-1} r_{xd}}{\sigma_d^{2}}}$$

where $r_{xd} = E[x(n)d(n)] = R_{xx}h$, x(n) is the sum of the sound sources, d(n) is the speech signal, $R_{xx} = E[x(n)x^{T}(n)]$ is the autocorrelation matrix of x(n), h is the echo path, $\sigma_d^{2}$ is the variance of the speech signal d(n), $\sigma_y^{2}$ is the variance of the echo y(n), $\sigma_s^{2}$ is the variance of the noise signal s(n), and $\sigma_v^{2}$ is the variance of the voice command signal v(n); determine, in response to the detection function value being greater than or equal to a preset threshold, that the speech signal does not include a voice command signal; and determine, in response to the detection function value being smaller than the preset threshold, that the speech signal includes a voice command signal.
Further, assuming that the plurality of sound sources are m sound sources and the plurality of filters are m filters corresponding to the m sound sources, with m > 1, the processor is further configured to: update the i-th adaptive filter of the plurality of adaptive filters with the parameter update formula

$$\omega_i(n+1) = \omega_i(n) + \frac{\mu\, e(n)\, x_i(n)}{x^{T}(n)x(n) + \alpha}$$

where $e(n) = d(n) - \sum_{j=1}^{m}\omega_j^{T}(n)x_j(n)$, d(n) is the speech signal, $x(n) = \sum_{i=1}^{m} x_i(n)$ is the sum of the sound source signals of the m sound sources, $x_i(n)$ is the sound source signal of the i-th of the m sound sources, L is the filter length, μ is a step factor with 0 < μ < 2, and α is a protection coefficient.
Still further, the deep neural network model includes an input layer, a hidden layer, and an output layer, the processor further configured to: and inputting the intermediate voice signal as input voice to an input layer of the deep neural network model to obtain an output signal of the output layer as the voice instruction signal.
Still further, the processor is further configured to: construct the deep neural network model, in which the output function of the i-th neuron of any layer l among the hidden layer and the output layer is

$$z_i^{l} = \sum_{j=1}^{M_{l-1}} w_{ij}^{l}\, a_j^{l-1} + b_i^{l}$$

where $w_{ij}^{l}$ is the weight parameter connecting the j-th neuron of layer l-1 with the i-th neuron of layer l, $a_j^{l-1} = f(z_j^{l-1})$ is the activation function value of the j-th neuron of layer l-1, f(x) is a Sigmoid function, $b_i^{l}$ is the bias parameter of the i-th neuron of layer l, and $M_{l-1}$ is the number of neurons in layer l-1; the output function value of the i-th neuron of the input layer is the i-th input speech value of the deep neural network model, and the activation function value of the i-th neuron of the input layer equals its output function value; and train the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model.
Still further, the processor is further configured to: collecting a pure voice command signal and a noise signal of an actual application environment; mixing the pure voice instruction signal with the noise signal to obtain a voice instruction signal with noise, wherein the pure voice instruction is a label value of the voice instruction signal with noise; inputting the voice instruction signal with noise as input voice to an input layer of the deep neural network model to obtain a predicted voice instruction signal which is output by the output layer and corresponds to the voice instruction signal with noise; and comparing the label value of the noisy speech instruction signal with the corresponding predicted speech instruction signal to update each weight parameter and each bias parameter of the deep neural network model.
Still further, the processor is further configured to: determining a cost function value of a predicted voice instruction signal corresponding to the voice instruction signal with noise relative to a tag value thereof by adopting a mean square error algorithm; and continuously updating each weight parameter and each bias parameter of the deep neural network model by using a back propagation process based on the cost function value and adopting a stochastic gradient descent algorithm.
According to yet another aspect of the present invention, there is also provided a computer storage medium having a computer program stored thereon, the computer program when executed implementing the steps of the speech processing method of any of the above.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings.
FIG. 1 is a flow diagram illustrating a method of speech processing in one embodiment according to one aspect of the present invention;
FIG. 2 is a schematic diagram of voice interaction of a cab of a rail transit system, depicted in accordance with an aspect of the present invention;
FIG. 3 is a partial flow diagram of a speech processing method in one embodiment according to one aspect of the present invention;
FIG. 4 is a partial flow diagram of a speech processing method in one embodiment according to one aspect of the present invention;
FIG. 5 is a partial flow diagram of a speech processing method in one embodiment according to an aspect of the present invention;
FIG. 6 is a partial flow diagram of a method of speech processing in one embodiment according to one aspect of the present invention;
FIG. 7 is a partial flow diagram of a method of speech processing in one embodiment according to an aspect of the present invention;
FIG. 8 is a partial flow diagram of a method of speech processing in one embodiment according to an aspect of the present invention;
FIG. 9 is a partial flow diagram of a speech processing method in one embodiment according to an aspect of the present invention;
FIG. 10 is a block diagram of a speech processing apparatus according to another aspect of the present invention.
Detailed Description
The following description is presented to enable any person skilled in the art to make and use the invention and is incorporated in the context of a particular application. Various modifications, as well as various uses in different applications will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the practice of the invention may not necessarily be limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
It is noted that, where used, the terms "further," "preferably," "still further" and "more preferably" briefly introduce an alternative embodiment built on the preceding embodiment, and the content following "further," "preferably," "still further" or "more preferably" is combined with the preceding embodiment to form a complete alternative embodiment. Several "further," "preferred," "still further" or "more preferred" statements following the same embodiment may be combined in any combination to form a further embodiment.
The invention is described in detail below with reference to the figures and specific embodiments. It is noted that the aspects described below in connection with the figures and the specific embodiments are only exemplary and should not be construed as imposing any limitation on the scope of the present invention.
According to one aspect of the invention, a voice processing method is provided, which can be used for processing input voice instructions of a voice interaction system.
A voice interaction system is a system that acquires a voice instruction entered by a user and generates a corresponding interaction, for example "Siri" on an Apple phone, an intelligent robot, or a smart home device. An ordinary voice interaction system does not face a complex acoustic environment: the background sound may contain only ambient noise, so the voice instruction issued by the user can be obtained after the ambient noise is removed. In the cab of the rail transit system, however, the large-screen display system and the voice announcement system frequently emit voice messages, which propagate to the microphone and are collected by the voice interaction system together with the voice instructions spoken by the user, and ambient noise such as air-conditioner noise is mixed in as well. Processing the voice signal collected by the cab's voice interaction system is therefore more complex than for an ordinary voice interaction system.
In one embodiment, as shown in FIG. 1, the speech processing method 100 includes steps S110 to S130.
Wherein, step S110 is: and acquiring a voice signal collected by a microphone.
The voice signal refers to a mixed sound collected by a microphone, and specifically, the meaning of the voice signal is described by taking a sound propagation process of a cab of the rail transit system shown in fig. 2 as an example.
As shown in FIG. 2, the large-screen display system plays its sound source $x_1(n)$ in the cab through one loudspeaker, and the voice announcement system plays its sound source $x_2(n)$ in the cab through another loudspeaker. The sounds played by the two loudspeakers travel along the echo propagation paths $h_1$ and $h_2$ respectively and form a mixed echo y(n) when they reach the near end of the microphone; ambient noise such as air-conditioner noise forms a noise signal s(n) at the near end of the microphone; and the voice command actually spoken by the user forms a voice command signal v(n) at the near end of the microphone. It will be understood that the three sound signals, the echo y(n), the noise signal s(n) and the voice command signal v(n), do not necessarily exist at the same time: the echo y(n) and the noise signal s(n) occur somewhat randomly, and the voice command signal v(n) exists only when the user speaks. Therefore, when the microphone picks up sound, the collected speech signal d(n) may be any combination of the echo y(n), the noise signal s(n) and the voice command signal v(n).
Since the goal of the speech processing method 100 is to extract an accurate voice command signal v(n), whenever a voice command signal v(n) is present in the collected speech signal d(n) it is assumed by default that the other two components, the echo y(n) and the noise signal s(n), are also present in d(n); the collected speech signal d(n) is then processed to remove the echo y(n) and the noise signal s(n) regardless of whether they are actually present, leaving the voice command signal v(n). Those skilled in the art will understand that, although the speech processing assumes by default that the microphone signal is a mixture of the three sounds, the actually collected speech signal d(n) is not necessarily such a mixture; even when it is not, neither the speech processing procedure nor its final result is affected.
Although the present invention has been described with reference to the cab of fig. 2 as an example for the purpose of voice signal processing, it can be understood by those skilled in the art that the application environment to which the voice processing method 100 is applied is not limited to the existence of echoes of two sound sources, and the echoes collected by the microphone may include echoes of multiple sound sources and multiple environmental noises.
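For illustration only, the signal model described above (two loudspeaker sources, echo paths, air-conditioner noise and an optional voice command mixing at the microphone) can be written as a short NumPy sketch; the sample rate, echo-path length and the synthetic signals are assumptions introduced for this example and are not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                      # assumed sample rate
N = 5 * fs                      # five seconds of signal

# Loudspeaker source signals: large-screen system x1 and announcement system x2
x1 = rng.standard_normal(N)
x2 = rng.standard_normal(N)

# Echo propagation paths h1, h2 (unknown impulse responses, modelled here as decaying random FIRs)
L = 256                         # assumed echo-path length
h1 = rng.standard_normal(L) * np.exp(-np.arange(L) / 50.0)
h2 = rng.standard_normal(L) * np.exp(-np.arange(L) / 50.0)

# Mixed echo y(n) at the near end of the microphone
y = np.convolve(x1, h1, mode="full")[:N] + np.convolve(x2, h2, mode="full")[:N]

s = 0.05 * rng.standard_normal(N)   # air-conditioner-like noise s(n)
v = np.zeros(N)                     # voice command v(n); zero while the user is silent

# Speech signal actually collected by the microphone
d = y + s + v
```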
Step S120 is: cancelling the echo in the speech signal using an echo cancellation model to obtain an intermediate speech signal.
The echo cancellation model is a model that uses the sound sources that generate the echo to estimate the echo picked up by the microphone after those sources are played through the loudspeakers. The echo cancellation model can therefore be used to obtain an estimate $\hat{y}(n)$ of the echo y(n), and $\hat{y}(n)$ is then used to remove the echo y(n) from the speech signal d(n).
Further, as shown in FIG. 3, step S120 can be embodied as steps S121-S122.
Step S121 is: performing echo estimation with the echo cancellation model, based on the sound sources that produce the echo, to obtain an echo estimate $\hat{y}(n)$ of the speech signal.
Preferably, the echo cancellation model may be formed by an adaptive filter. An adaptive filter refers to a filter that updates parameters and structure of the filter using an adaptive algorithm according to a change in environment. The echo cancellation model may then be constructed using a filter that does not change structure but that is updated with filter coefficients by an adaptive algorithm.
Suppose there are multiple sound sources $x_1(n)$ to $x_m(n)$ (m is an integer greater than 1); the echo cancellation model then includes corresponding adaptive filters $\omega_1$ to $\omega_m$. Correspondingly, as shown in FIG. 4, step S121 can be embodied as steps S1211 to S1212.
Step S1211 is: performing echo estimation on the plurality of sound sources $x_1(n)$ to $x_m(n)$ with the plurality of adaptive filters $\omega_1$ to $\omega_m$ respectively, to obtain the echo estimates $\hat{y}_1(n), \ldots, \hat{y}_m(n)$ of the plurality of sound sources, where the echo estimate of the i-th sound source (1 ≤ i ≤ m) is $\hat{y}_i(n) = \omega_i^{T}(n)\,x_i(n)$.
Step S1212 is: taking the sum of the echo estimates corresponding to the plurality of sound sources $x_1(n)$ to $x_m(n)$ as the echo estimate of the speech signal d(n), i.e.

$$\hat{y}(n) = \sum_{i=1}^{m} \hat{y}_i(n) = \sum_{i=1}^{m} \omega_i^{T}(n)\,x_i(n).$$

It will be appreciated that in the example of the cab of the rail transit system shown in FIG. 2 there are two sound sources $x_1(n)$ and $x_2(n)$, and hence two corresponding adaptive filters $\omega_1$ and $\omega_2$. The two adaptive filters perform echo estimation on the sound sources $x_1(n)$ and $x_2(n)$ respectively, finally yielding the echo estimate of the cab $\hat{y}(n) = \hat{y}_1(n) + \hat{y}_2(n)$.
Further, step S122 is: subtracting the echo estimate $\hat{y}(n)$ from the speech signal d(n) to obtain the intermediate speech signal d'(n), i.e. $d'(n) = d(n) - \hat{y}(n)$.
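A minimal Python sketch of steps S1211, S1212 and S122 (per-source filtering, summation of the echo estimates and subtraction from the microphone sample) is given below, assuming NumPy arrays for the filter weights and source frames; the helper names are illustrative and not part of the patent:

```python
import numpy as np

def echo_estimate(filters, source_frames):
    """Sum of per-source echo estimates: y_hat(n) = sum_i w_i^T x_i(n).

    filters       -- list of m weight vectors, each of length L
    source_frames -- list of m vectors holding the latest L samples of each source
    """
    return sum(float(w @ x) for w, x in zip(filters, source_frames))

def intermediate_sample(d_n, filters, source_frames):
    """Step S122: subtract the echo estimate from the microphone sample d(n)."""
    return d_n - echo_estimate(filters, source_frames)

# Example (illustrative): two sources, filter length 256
# filters = [np.zeros(256), np.zeros(256)]
# frames  = [x1_frame, x2_frame]          # latest 256 samples of each source
# d_prime = intermediate_sample(d_n, filters, frames)
```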
Further, in a more preferred embodiment, the filter parameters of the adaptive filters $\omega_1$ to $\omega_m$ can be continuously updated using the speech signal collected by the microphone whenever no voice command signal is present. Specifically, as shown in FIG. 5, the speech processing method 100 further includes steps S140 to S150.
Wherein, step S140 is: judging whether the voice signal d (n) collected by the microphone includes a voice command signal v (n).
In particular, a detection function ξ can be constructed from the plurality of sound sources $x_1(n)$ to $x_m(n)$ and the speech signal d(n) collected by the microphone, and the value of the detection function can be used to judge whether the speech signal d(n) includes a voice command signal v(n).
In one embodiment, the constructed detection function is as follows:

$$\xi = \sqrt{\frac{r_{xd}^{T} R_{xx}^{-1} r_{xd}}{\sigma_d^{2}}} \qquad (1)$$

where $r_{xd} = E[x(n)d(n)] = R_{xx}h$, x(n) is the sum of the plurality of sound sources $x_1(n)$ to $x_m(n)$, i.e. $x(n) = \sum_{i=1}^{m} x_i(n)$, d(n) is the speech signal collected by the microphone, $R_{xx} = E[x(n)x^{T}(n)]$ is the autocorrelation matrix of x(n), h is the echo path, $\sigma_d^{2}$ is the variance of the speech signal d(n), $\sigma_y^{2}$ is the variance of the echo y(n) and can be expressed in terms of the sound sources as $\sigma_y^{2} = h^{T}R_{xx}h$, $\sigma_s^{2}$ is the variance of the noise signal s(n), and $\sigma_v^{2}$ is the variance of the voice command signal v(n).
Substituting $r_{xd} = R_{xx}h$, $\sigma_y^{2} = h^{T}R_{xx}h$ and $\sigma_d^{2} = \sigma_y^{2} + \sigma_s^{2} + \sigma_v^{2}$ into the detection function (1), equation (1) can be transformed into:

$$\xi = \sqrt{\frac{\sigma_y^{2}}{\sigma_y^{2} + \sigma_s^{2} + \sigma_v^{2}}} \qquad (2)$$
as can be seen from equation (2), when the speech signal d (n) includes only the echo y (n), the detection function value is equal to 1, and when the speech signal d (n) includes the echo y (n), the noise signal s (n) and the speech command signal v (n), the calculated detection function value is obviously smaller than 1. Therefore, the above-constructed detection function can be used to determine whether the voice signal d (n) includes the voice command signal v (n).
Further, as shown in fig. 6, the step S140 may include steps S141 to S143.
Step S141 is: calculating the detection function ξ using the plurality of sound sources $x_1(n)$ to $x_m(n)$ and the speech signal d(n) collected by the microphone. That is, x(n) and d(n) are substituted into formula (1) or formula (2) to calculate the corresponding detection function value.
Step S142 is: and responding to the detection function value being larger than or equal to a preset threshold value, and judging that the voice signal does not comprise a voice instruction signal.
Step S143 is: and responding to the condition that the detection function value is smaller than the preset threshold value, and judging that the voice signal comprises a voice instruction signal.
The preset threshold value can be set to be slightly less than 1, and when the calculated detection function value is less than the preset threshold value, the voice signal d (n) can be judged to comprise a voice command signal v (n); when the calculated detection function value is greater than or equal to the preset threshold value, it can be determined that the voice signal d (n) does not include the voice command signal v (n).
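A minimal sketch of how the detection function of equation (1) and the threshold test of steps S141 to S143 might be computed from sample estimates is given below; the frame construction, the sample-average estimators and the threshold value 0.95 are assumptions made only for this illustration:

```python
import numpy as np

def detection_function(x, d, L=256):
    """Sample-based estimate of the detection function xi of equation (1).

    x -- summed loudspeaker source signal x(n) (1-D array)
    d -- microphone speech signal d(n) (1-D array, same length)
    L -- assumed echo-path / filter length
    """
    N = len(d)
    # Build length-L regressor vectors [x(n), x(n-1), ..., x(n-L+1)]
    X = np.stack([x[n - L + 1:n + 1][::-1] for n in range(L - 1, N)])
    dn = d[L - 1:N]
    r_xd = X.T @ dn / len(dn)              # cross-correlation vector r_xd
    R_xx = X.T @ X / len(dn)               # autocorrelation matrix R_xx
    var_d = float(np.mean(dn ** 2))        # variance of d(n), zero mean assumed
    num = float(r_xd @ np.linalg.solve(R_xx, r_xd))
    return np.sqrt(max(num, 0.0) / var_d)

# Usage (illustrative): treat xi >= 0.95 as "echo only", xi < 0.95 as "command present"
# threshold = 0.95
# has_command = detection_function(x, d) < threshold
```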
Further, step S150 is: in response to the speech signal d(n) collected by the microphone not including a voice command signal v(n), updating the plurality of adaptive filters $\omega_1$ to $\omega_m$ using the plurality of sound sources $x_1(n)$ to $x_m(n)$.
It will be appreciated that, in order to make the echo estimate approach the true echo more closely, the filter parameters of the adaptive filters $\omega_1$ to $\omega_m$ can be continuously updated based on the speech signal d(n) and the residual left after the previous filtering. Specifically, the update formula of the i-th adaptive filter $\omega_i$ can be as follows:

$$\omega_i(n+1) = \omega_i(n) + \frac{\mu\, e(n)\, x_i(n)}{x^{T}(n)x(n) + \alpha} \qquad (3)$$

where $e(n) = d(n) - \sum_{j=1}^{m}\omega_j^{T}(n)\,x_j(n)$ is the filtering residual, $x(n) = \sum_{i=1}^{m} x_i(n)$ is the sum of the sound source signals of the m sound sources, $x_i(n)$ is the sound source signal of the i-th of the m sound sources, L is the filter length, μ is a step factor with 0 < μ < 2, and α is a protection coefficient. The protection coefficient α prevents the filter from becoming unstable when the inner product $x^{T}(n)x(n)$ of the sound source signal x(n) is too small, and may be set to a small value such as 0.0001.
Step S1211 is then preferably configured to: in response to the speech signal d(n) collected by the microphone including a voice command signal v(n), performing echo estimation on the plurality of sound sources $x_1(n)$ to $x_m(n)$ with the most recently updated adaptive filters to obtain the echo estimates of the plurality of sound sources. That is, when it is detected that the speech signal includes a voice command signal, the filter parameters of the adaptive filters are not updated; instead, the adaptive filters with the parameters determined in the most recent update (the adaptive filters $\omega_1$ to $\omega_m$ produced by the most recently executed step S150) are used to perform the echo estimation.
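The gated update of step S150 together with the filter update formula (3) can be sketched as follows; the function signature, the step size and the in-place update of the weight vectors are illustrative assumptions rather than part of the patent:

```python
import numpy as np

def nlms_update(filters, source_frames, d_n, mu=0.5, alpha=1e-4):
    """One update step of formula (3) for all m adaptive filters (sketch).

    filters       -- list of m weight vectors w_i, each of length L (updated in place)
    source_frames -- list of m vectors with the latest L samples of each source x_i(n)
    d_n           -- current microphone sample d(n)
    mu            -- step factor, 0 < mu < 2
    alpha         -- protection coefficient against a too-small input power
    """
    x_sum = np.sum(source_frames, axis=0)          # x(n) = sum_i x_i(n)
    power = float(x_sum @ x_sum) + alpha           # x^T(n) x(n) + alpha
    e_n = d_n - sum(float(w @ x) for w, x in zip(filters, source_frames))
    for w, x in zip(filters, source_frames):
        w += (mu * e_n / power) * x
    return e_n

# In the method: call nlms_update only when the detection function indicates that
# d(n) contains no voice command; otherwise keep the most recently updated filters
# and only compute the echo estimate (steps S140/S150 and S1211).
```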
Further, after removing the echo y (n) in the speech signal d (n), it is necessary to remove the noise s (n) in the speech signal d (n). Correspondingly, step S130 is: and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.
The deep neural network model is a deep learning-based neural network model and comprises an input layer, a hidden layer and an output layer, wherein the hidden layer can comprise a plurality of layers. The neurons of each layer may be constructed separately first and then trained using a deep learning algorithm to obtain the weights and biases for each neuron of each layer.
Further, as shown in fig. 7, the step S130 may include steps S131 to S132.
Step S131 is: and constructing a deep neural network model.
Assuming that the deep neural network model has L layers in total, of which the hidden layers comprise L-2 layers (L > 2) and the input layer and the output layer are one layer each, and that the number of neurons in any layer l (1 < l ≤ L) is $M_l$, the output function of the i-th neuron (1 ≤ i ≤ $M_l$) of layer l is

$$z_i^{l} = \sum_{j=1}^{M_{l-1}} w_{ij}^{l}\, a_j^{l-1} + b_i^{l}$$

where $w_{ij}^{l}$ is the weight parameter connecting the j-th neuron of layer l-1 with the i-th neuron of layer l, $a_j^{l-1}$ is the activation function value of the j-th neuron of layer l-1, i.e. $a_j^{l-1} = f(z_j^{l-1})$, and $b_i^{l}$ is the bias parameter of the i-th neuron of layer l. In addition, the output function value $z_i^{1}$ of the i-th neuron (1 ≤ i ≤ $M_1$) of the first layer is the i-th input speech value of the input layer of the deep neural network model, and the activation function value of the i-th neuron (1 ≤ i ≤ $M_1$) of the first layer is $a_i^{1} = z_i^{1}$.
It will be appreciated that the activation function f(x) may be a Sigmoid function. The Sigmoid function and its derivative are as follows:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (4)$$

$$f'(x) = f(x)\left(1 - f(x)\right) \qquad (5)$$
the input speech of the input layer is an amplitude spectrum obtained by converting actual speech by fourier transform. Correspondingly, the final amplitude spectrum is obtained after the output functions of all layers are denoised, and then the denoised actual voice can be obtained by performing inverse Fourier transform on the final amplitude spectrum.
Further, step S132 is: training the deep neural network model to obtain a weight parameter and a bias parameter of the neural network model.
The specific training process may be as shown in fig. 8, and step S132 may include steps S1321 to S1324.
Wherein, step S1321 is: and acquiring a pure voice command signal and a noise signal of an actual application environment.
Taking a cab of a rail transit system as an example of a practical application environment, noise is sound collected when echo and voice command signals do not exist in the cab. The clean voice command signal is the voice of the voice command collected in the environment without noise and echo. It is understood that the clean speech instruction signal used to train the deep neural network model may be any speech, and is not required to be the control command speech in the actual application.
Step S1322 is: and mixing the pure voice instruction signal with the noise signal to obtain a noisy voice instruction signal, wherein the pure voice instruction is a label value of the noisy voice instruction signal.
Step S1323 is: and inputting the voice command signal with noise as input voice to an input layer of the deep neural network model to obtain a predicted voice command signal which is output by the output layer and corresponds to the voice command signal with noise.
It can be understood that the tag value is a pure voice command actually corresponding to the noisy voice command signal, and therefore, the matching degree of the predicted voice command signal corresponding to the noisy voice command signal obtained by using the deep neural network model and the corresponding tag value can be used as a measurement index of the accuracy degree of the deep neural network model.
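For illustration, steps S1321 and S1322 can be sketched as a small helper that mixes a clean command with recorded cab noise to build a (noisy input, clean label) training pair; the SNR-based mixing and the function name are assumptions introduced only for this example:

```python
import numpy as np

def make_training_pair(clean_cmd, cab_noise, snr_db=10.0):
    """Mix a clean command with cab noise at a given SNR (step S1322).

    Returns (noisy_signal, clean_signal); the clean signal is the label value.
    """
    n = min(len(clean_cmd), len(cab_noise))
    clean, noise = clean_cmd[:n], cab_noise[:n]
    # Scale the noise so the mixture has the requested signal-to-noise ratio
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + noise, clean
```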
Step S1324 is: and comparing the label value of the noisy speech instruction signal with the corresponding predicted speech instruction signal to update the weight parameter and the bias parameter of the deep neural network model.
It will be appreciated that a cost function may be constructed to measure how well the tag value of the noisy speech command signal matches its corresponding predicted speech command signal and to update the weight parameters and bias parameters based on the match.
In one embodiment, as shown in FIG. 9, step S1324 may include steps S910-S920.
Step S910 is: and determining a cost function value of the predicted voice command signal corresponding to the noisy voice command signal relative to the label value thereof by using Mean-Square Error (MSE).
Then, the cost function is as follows:
$$C = \frac{1}{M_L}\sum_{k=1}^{M_L}\left(y_k - \hat{y}_k\right)^{2} \qquad (6)$$

where $M_L$ is the number of neurons in the output layer of the deep neural network model and can be understood as the dimension of the output data, $y_k$ is the label value of the noisy voice command signal corresponding to the k-th neuron, and $\hat{y}_k$ is the predicted voice command signal, corresponding to the k-th neuron, of the noisy voice command signal.
It can be understood that a smaller cost function of the noisy speech command signal indicates a higher accuracy of the deep neural network model.
Step S920 is: and continuously updating the weight parameters and the bias parameters of the deep neural network model by using a back propagation process and a Stochastic Gradient Descent (SGD) algorithm based on the cost function values.
It can be appreciated that the training process of the deep neural network model is repeated until the accuracy requirement is met. The weight parameter $w_{ij}^{l}$ connecting the j-th neuron of layer l-1 with the i-th neuron (1 ≤ i ≤ $M_l$) of layer l (1 < l ≤ L), and the bias parameter $b_i^{l}$ of the i-th neuron of layer l, are then updated by the following back propagation procedure:

$$w_{ij}^{l} \leftarrow w_{ij}^{l} - \eta\,\frac{\partial C}{\partial w_{ij}^{l}} = w_{ij}^{l} - \eta\,\delta_i^{l}\, a_j^{l-1} \qquad (7)$$

$$b_i^{l} \leftarrow b_i^{l} - \eta\,\frac{\partial C}{\partial b_i^{l}} = b_i^{l} - \eta\,\delta_i^{l} \qquad (8)$$

where, for the output layer, $\delta_i^{L} = \frac{\partial C}{\partial a_i^{L}}\, f'(z_i^{L})$, and for a hidden layer, $\delta_i^{l} = \left(\sum_{k=1}^{M_{l+1}} w_{ki}^{l+1}\,\delta_k^{l+1}\right) f'(z_i^{l})$. In addition, η is a proportionality coefficient that represents the learning rate of the deep neural network model.
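As an illustration only, the combination of the mean-square-error cost of equation (6) with the stochastic-gradient-descent back propagation of equations (7) and (8) can be sketched in NumPy as follows; the function name, the learning rate and the per-frame update scheme are assumptions rather than part of the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(weights, biases, spectrum_frame, label_frame, eta=0.01):
    """One stochastic-gradient-descent update of equations (6)-(8) (sketch).

    weights / biases are lists of per-layer parameters as in the earlier
    forward-pass sketch; eta is the learning rate (the proportionality coefficient).
    """
    # Forward pass, keeping every activation a^1 ... a^L
    acts = [np.asarray(spectrum_frame, dtype=float)]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))

    M_L = len(label_frame)
    # Output-layer delta: dC/da * f'(z), with C the mean square error of equation (6)
    delta = (2.0 / M_L) * (acts[-1] - label_frame) * acts[-1] * (1.0 - acts[-1])
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, acts[l])        # dC/dw_ij^l = delta_i^l * a_j^(l-1)
        grad_b = delta
        if l > 0:
            # Back-propagate the delta to the previous layer
            delta = (weights[l].T @ delta) * acts[l] * (1.0 - acts[l])
        weights[l] -= eta * grad_W               # equation (7)
        biases[l] -= eta * grad_b                # equation (8)
```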
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
According to another aspect of the invention, a voice processing device is also provided, which is suitable for voice processing of the input voice command of the voice interaction system.
In one embodiment, as shown in FIG. 10, the speech processing apparatus 1000 includes a memory 1010 and a processor 1020.
The memory 1010 is used for storing computer programs.
The processor 1020 is connected to the memory 1010 for executing the computer program on the memory 1010, and the steps of the speech processing method 100 in any of the above embodiments are implemented when the processor 1020 executes the computer program on the memory 1010.
According to yet another aspect of the present invention, there is also provided a computer storage medium having a computer program stored thereon, wherein the computer program is configured to implement the steps of the speech processing method 100 in any of the above embodiments when executed.
Those of skill in the art would understand that information, signals, and data may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits (bits), symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. It is to be understood that the scope of the invention is to be defined by the appended claims and not by the specific constructions and components of the embodiments illustrated above. Those skilled in the art can make various changes and modifications to the embodiments within the spirit and scope of the present invention, and these changes and modifications also fall within the scope of the present invention.

Claims (21)

1. A method of speech processing comprising:
acquiring a voice signal acquired by a microphone;
eliminating echo in the voice signal by using an echo elimination model to obtain an intermediate voice signal; and
and removing the noise signal in the intermediate voice signal by using a deep neural network model to obtain a voice instruction signal in the voice signal.
2. The speech processing method of claim 1 wherein said canceling the echo in the speech signal using the echo cancellation model for the far-end signal to obtain an intermediate speech signal comprises:
performing echo estimation on the sound source based on the echo by using the echo cancellation model to obtain an echo estimation value of the voice signal; and
subtracting the echo estimation value from the voice signal to obtain the intermediate voice signal.
3. The speech processing method of claim 2 wherein the echoes in the speech signal comprise echoes from a plurality of sound sources, wherein the echo cancellation model comprises a plurality of adaptive filters corresponding to the plurality of sound sources, respectively, and wherein the performing the echo estimation from the echo-based sound source using the echo cancellation model to obtain the echo estimation comprises:
respectively carrying out echo estimation on the plurality of sound sources by adopting the plurality of adaptive filters to respectively obtain echo estimation values of the plurality of sound sources; and
and calculating the sum of the echo estimation values of the sound sources to be used as the echo estimation value of the voice signal.
4. The speech processing method of claim 3, further comprising:
judging whether the voice signal comprises a voice instruction signal or not; and
the step of performing echo estimation on the sound source based on the echo by using the echo cancellation model to obtain an echo estimation value further comprises:
updating the plurality of adaptive filters with the plurality of sound sources in response to not including a voice instruction signal in the voice signal; and
and performing echo estimation on the plurality of sound sources by adopting a plurality of adaptive filters which are updated recently in response to the voice command signals included in the voice signals.
5. The speech processing method of claim 4, wherein the determining whether the speech signal includes a speech instruction signal comprises:
calculating a detection function using the plurality of sound sources and the speech signal collected by the microphone,

$$\xi = \sqrt{\frac{r_{xd}^{T} R_{xx}^{-1} r_{xd}}{\sigma_d^{2}}}$$

wherein $r_{xd} = E[x(n)d(n)] = R_{xx}h$, x(n) is the sum of the sound sources, d(n) is the speech signal, $R_{xx} = E[x(n)x^{T}(n)]$ is the autocorrelation matrix of x(n), h is the echo path, $\sigma_d^{2}$ is the variance of the speech signal d(n), $\sigma_y^{2}$ is the variance of the echo y(n), $\sigma_s^{2}$ is the variance of the noise signal s(n), and $\sigma_v^{2}$ is the variance of the voice command signal v(n);
responding to the detection function value being larger than or equal to a preset threshold value, and judging that the voice signal does not comprise a voice instruction signal; and
and responding to the condition that the detection function value is smaller than the preset threshold value, and judging that the voice signal comprises a voice instruction signal.
6. The speech processing method of claim 4, wherein assuming that the plurality of sound sources are m sound sources, the plurality of filters are m filters corresponding to the m sound sources, m > 1, and updating the plurality of adaptive filters using the plurality of sound sources comprises:
updating the i-th adaptive filter of the plurality of adaptive filters with the parameter update formula

$$\omega_i(n+1) = \omega_i(n) + \frac{\mu\, e(n)\, x_i(n)}{x^{T}(n)x(n) + \alpha}$$

wherein $e(n) = d(n) - \sum_{j=1}^{m}\omega_j^{T}(n)x_j(n)$, d(n) is the speech signal, $x(n) = \sum_{i=1}^{m} x_i(n)$ is the sum of the sound source signals of the m sound sources, $x_i(n)$ is the sound source signal of the i-th of the m sound sources, L is the filter length, μ is a step factor with 0 < μ < 2, and α is a protection coefficient.
7. The speech processing method of claim 1, wherein the deep neural network model comprises an input layer, a hidden layer, and an output layer, and removing noise from the intermediate speech signal using the deep neural network model to obtain the voice instruction signal in the speech signal comprises:
inputting the intermediate speech signal as input speech to the input layer of the deep neural network model, and taking the output signal of the output layer as the voice instruction signal.
8. The speech processing method of claim 7, further comprising:
constructing the deep neural network model according to a layer-wise formula [the formula and its symbols are given as images in the original publication] that expresses, for any one of the hidden layer and the output layer, the output function of the i-th neuron of the l-th layer in terms of: the weight parameters connecting the j-th neurons of the (l-1)-th layer to the i-th neuron of the l-th layer, the activation function values of the j-th neurons of the (l-1)-th layer, a Sigmoid activation function f(x), and the bias parameter of the i-th neuron of the l-th layer, where M_{l-1} is the number of neurons in the (l-1)-th layer, the output function value of the i-th neuron of the input layer is the i-th input speech value of the deep neural network model, and the activation function value of the i-th neuron of the input layer is equal to its output function value; and
training the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model.
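The layer formula in claim 8 is published as an image; the definitions given there (weighted sum of previous-layer activations plus a bias, passed through a Sigmoid) suggest the standard fully-connected form sketched below. The function names and the list-of-matrices representation are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(inputs, weights, biases):
    """Feed-forward pass of the fully connected network described in claim 8,
    under the assumed standard form: each hidden/output layer applies a weighted
    sum of the previous layer's activations plus a bias, followed by the Sigmoid.

    inputs:  1D array fed to the input layer (its activation equals its value).
    weights: list of matrices, one per hidden/output layer.
    biases:  list of vectors matching the weight matrices.
    """
    a = np.asarray(inputs, dtype=float)
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a   # output-layer signal, taken as the voice instruction estimate
```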
9. The speech processing method of claim 8, wherein training the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model comprises:
collecting a clean voice instruction signal and a noise signal of the actual application environment;
mixing the clean voice instruction signal with the noise signal to obtain a noisy voice instruction signal, wherein the clean voice instruction signal serves as the label value of the noisy voice instruction signal;
inputting the noisy voice instruction signal as input speech to the input layer of the deep neural network model to obtain, from the output layer, a predicted voice instruction signal corresponding to the noisy voice instruction signal; and
comparing the label value of the noisy voice instruction signal with its corresponding predicted voice instruction signal to update each weight parameter and each bias parameter of the deep neural network model.
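A sketch of the data-preparation step in claim 9: mix a clean voice instruction with recorded environment noise and keep the clean signal as the label. The SNR scaling and function name are illustrative additions; the claim only requires mixing the two signals.

```python
import numpy as np

def make_training_pair(clean_command, cab_noise, snr_db=5.0):
    """Form one (noisy input, clean label) training pair from a clean voice
    instruction and environment noise (assumed to be at least as long)."""
    noise = cab_noise[:len(clean_command)]
    p_clean = np.mean(clean_command ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mixture reaches the requested SNR (added assumption).
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    noisy = clean_command + gain * noise
    return noisy, clean_command   # the clean signal is the label value
```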
10. The speech processing method of claim 9, wherein comparing the label value of the noisy voice instruction signal with its corresponding predicted voice instruction signal to update each weight parameter and each bias parameter of the deep neural network model comprises:
determining, with a mean-squared-error algorithm, a cost function value of the predicted voice instruction signal corresponding to the noisy voice instruction signal relative to its label value; and
iteratively updating each weight parameter and each bias parameter of the deep neural network model through back propagation and stochastic gradient descent based on the cost function value.
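A compact sketch of the training step in claim 10: mean-squared-error cost, back propagation, and a stochastic-gradient-descent update. The single hidden layer, variable names, and learning rate are illustrative assumptions; the gradients follow from the Sigmoid layers of the claim-8 sketch.

```python
import numpy as np

def train_step(x, label, W1, b1, W2, b2, lr=0.01):
    """One SGD step on a single (noisy, clean) pair with a mean-squared-error cost."""
    # Forward pass through two Sigmoid layers.
    a1 = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    a2 = 1.0 / (1.0 + np.exp(-(W2 @ a1 + b2)))        # predicted voice instruction signal
    cost = np.mean((a2 - label) ** 2)

    # Backward pass: chain rule through the Sigmoid activations.
    delta2 = (2.0 / a2.size) * (a2 - label) * a2 * (1.0 - a2)
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)

    # Stochastic-gradient-descent updates of weights and biases.
    W2 -= lr * np.outer(delta2, a1)
    b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x)
    b1 -= lr * delta1
    return cost
```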
11. A speech processing apparatus, comprising:
a memory for storing a computer program; and
a processor coupled to the memory for executing the computer program on the memory, the processor being configured to:
acquire a voice signal collected by a microphone;
cancel echo in the voice signal using an echo cancellation model based on a far-end signal to obtain an intermediate voice signal; and
remove the noise signal in the intermediate voice signal using a deep neural network model to obtain a voice instruction signal in the voice signal.
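The apparatus claims mirror the method claims. For illustration only, the two processing stages of claim 11 could be chained as below, reusing the `forward` function from the claim-8 sketch; the frame-wise interface is an assumption.

```python
import numpy as np

def process_frame(mic_frame, echo_estimate_frame, weights, biases):
    """End-to-end flow of claim 11 (illustrative): subtract the echo estimate
    produced by the adaptive filters, then remove residual noise with the DNN
    forward pass to recover the voice instruction signal."""
    intermediate = np.asarray(mic_frame, dtype=float) - np.asarray(echo_estimate_frame, dtype=float)
    return forward(intermediate, weights, biases)
```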
12. The speech processing apparatus of claim 11, wherein the processor is further configured to:
perform echo estimation based on the sound source of the echo using the echo cancellation model to obtain an echo estimation value of the voice signal; and
subtract the echo estimation value from the voice signal to obtain the intermediate voice signal.
13. The speech processing apparatus of claim 12, wherein the echo in the speech signal comprises echoes from a plurality of sound sources, the echo cancellation model comprises a plurality of adaptive filters corresponding to the plurality of sound sources respectively, and the processor is further configured to:
perform echo estimation on the plurality of sound sources with the plurality of adaptive filters, respectively, to obtain an echo estimation value for each sound source; and
take the sum of the echo estimation values of the sound sources as the echo estimation value of the voice signal.
14. The speech processing apparatus of claim 13, wherein the processor is further configured to:
determine whether the voice signal includes a voice instruction signal;
update the plurality of adaptive filters with the plurality of sound sources in response to the voice signal not including a voice instruction signal; and
perform echo estimation on the plurality of sound sources with the most recently updated adaptive filters in response to the voice signal including a voice instruction signal.
15. The speech processing apparatus of claim 14, wherein the processor is further configured to:
calculate a detection function from the plurality of sound sources and the voice signal collected by the microphone [the detection function and auxiliary formulas are given as images in the original publication], where r_xd = E[x(n)d(n)] = R_xx·h, R_xx = E[x(n)x^T(n)], x(n) is the sum of the sound sources, d(n) is the speech signal, R_xx is the autocorrelation matrix of x(n), h is the echo path, σ_d² is the variance of the speech signal d(n), σ_y² is the variance of the echo y(n), σ_s² is the variance of the noise signal s(n), and σ_v² is the variance of the voice instruction signal v(n);
determine that the voice signal does not include a voice instruction signal in response to the detection function value being greater than or equal to a preset threshold; and
determine that the voice signal includes a voice instruction signal in response to the detection function value being less than the preset threshold.
16. The speech processing apparatus of claim 14, wherein, assuming the plurality of sound sources are m sound sources and the plurality of adaptive filters are m filters corresponding to the m sound sources, m > 1, the processor is further configured to:
update the i-th adaptive filter of the plurality of adaptive filters with a parameter update formula [given as an image in the original publication], where y(n) is the speech signal, x(n) is the sum of the sound source signals of the m sound sources, x_i is the sound source signal of the i-th sound source among the m sound sources, L is a parameter of the update formula, μ is a step factor with 0 < μ < 2, and α is a protection coefficient.
17. The speech processing apparatus of claim 11, wherein the deep neural network model comprises an input layer, a hidden layer, and an output layer, and the processor is further configured to:
input the intermediate voice signal as input speech to the input layer of the deep neural network model, and take the output signal of the output layer as the voice instruction signal.
18. The speech processing apparatus of claim 17, wherein the processor is further configured to:
construct the deep neural network model according to a layer-wise formula [the formula and its symbols are given as images in the original publication] that expresses, for any one of the hidden layer and the output layer, the output function of the i-th neuron of the l-th layer in terms of: the weight parameters connecting the j-th neurons of the (l-1)-th layer to the i-th neuron of the l-th layer, the activation function values of the j-th neurons of the (l-1)-th layer, a Sigmoid activation function f(x), and the bias parameter of the i-th neuron of the l-th layer, where M_{l-1} is the number of neurons in the (l-1)-th layer, the output function value of the i-th neuron of the input layer is the i-th input speech value of the deep neural network model, and the activation function value of the i-th neuron of the input layer is equal to its output function value; and
train the deep neural network model to obtain each weight parameter and each bias parameter of the neural network model.
19. The speech processing apparatus of claim 18, wherein the speech processing apparatus is adapted to a cab of a rail transit vehicle, and the processor is further configured to:
collect a clean voice instruction signal and a noise signal of the actual application environment;
mix the clean voice instruction signal with the noise signal to obtain a noisy voice instruction signal, wherein the clean voice instruction signal serves as the label value of the noisy voice instruction signal;
input the noisy voice instruction signal as input speech to the input layer of the deep neural network model to obtain, from the output layer, a predicted voice instruction signal corresponding to the noisy voice instruction signal; and
compare the label value of the noisy voice instruction signal with its corresponding predicted voice instruction signal to update each weight parameter and each bias parameter of the deep neural network model.
20. The speech processing apparatus of claim 19, wherein the processor is further configured to:
determine, with a mean-squared-error algorithm, a cost function value of the predicted voice instruction signal corresponding to the noisy voice instruction signal relative to its label value; and
iteratively update each weight parameter and each bias parameter of the deep neural network model through back propagation and stochastic gradient descent based on the cost function value.
21. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed, implements the steps of the speech processing method according to any one of claims 1 to 10.
CN202010942560.4A 2020-09-09 Voice processing method and device Active CN114242106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010942560.4A CN114242106B (en) 2020-09-09 Voice processing method and device

Publications (2)

Publication Number Publication Date
CN114242106A (en) 2022-03-25
CN114242106B (en) 2024-10-29

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101262530A (en) * 2008-04-29 2008-09-10 中兴通讯股份有限公司 A device for eliminating echo of mobile terminal
KR20120022101A (en) * 2010-09-01 2012-03-12 (주)제이유디지탈 Noise reduction method and device in voice communication of iptv
US20160019909A1 (en) * 2013-03-15 2016-01-21 Dolby Laboratories Licensing Corporation Acoustic echo mitigation apparatus and method, audio processing apparatus and voice communication terminal
US9286883B1 (en) * 2013-09-26 2016-03-15 Amazon Technologies, Inc. Acoustic echo cancellation and automatic speech recognition with random noise
CN104751842A (en) * 2013-12-31 2015-07-01 安徽科大讯飞信息科技股份有限公司 Method and system for optimizing deep neural network
KR101592425B1 (en) * 2014-09-24 2016-02-05 현대자동차주식회사 Speech preprocessing apparatus, apparatus and method for speech recognition
CN105957520A (en) * 2016-07-04 2016-09-21 北京邮电大学 Voice state detection method suitable for echo cancellation system
US20180040333A1 (en) * 2016-08-03 2018-02-08 Apple Inc. System and method for performing speech enhancement using a deep neural network-based signal
US20200105287A1 (en) * 2017-04-14 2020-04-02 Industry-University Cooperation Foundation Hanyang University Deep neural network-based method and apparatus for combining noise and echo removal
CN107017004A (en) * 2017-05-24 2017-08-04 建荣半导体(深圳)有限公司 Noise suppressing method, audio processing chip, processing module and bluetooth equipment
CN111161752A (en) * 2019-12-31 2020-05-15 歌尔股份有限公司 Echo cancellation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant