CN105448303B - Voice signal processing method and device - Google Patents


Info

Publication number
CN105448303B
CN105448303B (application CN201510866175.5A)
Authority
CN
China
Prior art keywords
voice
noise
signal
sample signal
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510866175.5A
Other languages
Chinese (zh)
Other versions
CN105448303A (en)
Inventor
时雪煜
李先刚
邹赛赛
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510866175.5A
Publication of CN105448303A
Application granted
Publication of CN105448303B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a voice signal processing method and device. The method comprises the following steps: collecting a noise sample signal; processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal containing noise; and training a voice model according to the noise voice sample signal and the pure voice sample signal. With the voice model obtained by the voice signal processing method of the embodiments of the invention, the accuracy of voice recognition in a noise environment can be greatly improved, and the robustness of the voice recognition service and the user experience of that service are improved.

Description

Voice signal processing method and device
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for processing a speech signal.
Background
Speech recognition refers to the automatic conversion of human speech into corresponding text by a machine. In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, the performance of recognition systems has improved greatly.
In the related art, the acoustic model and the language model used in speech recognition are trained on a large number of pure voice samples. The larger the training set, the better the resulting acoustic model and the higher the accuracy of speech recognition.
However, with the development of the mobile internet, voice input has become increasingly common, voice users are increasingly widespread, and the environments in which users speak differ greatly, especially noisy environments such as vehicle-mounted noise while an automobile is being driven, or crowd noise in a restaurant or other crowded places. Existing speech recognition training lacks noise voice samples, and the acoustic characteristics of noise voice samples differ greatly from those of pure voice samples. As a result, the acoustic model of the related art achieves high recognition accuracy in a quiet environment, but its accuracy drops sharply in a noise environment.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present invention is to provide a method for processing a voice signal, which greatly improves the accuracy of voice recognition in a noise environment and improves the robustness of the voice recognition service and the user experience of that service.
A second object of the present invention is to provide a processing apparatus for speech signals.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a method for processing a speech signal, including the following steps: collecting a noise sample signal; processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise; and training a voice model according to the noise voice sample signal and the pure voice sample signal.
According to the voice signal processing method, the noise voice samples are generated according to the noise samples and the pure voice samples of different scenes, and the voice model is trained according to the noise voice samples and the pure voice samples, so that voice signals in various noise environments can be converted into voice signals in a quiet environment through the voice model, the accuracy of voice recognition in the noise environment is greatly improved, and the robustness of voice recognition service and the experience of the voice recognition service are improved.
In order to achieve the above object, an embodiment of a second aspect of the present invention provides a voice signal processing apparatus, including: a first acquisition module, configured to collect a noise sample signal; a first processing module, configured to process a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal containing noise; and a first training module, configured to train a voice model according to the noise voice sample signal and the pure voice sample signal.
According to the voice signal processing device, the noise voice samples are generated according to the noise samples and the pure voice samples of different scenes, and the voice model is trained according to the noise voice samples and the pure voice samples, so that voice signals in various noise environments can be converted into voice signals in a quiet environment through the voice model, the accuracy of voice recognition in the noise environment is greatly improved, and the robustness of voice recognition service and the experience of the voice recognition service are improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method of processing a speech signal according to one embodiment of the invention;
FIG. 2 is a flow chart of a method of processing a speech signal according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method of processing a speech signal according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a speech signal processing apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech signal processing apparatus according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
A method and apparatus for processing a speech signal according to an embodiment of the present invention will be described below with reference to the accompanying drawings.
A method of processing a speech signal, comprising the steps of: a. collecting a noise sample signal; b. processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise; c. a speech model is trained based on the noisy speech sample signal and the clean speech sample signal.
Fig. 1 is a flowchart of a speech signal processing method according to an embodiment of the present invention.
As shown in fig. 1, the processing method of the voice signal includes the following steps:
and S101, collecting a noise sample signal.
Specifically, scene noise that may occur during voice recognition is collected as the noise sample signal. The scene noise may be collected in a number of different scenes: for example, vehicle-mounted noise while an automobile is being driven, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals collected in different environments can be processed, and the higher the accuracy of voice recognition.
And S102, processing the pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise.
The pure voice sample signal is a voice sample signal recorded in a quiet environment, i.e., a voice signal that contains no noise. That is, the collected noise sample signal is used to add noise to the voice sample signal recorded in the quiet environment, so as to obtain a voice sample signal in a noise environment, i.e., the noise voice sample signal.
It should be understood that any existing method may be used to add noise to the pure voice sample signal, and the details are not repeated here.
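As an illustration of this noise-adding step, the sketch below mixes a collected noise sample into a pure voice sample at a chosen signal-to-noise ratio. The patent does not specify the mixing procedure; the function name, the SNR-based scaling, and the tiling of short noise clips are all assumptions made for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that mixing it into `clean` yields the target SNR (dB)."""
    # Tile and trim the noise clip to the length of the clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain g so that clean_power / (g^2 * noise_power) = 10^(snr_db/10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

Mixing the same pure sample with several noise types and SNR levels multiplies the amount of noise voice training data without any extra recording.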
And S103, training a voice model according to the noise voice sample signal and the pure voice sample signal.
In one embodiment of the present invention, training the voice model according to the noise voice sample signal and the pure voice sample signal further comprises: extracting acoustic features of the noise voice sample signal and the pure voice sample signal, and establishing a mapping between the acoustic features of the noise voice sample signal and the acoustic features of the pure voice sample signal.
Specifically, the speech model can be obtained by extracting acoustic features of the noise speech sample signal and the clean speech sample signal, and establishing a mapping from the acoustic features of the noise speech sample signal to the acoustic features of the clean speech sample signal through a recurrent neural network.
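The patent does not specify the network architecture, so the following is only a minimal sketch of the idea: a simple Elman-style recurrent pass that maps a sequence of noisy acoustic feature vectors, frame by frame, to estimates of the clean features. All names and dimensions are hypothetical; a real system would learn the weights by minimizing the error against the pure-sample features.

```python
import numpy as np

def rnn_map_features(noisy_feats, Wx, Wh, Wo, bh, bo):
    """Forward pass of a simple Elman RNN mapping noisy acoustic feature
    vectors (one per frame) to clean-feature estimates."""
    h = np.zeros(Wh.shape[0])
    outputs = []
    for x in noisy_feats:
        h = np.tanh(Wx @ x + Wh @ h + bh)  # recurrent hidden state
        outputs.append(Wo @ h + bo)        # linear readout: clean-feature estimate
    return np.array(outputs)
```

Because the hidden state carries context across frames, the mapping can exploit the temporal structure of both the speech and the noise, which is one reason a recurrent network is a natural fit here.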
In the voice recognition process, because the voice model is trained on noise voice sample signals, a voice signal recorded in a noise environment can be mapped to a voice signal as if it had been recorded in a quiet environment, so the signal can be recognized accurately and the accuracy of voice recognition is improved. Moreover, the recurrent neural network is highly robust: even for scene noise that did not appear in training, it can still establish a good mapping from the voice signal in the noise environment to a voice signal in a quiet environment, so that voice recorded in unseen scene noise is also recognized accurately, further improving the accuracy of voice recognition.
According to the voice signal processing method, the noise voice samples are generated according to the noise samples and the pure voice samples of different scenes, and the voice model is trained according to the noise voice samples and the pure voice samples, so that voice signals in various noise environments can be converted into voice signals in a quiet environment through the voice model, the accuracy of voice recognition in the noise environment is greatly improved, and the robustness of voice recognition service and the experience of the voice recognition service are improved.
Fig. 2 is a flow chart of a speech signal processing method according to an embodiment of the invention.
As shown in fig. 2, the processing method of the voice signal includes the following steps:
s201, collecting a noise sample signal.
Specifically, scene noise that may occur during voice recognition is collected as the noise sample signal. The scene noise may be collected in a number of different scenes: for example, vehicle-mounted noise while an automobile is being driven, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals collected in different environments can be processed, and the higher the accuracy of voice recognition.
S202, processing the pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise.
The pure voice sample signal is a voice sample signal recorded in a quiet environment, i.e., a voice signal that contains no noise. That is, the collected noise sample signal is used to add noise to the voice sample signal recorded in the quiet environment, so as to obtain a voice sample signal in a noise environment, i.e., the noise voice sample signal.
It should be understood that any existing method may be used to add noise to the pure voice sample signal, and the details are not repeated here.
And S203, training a voice model according to the noise voice sample signal and the pure voice sample signal.
In one embodiment of the present invention, training the voice model according to the noise voice sample signal and the pure voice sample signal further comprises: extracting acoustic features of the noise voice sample signal and the pure voice sample signal, and establishing a mapping between the acoustic features of the noise voice sample signal and the acoustic features of the pure voice sample signal.
Specifically, the speech model can be obtained by extracting acoustic features of the noise speech sample signal and the clean speech sample signal, and establishing a mapping from the acoustic features of the noise speech sample signal to the acoustic features of the clean speech sample signal through a recurrent neural network.
In the voice recognition process, because the voice model is trained on noise voice sample signals, a voice signal recorded in a noise environment can be mapped to a voice signal as if it had been recorded in a quiet environment, so the signal can be recognized accurately and the accuracy of voice recognition is improved. Moreover, the recurrent neural network is highly robust: even for scene noise that did not appear in training, it can still establish a good mapping from the voice signal in the noise environment to a voice signal in a quiet environment, so that voice recorded in unseen scene noise is also recognized accurately, further improving the accuracy of voice recognition.
And S204, collecting voice signals input by the user.
Specifically, a voice signal of the user may be collected through a voice input device, such as a microphone, and then the collected voice signal may be sent to the server for voice recognition. The trained voice model can be stored in a voice recognition cloud, and the collected voice signal is sent to the cloud for voice recognition.
S205, determine whether the speech signal contains noise.
Specifically, after receiving the voice signal input by the user, the server estimates the signal-to-noise ratio of the voice signal so as to classify it. For example, when the signal-to-noise ratio of the voice signal is less than a certain threshold, the voice signal is judged to contain noise; when the signal-to-noise ratio is greater than or equal to the threshold, the voice signal is judged not to contain noise.
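A minimal sketch of such a signal-to-noise-ratio check is shown below. The patent does not say how the SNR is estimated; here the quietest frames of the utterance are taken as the noise floor, and the frame length and threshold value are arbitrary placeholders.

```python
import numpy as np

def estimate_snr_db(signal, frame_len=400, noise_fraction=0.1):
    """Crude SNR estimate: treat the quietest frames as the noise floor."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sort(np.mean(frames ** 2, axis=1))
    k = max(1, int(n_frames * noise_fraction))
    noise_power = np.mean(energies[:k])   # quietest frames approximate the noise
    signal_power = np.mean(energies)      # overall average frame power
    return 10 * np.log10(signal_power / max(noise_power, 1e-12))

def contains_noise(signal, threshold_db=20.0):
    """Classify the utterance as noise-containing when the SNR is below the threshold."""
    return estimate_snr_db(signal) < threshold_db
```

This keeps the classification cheap: the server only decides whether to route the utterance through the denoising voice model, not where the noise is.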
S206, if the voice signal contains noise, denoising the voice signal according to the voice model.
Specifically, if it is determined that the speech signal contains noise, it may be determined that the speech signal is recorded in a noisy environment, and at this time, the speech signal needs to be denoised according to a speech model pre-stored in the server, that is, the speech signal recorded by the user is converted into a speech signal in a quiet environment through a recurrent neural network.
In an embodiment of the present invention, the collected noise-containing voice signal is converted into a voice signal that contains no noise according to the mapping, stored in the voice model, from noise voice samples to pure voice samples.
And S207, performing voice recognition on the voice signal subjected to the denoising processing according to the acoustic model.
Specifically, after denoising is performed on a voice signal input by a user, voice recognition is performed through a decoder of the server, that is, the decoder decodes the denoised voice signal according to an acoustic model prestored in the server, converts the voice signal into text information, and then feeds back a recognition result to the user. Wherein the acoustic model is obtained by training a large number of clean speech samples.
S208, if the voice signal does not contain noise, performing voice recognition on the voice signal according to the acoustic model.
Specifically, if the voice signal is judged not to contain noise, it can be determined that the voice signal is recorded in a quiet environment, at this time, the voice signal does not need to be denoised by a voice model, but the voice signal is directly decoded by a decoder of the server according to an acoustic model, converted into text information, and then the recognition result is fed back to the user.
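Steps S205 to S208 can be summarized as a small dispatch routine. The callables below (`is_noisy`, `denoise`, `decode`) are placeholders standing in for the SNR check, the voice model, and the decoder with its acoustic model; none of these names appear in the patent.

```python
def recognize(signal, is_noisy, denoise, decode):
    """Recognition flow of S205-S208: denoise only when the SNR check flags
    the utterance as noisy, then decode with the acoustic model."""
    if is_noisy(signal):
        signal = denoise(signal)  # map to a quiet-environment signal first
    return decode(signal)         # decoder + acoustic model -> text
```

Keeping the denoising step conditional means a quiet-environment utterance is never distorted by an unnecessary pass through the voice model.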
According to the voice signal processing method, in the voice recognition process, a voice signal recorded in a noise environment is preprocessed and converted into a voice signal in a quiet environment before voice recognition, while a voice signal recorded in a quiet environment is recognized directly. In this way, the accuracy of voice recognition in a quiet environment is preserved, the accuracy of voice recognition in a noise environment is greatly improved, and the accuracy, robustness and service experience of the voice recognition service are all improved.
Fig. 3 is a flow chart of a speech signal processing method according to another embodiment of the present invention.
As shown in fig. 3, the processing method of the voice signal includes the following steps:
s301, collecting a noise sample signal.
Specifically, scene noise that may occur during voice recognition is collected as the noise sample signal. The scene noise may be collected in a number of different scenes: for example, vehicle-mounted noise while an automobile is being driven, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals collected in different environments can be processed, and the higher the accuracy of voice recognition.
And S302, processing the pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise.
The pure voice sample signal is a voice sample signal recorded in a quiet environment, i.e., a voice signal that contains no noise. That is, the collected noise sample signal is used to add noise to the voice sample signal recorded in the quiet environment, so as to obtain a voice sample signal in a noise environment, i.e., the noise voice sample signal.
It should be understood that any existing method may be used to add noise to the pure voice sample signal, and the details are not repeated here.
S303, training a voice model according to the noise voice sample signal and the pure voice sample signal.
In one embodiment of the present invention, training the voice model according to the noise voice sample signal and the pure voice sample signal further comprises: extracting acoustic features of the noise voice sample signal and the pure voice sample signal, and establishing a mapping between the acoustic features of the noise voice sample signal and the acoustic features of the pure voice sample signal.
Specifically, the speech model can be obtained by extracting acoustic features of the noise speech sample signal and the clean speech sample signal, and establishing a mapping from the acoustic features of the noise speech sample signal to the acoustic features of the clean speech sample signal through a recurrent neural network.
In the voice recognition process, because the voice model is trained on noise voice sample signals, a voice signal recorded in a noise environment can be mapped to a voice signal as if it had been recorded in a quiet environment, so the signal can be recognized accurately and the accuracy of voice recognition is improved. Moreover, the recurrent neural network is highly robust: even for scene noise that did not appear in training, it can still establish a good mapping from the voice signal in the noise environment to a voice signal in a quiet environment, so that voice recorded in unseen scene noise is also recognized accurately, further improving the accuracy of voice recognition.
S304, obtaining a voice training sample signal.
Specifically, even if a voice signal collected in a noise environment is preprocessed in the voice recognition process, that is, denoised according to the voice model, the preprocessed voice signal may still contain residual noise. Therefore, in this embodiment, the acoustic model used for voice recognition is retrained according to the recurrent neural network, so that the retrained acoustic model better matches the preprocessed voice signal, further improving the accuracy of voice recognition.
The voice training sample signal is voice training data used for retraining the acoustic model, and the voice training sample signal is a voice signal in a noise environment, namely noise voice training data.
S305, denoising the voice training sample signal according to the voice model, and training an acoustic model according to the denoised voice training sample signal.
Specifically, the acoustic features of the voice training samples are first extracted; the acoustic features are then mapped through the recurrent neural network according to the voice model, and the existing acoustic model is retrained on the processed acoustic features, so that an acoustic model is obtained that better matches acoustic features processed by the voice model.
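Under the assumption that the trained voice model is available as a feature-mapping function, preparing the retraining data reduces to passing every noisy training utterance through it, as sketched below; `denoise` is a hypothetical placeholder for that mapping.

```python
import numpy as np

def prepare_retraining_features(noisy_train_utts, denoise):
    """Denoise every noisy training utterance with the trained voice model,
    so the acoustic model is retrained on the same (imperfectly denoised)
    features it will encounter at recognition time."""
    return [denoise(utt) for utt in noisy_train_utts]
```

The point of the design is matched conditions: training and test features pass through the identical front end, so any residual distortion the voice model introduces is seen by the acoustic model during retraining.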
And S306, collecting the voice signal input by the user.
Specifically, a voice signal of the user may be collected through a voice input device, such as a microphone, and then the collected voice signal may be sent to the server for voice recognition. The trained voice model can be stored in a voice recognition cloud, and the collected voice signal is sent to the cloud for voice recognition.
S307, it is determined whether the speech signal contains noise.
Specifically, after receiving the voice signal input by the user, the server estimates the signal-to-noise ratio of the voice signal so as to classify it. For example, when the signal-to-noise ratio of the voice signal is less than a certain threshold, the voice signal is judged to contain noise; when the signal-to-noise ratio is greater than or equal to the threshold, the voice signal is judged not to contain noise.
S308, if the voice signal contains noise, denoising the voice signal according to the voice model.
Specifically, if it is determined that the speech signal contains noise, it may be determined that the speech signal is recorded in a noisy environment, and at this time, the speech signal needs to be denoised according to a speech model pre-stored in the server, that is, the speech signal recorded by the user is converted into a speech signal in a quiet environment through a recurrent neural network.
In an embodiment of the present invention, the collected noise-containing voice signal is converted into a voice signal that contains no noise according to the mapping, stored in the voice model, from noise voice samples to pure voice samples.
And S309, performing voice recognition on the voice signal subjected to the denoising processing according to the acoustic model.
Specifically, after denoising is performed on a voice signal input by a user, voice recognition is performed through a decoder of the server, that is, the decoder decodes the denoised voice signal according to an acoustic model prestored in the server, converts the voice signal into text information, and then feeds back a recognition result to the user. Wherein the acoustic model is obtained by training a large number of clean speech samples.
According to the voice signal processing method, the existing acoustic model is retrained through the voice training sample, so that the retrained acoustic model is matched with the preprocessed voice signal better, the accuracy of voice recognition is further improved, and the experience of voice recognition service is improved.
In order to implement the above embodiments, the present invention further provides a processing apparatus for a speech signal.
Fig. 4 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention.
As shown in fig. 4, the speech signal processing apparatus includes: a first acquisition module 10, a first processing module 20 and a first training module 30.
The first acquisition module 10 is configured to collect a noise sample signal. Specifically, the first acquisition module 10 collects scene noise that may occur during voice recognition as the noise sample signal. The scene noise may be collected in a number of different scenes: for example, vehicle-mounted noise while an automobile is being driven, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals the first acquisition module 10 collects, the more accurately voice signals collected in different environments can be processed, and the higher the accuracy of voice recognition.
The first processing module 20 is configured to process a pre-stored clean speech sample signal according to the noise sample signal, so as to obtain a noise speech sample signal with noise. The clean speech sample signal is a speech sample signal in a quiet environment, i.e., a speech signal containing no noise signal. That is, the first processing module 20 performs noise processing on the speech sample signal in the quiet environment through the collected noise sample signal to obtain the speech sample signal in the noise environment, i.e., the noise speech sample signal.
The first training module 30 is used for training a speech model according to the noise speech sample signal and the clean speech sample signal. The first training module 30 extracts acoustic features of the noise speech sample signal and the clean speech sample signal, and establishes a mapping relationship between the acoustic features of the noise speech sample signal and the acoustic features of the clean speech sample signal. Specifically, the first training module 30 may obtain the speech model by extracting acoustic features of the noise speech sample signal and the clean speech sample signal, and establishing a mapping from the acoustic features of the noise speech sample signal to the acoustic features of the clean speech sample signal through a recurrent neural network.
According to the voice signal processing device, the noise voice samples are generated according to the noise samples and the pure voice samples of different scenes, and the voice model is trained according to the noise voice samples and the pure voice samples, so that voice signals in various noise environments can be converted into voice signals in a quiet environment through the voice model, the accuracy of voice recognition in the noise environment is greatly improved, and the robustness of voice recognition service and the experience of the voice recognition service are improved.
Fig. 5 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention.
As shown in fig. 5, the speech signal processing apparatus includes: a first acquisition module 10, a first processing module 20, a first training module 30, a second acquisition module 40, a second processing module 50, and a speech recognition module 60.
The second acquisition module 40 is configured to collect a voice signal input by a user. Specifically, the second acquisition module 40 may capture the user's voice signal through a voice input device such as a microphone and then send the captured signal to a server for speech recognition. The first training module 30 may store the trained speech model on a speech recognition cloud server, and the second acquisition module 40 then transmits the captured voice signal to the cloud for recognition.
The second processing module 50 is configured to denoise the voice signal according to the speech model when the voice signal contains noise. Specifically, after receiving the voice signal captured by the second acquisition module 40, the second processing module 50 estimates the signal-to-noise ratio of the user's voice signal in order to classify it: when the signal-to-noise ratio is below a preset threshold, the voice signal is judged to contain noise; when it is above the threshold, the voice signal is judged to be noise-free. If the voice signal is judged to contain noise, the second processing module 50 concludes that it was recorded in a noisy environment and denoises it according to the speech model pre-stored on the server, i.e., the recurrent neural network converts the recorded voice signal into a voice signal as if recorded in a quiet environment.
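The SNR-based classification can be sketched as follows. The patent does not disclose the estimator or the threshold value; this frame-energy heuristic (quietest frames taken as the noise floor) and the 20 dB default are illustrative assumptions:

```python
import math

def estimate_snr_db(signal, frame_len=400):
    """Crude SNR estimate from frame energies: the quietest tenth of the
    frames is taken as the noise floor and the loudest tenth as speech."""
    n_frames = max(1, len(signal) // frame_len)
    energies = sorted(
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    )
    k = max(1, n_frames // 10)
    noise_power = sum(energies[:k]) / k          # quietest frames
    speech_power = sum(energies[-k:]) / k        # loudest frames
    return 10 * math.log10(max(speech_power - noise_power, 1e-12)
                           / max(noise_power, 1e-12))

def contains_noise(signal, snr_threshold_db=20.0):
    """Classify an utterance as noisy when its estimated SNR falls below the threshold."""
    return bool(estimate_snr_db(signal) < snr_threshold_db)
```

A production system would use a more robust estimator (e.g. one based on a voice-activity detector), but the decision structure — threshold the estimated SNR, then branch — is the same.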
The speech recognition module 60 is configured to perform speech recognition on the denoised voice signal according to an acoustic model; it is further configured to perform speech recognition directly on the voice signal, according to the acoustic model, when the signal does not contain noise. Specifically, after the second processing module 50 has denoised the voice signal, the speech recognition module 60 performs recognition through a decoder on the server: the decoder decodes the denoised voice signal according to an acoustic model pre-stored on the server, converts it into text, and feeds the recognition result back to the user. The acoustic model is obtained by training on a large number of clean speech samples.
Specifically, if the voice signal is judged to contain no noise, the second processing module 50 concludes that it was recorded in a quiet environment. In that case no denoising by the speech model is needed: the speech recognition module 60 directly decodes the voice signal according to the acoustic model through the server's decoder, converts it into text, and feeds the recognition result back to the user.
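The two-branch flow described above (denoise first when the SNR test says the recording is noisy, otherwise decode directly) can be sketched as a small dispatcher; `estimate_snr_db`, `denoise`, and `decode` stand in for the SNR estimator, the speech model, and the acoustic-model decoder, none of which are concretely specified in the source:

```python
def recognize(signal, estimate_snr_db, denoise, decode, snr_threshold_db=20.0):
    """Route an utterance through the two recognition branches: a low estimated
    SNR means the recording is noisy, so the speech model converts it to a
    quiet-environment signal before the acoustic-model decoder runs."""
    if estimate_snr_db(signal) < snr_threshold_db:
        signal = denoise(signal)   # speech model: noisy -> quiet-environment signal
    return decode(signal)          # decoder + acoustic model: signal -> text
```

A deployment would plug the trained recurrent network in as `denoise` and the server-side decoder as `decode`.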
With this speech signal processing apparatus, a voice signal recorded in a noisy environment is preprocessed during recognition: it is first converted into a voice signal as if recorded in a quiet environment and only then recognized, while a voice signal recorded in a quiet environment is recognized directly. This preserves recognition accuracy in quiet environments while greatly improving it in noisy ones, improving the accuracy and robustness of the speech recognition service and the user experience.
Fig. 6 is a schematic structural diagram of a speech signal processing apparatus according to another embodiment of the present invention.
As shown in fig. 6, the speech signal processing apparatus includes: a first acquisition module 10, a first processing module 20, a first training module 30, a second acquisition module 40, a second processing module 50, a speech recognition module 60, an acquisition module 70, a third processing module 80, and a second training module 90.
The obtaining module 70 is configured to obtain a speech training sample signal. The speech training sample signal is training data used to retrain the acoustic model; it is a voice signal recorded in a noisy environment, i.e., noisy speech training data.
The third processing module 80 is configured to denoise the speech training sample signal according to the speech model, and the second training module 90 is configured to train the acoustic model on the denoised speech training sample signal. Specifically, the third processing module 80 extracts the acoustic features of the speech training samples and maps them through the recurrent neural network according to the speech model; the second training module 90 then retrains the existing acoustic model on these processed acoustic features, yielding an acoustic model that better matches features that have passed through the speech model.
With this speech signal processing apparatus, the existing acoustic model is retrained on the speech training samples, so that the retrained acoustic model better matches the preprocessed voice signals, further improving recognition accuracy and the speech recognition service experience.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A method for processing a speech signal, comprising the steps of:
acquiring a noise sample signal, wherein the noise sample signal comprises different scene noise;
processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise;
training a voice model according to the noise voice sample signal and the pure voice sample signal, wherein the voice model comprises a mapping relation of acoustic characteristics of the noise voice sample signal and acoustic characteristics of the pure voice sample signal, which is established through a recurrent neural network;
collecting voice signals input by a user;
when the voice signal contains noise, denoising the voice signal according to the voice model;
and performing voice recognition on the denoised voice signal according to an acoustic model, wherein the acoustic model is obtained by training on pure voice samples.
2. The method for processing a speech signal according to claim 1, wherein training a voice model according to the noise voice sample signal and the pure voice sample signal further comprises:
and extracting the acoustic characteristics of the noise voice sample signal and the pure voice sample signal, and establishing a mapping relation between the acoustic characteristics of the noise voice sample signal and the acoustic characteristics of the pure voice sample signal.
3. The method for processing a speech signal according to claim 1, further comprising:
and when the voice signal does not contain noise, performing voice recognition on the voice signal according to the acoustic model.
4. The method for processing a speech signal according to claim 3, further comprising:
acquiring a voice training sample signal;
denoising the voice training sample signal according to the voice model, and training the acoustic model according to the denoised voice training sample signal.
5. An apparatus for processing a speech signal, comprising:
a first acquisition module, configured to acquire a noise sample signal, where the noise sample signal includes different scene noises;
the first processing module is used for processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise;
the first training module is used for training a voice model according to the noise voice sample signal and the pure voice sample signal, wherein the voice model comprises a mapping relation of the acoustic characteristics of the noise voice sample signal and the acoustic characteristics of the pure voice sample signal, which is established through a recurrent neural network;
the second acquisition module is used for acquiring voice signals input by a user;
the second processing module is used for carrying out denoising processing on the voice signal according to the voice model when the voice signal contains noise;
and the voice recognition module is used for performing voice recognition on the denoised voice signal according to an acoustic model, wherein the acoustic model is obtained by training on pure voice samples.
6. The apparatus for processing the speech signal according to claim 5, wherein the first training module is further configured to:
and extracting the acoustic characteristics of the noise voice sample signal and the pure voice sample signal, and establishing a mapping relation between the acoustic characteristics of the noise voice sample signal and the acoustic characteristics of the pure voice sample signal.
7. The apparatus for processing the speech signal according to claim 5, wherein the speech recognition module is further configured to perform speech recognition on the speech signal according to the acoustic model when the speech signal does not contain noise.
8. The apparatus for processing a speech signal according to claim 7, further comprising:
the acquisition module is used for acquiring a voice training sample signal;
the third processing module is used for carrying out denoising processing on the voice training sample signal according to the voice model;
and the second training module is used for training the acoustic model according to the voice training sample signal after denoising processing.
CN201510866175.5A 2015-11-27 2015-11-27 Voice signal processing method and device Active CN105448303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510866175.5A CN105448303B (en) 2015-11-27 2015-11-27 Voice signal processing method and device

Publications (2)

Publication Number Publication Date
CN105448303A CN105448303A (en) 2016-03-30
CN105448303B true CN105448303B (en) 2020-02-04

Family

ID=55558409

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106409289B (en) * 2016-09-23 2019-06-28 合肥美的智能科技有限公司 Environment self-adaption method, speech recognition equipment and the household electrical appliance of speech recognition
CN106328126B (en) * 2016-10-20 2019-08-16 北京云知声信息技术有限公司 Far field voice recognition processing method and device
CN106557164A (en) * 2016-11-18 2017-04-05 北京光年无限科技有限公司 It is applied to the multi-modal output intent and device of intelligent robot
CN106888392A (en) * 2017-02-14 2017-06-23 广东九联科技股份有限公司 A kind of Set Top Box automatic translation system and method
CN109427340A (en) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 A kind of sound enhancement method, device and electronic equipment
CN108022596A (en) * 2017-11-28 2018-05-11 湖南海翼电子商务股份有限公司 Audio signal processing method and vehicle electronic device
CN108335694B (en) 2018-02-01 2021-10-15 北京百度网讯科技有限公司 Far-field environment noise processing method, device, equipment and storage medium
CN108428446B (en) * 2018-03-06 2020-12-25 北京百度网讯科技有限公司 Speech recognition method and device
CN110503967B (en) * 2018-05-17 2021-11-19 中国移动通信有限公司研究院 Voice enhancement method, device, medium and equipment
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
CN109378010A (en) * 2018-10-29 2019-02-22 珠海格力电器股份有限公司 Neural network model training method, voice denoising method and device
CN109616100B (en) * 2019-01-03 2022-06-24 百度在线网络技术(北京)有限公司 Method and device for generating voice recognition model
CN111862945A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110570845B (en) * 2019-08-15 2021-10-22 武汉理工大学 Voice recognition method based on domain invariant features
CN111243573B (en) * 2019-12-31 2022-11-01 深圳市瑞讯云技术有限公司 Voice training method and device
CN111081223B (en) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Voice recognition method, device, equipment and storage medium
CN110875050B (en) * 2020-01-17 2020-05-08 深圳亿智时代科技有限公司 Voice data collection method, device, equipment and medium for real scene
CN111354374A (en) * 2020-03-13 2020-06-30 北京声智科技有限公司 Voice processing method, model training method and electronic equipment
CN112201227B (en) * 2020-09-28 2024-06-28 海尔优家智能科技(北京)有限公司 Speech sample generation method and device, storage medium and electronic device
CN112259113B (en) * 2020-09-30 2024-07-30 清华大学苏州汽车研究院(相城) Preprocessing system for improving accuracy of in-vehicle voice recognition and control method thereof
CN113053404A (en) * 2021-03-22 2021-06-29 三一重机有限公司 Method and device for interaction between inside and outside of cab

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633842B1 (en) * 1999-10-22 2003-10-14 Texas Instruments Incorporated Speech recognition front-end feature extraction for noisy speech
JP4590692B2 (en) * 2000-06-28 2010-12-01 パナソニック株式会社 Acoustic model creation apparatus and method
US6876966B1 (en) * 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction
US7363221B2 (en) * 2003-08-19 2008-04-22 Microsoft Corporation Method of noise reduction using instantaneous signal-to-noise ratio as the principal quantity for optimal estimation
US7680656B2 (en) * 2005-06-28 2010-03-16 Microsoft Corporation Multi-sensory speech enhancement using a speech-state model
CN101154383B (en) * 2006-09-29 2010-10-06 株式会社东芝 Method and device for noise suppression, phonetic feature extraction, speech recognition and training voice model
KR101253102B1 (en) * 2009-09-30 2013-04-10 한국전자통신연구원 Apparatus for filtering noise of model based distortion compensational type for voice recognition and method thereof
CN103000174B (en) * 2012-11-26 2015-06-24 河海大学 Feature compensation method based on rapid noise estimation in speech recognition system
CN104485103B (en) * 2014-11-21 2017-09-01 东南大学 A kind of multi-environment model isolated word recognition method based on vector Taylor series
CN104900232A (en) * 2015-04-20 2015-09-09 东南大学 Isolation word identification method based on double-layer GMM structure and VTS feature compensation



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant