CN105448303B - Voice signal processing method and device - Google Patents


Info

Publication number
CN105448303B
CN105448303B (application CN201510866175.5A)
Authority
CN
China
Prior art keywords
voice
noise
signal
sample signal
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510866175.5A
Other languages
Chinese (zh)
Other versions
CN105448303A (en)
Inventor
时雪煜
李先刚
邹赛赛
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510866175.5A
Publication of CN105448303A
Application granted
Publication of CN105448303B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a voice signal processing method and device. The method comprises the following steps: collecting a noise sample signal; processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal containing noise; and training a voice model according to the noise voice sample signal and the pure voice sample signal. With the voice model obtained by the voice signal processing method of the embodiments of the invention, the accuracy of voice recognition in a noise environment can be greatly improved, and the robustness of the voice recognition service and the user experience of that service are improved.

Description

Voice signal processing method and device
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for processing a speech signal.
Background
Speech recognition refers to the automatic conversion of human speech into corresponding text by a machine. In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, the performance of recognition systems has improved greatly.
In the related art, the acoustic model and the language model used in speech recognition are trained on a large number of pure voice samples. The larger the training set, the better the resulting acoustic model and the higher the accuracy of speech recognition.
However, with the development of the mobile internet, voice input has become increasingly common, voice users are increasingly widespread, and the environments in which users speak differ greatly, especially noisy environments such as vehicle-mounted noise while an automobile is being driven, or crowd noise in a restaurant or other crowded places. Existing speech recognition training lacks noise voice samples, and the acoustic characteristics of noise voice samples differ greatly from those of pure voice samples. As a result, the acoustic model of the related art achieves high recognition accuracy in a quiet environment, but its accuracy drops sharply in a noise environment.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present invention is to provide a method for processing a voice signal, which greatly improves the accuracy of voice recognition in a noise environment and improves the robustness of the voice recognition service and the user experience of that service.
A second object of the present invention is to provide a processing apparatus for speech signals.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a method for processing a speech signal, including the following steps: collecting a noise sample signal; processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise; and training a voice model according to the noise voice sample signal and the pure voice sample signal.
According to the voice signal processing method, the noise voice samples are generated according to the noise samples and the pure voice samples of different scenes, and the voice model is trained according to the noise voice samples and the pure voice samples, so that voice signals in various noise environments can be converted into voice signals in a quiet environment through the voice model, the accuracy of voice recognition in the noise environment is greatly improved, and the robustness of voice recognition service and the experience of the voice recognition service are improved.
In order to achieve the above object, an embodiment of a second aspect of the present invention provides a voice signal processing apparatus, including: a first acquisition module, configured to collect a noise sample signal; a first processing module, configured to process a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal containing noise; and a first training module, configured to train a voice model according to the noise voice sample signal and the pure voice sample signal.
According to the voice signal processing device, the noise voice samples are generated according to the noise samples and the pure voice samples of different scenes, and the voice model is trained according to the noise voice samples and the pure voice samples, so that voice signals in various noise environments can be converted into voice signals in a quiet environment through the voice model, the accuracy of voice recognition in the noise environment is greatly improved, and the robustness of voice recognition service and the experience of the voice recognition service are improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method of processing a speech signal according to one embodiment of the invention;
FIG. 2 is a flow chart of a method of processing a speech signal according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method of processing a speech signal according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a speech signal processing apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech signal processing apparatus according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
A method and apparatus for processing a speech signal according to an embodiment of the present invention will be described below with reference to the accompanying drawings.
A method of processing a speech signal, comprising the steps of: a. collecting a noise sample signal; b. processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise; c. a speech model is trained based on the noisy speech sample signal and the clean speech sample signal.
Fig. 1 is a flowchart of a speech signal processing method according to an embodiment of the present invention.
As shown in fig. 1, the processing method of the voice signal includes the following steps:
and S101, collecting a noise sample signal.
Specifically, scene noise that may occur during voice recognition is collected as the noise sample signal. The scene noise may be collected in a number of different scenes: for example, vehicle-mounted noise while an automobile is being driven, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals collected in different environments can be processed, and the higher the accuracy of voice recognition.
And S102, processing the pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise.
The pure voice sample signal is a voice sample signal recorded in a quiet environment, i.e., a voice signal that contains no noise. That is, the collected noise sample signal is used to add noise to the voice sample signal recorded in the quiet environment, so as to obtain a voice sample signal in a noise environment, i.e., the noise voice sample signal.
It should be understood that any existing method may be used to add noise to the pure voice sample signal, and the details are not repeated here.
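As an illustration of this noise-adding step, the sketch below mixes a collected noise sample into a pure voice sample at a chosen signal-to-noise ratio. The patent does not specify the mixing procedure; the function name, the SNR-based scaling, and the tiling of short noise clips are all assumptions made for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that mixing it into `clean` yields the target SNR (dB)."""
    # Tile and trim the noise clip to the length of the clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain g so that clean_power / (g^2 * noise_power) = 10^(snr_db/10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

Mixing the same pure sample with several noise types and SNR levels multiplies the amount of noise voice training data without any extra recording.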
And S103, training a voice model according to the noise voice sample signal and the pure voice sample signal.
In one embodiment of the present invention, training the voice model according to the noise voice sample signal and the pure voice sample signal further comprises: extracting acoustic features of the noise voice sample signal and the pure voice sample signal, and establishing a mapping between the acoustic features of the noise voice sample signal and the acoustic features of the pure voice sample signal.
Specifically, the speech model can be obtained by extracting acoustic features of the noise speech sample signal and the clean speech sample signal, and establishing a mapping from the acoustic features of the noise speech sample signal to the acoustic features of the clean speech sample signal through a recurrent neural network.
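The patent does not specify the network architecture, so the following is only a minimal sketch of the idea: a simple Elman-style recurrent pass that maps a sequence of noisy acoustic feature vectors, frame by frame, to estimates of the clean features. All names and dimensions are hypothetical; a real system would learn the weights by minimizing the error against the pure-sample features.

```python
import numpy as np

def rnn_map_features(noisy_feats, Wx, Wh, Wo, bh, bo):
    """Forward pass of a simple Elman RNN mapping noisy acoustic feature
    vectors (one per frame) to clean-feature estimates."""
    h = np.zeros(Wh.shape[0])
    outputs = []
    for x in noisy_feats:
        h = np.tanh(Wx @ x + Wh @ h + bh)  # recurrent hidden state
        outputs.append(Wo @ h + bo)        # linear readout: clean-feature estimate
    return np.array(outputs)
```

Because the hidden state carries context across frames, the mapping can exploit the temporal structure of both the speech and the noise, which is one reason a recurrent network is a natural fit here.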
In the voice recognition process, because the voice model is trained on noise voice sample signals, a voice signal recorded in a noise environment can be mapped to a voice signal as if it had been recorded in a quiet environment, so the signal can be recognized accurately and the accuracy of voice recognition is improved. Moreover, the recurrent neural network is highly robust: even for scene noise that did not appear in training, it can still establish a good mapping from the voice signal in the noise environment to a voice signal in a quiet environment, so that voice recorded in unseen scene noise is also recognized accurately, further improving the accuracy of voice recognition.
According to the voice signal processing method, the noise voice samples are generated according to the noise samples and the pure voice samples of different scenes, and the voice model is trained according to the noise voice samples and the pure voice samples, so that voice signals in various noise environments can be converted into voice signals in a quiet environment through the voice model, the accuracy of voice recognition in the noise environment is greatly improved, and the robustness of voice recognition service and the experience of the voice recognition service are improved.
Fig. 2 is a flow chart of a speech signal processing method according to an embodiment of the invention.
As shown in fig. 2, the processing method of the voice signal includes the following steps:
s201, collecting a noise sample signal.
Specifically, scene noise that may occur during voice recognition is collected as the noise sample signal. The scene noise may be collected in a number of different scenes: for example, vehicle-mounted noise while an automobile is being driven, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals collected in different environments can be processed, and the higher the accuracy of voice recognition.
S202, processing the pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise.
The pure voice sample signal is a voice sample signal recorded in a quiet environment, i.e., a voice signal that contains no noise. That is, the collected noise sample signal is used to add noise to the voice sample signal recorded in the quiet environment, so as to obtain a voice sample signal in a noise environment, i.e., the noise voice sample signal.
It should be understood that any existing method may be used to add noise to the pure voice sample signal, and the details are not repeated here.
And S203, training a voice model according to the noise voice sample signal and the pure voice sample signal.
In one embodiment of the present invention, training the voice model according to the noise voice sample signal and the pure voice sample signal further comprises: extracting acoustic features of the noise voice sample signal and the pure voice sample signal, and establishing a mapping between the acoustic features of the noise voice sample signal and the acoustic features of the pure voice sample signal.
Specifically, the speech model can be obtained by extracting acoustic features of the noise speech sample signal and the clean speech sample signal, and establishing a mapping from the acoustic features of the noise speech sample signal to the acoustic features of the clean speech sample signal through a recurrent neural network.
In the voice recognition process, because the voice model is trained on noise voice sample signals, a voice signal recorded in a noise environment can be mapped to a voice signal as if it had been recorded in a quiet environment, so the signal can be recognized accurately and the accuracy of voice recognition is improved. Moreover, the recurrent neural network is highly robust: even for scene noise that did not appear in training, it can still establish a good mapping from the voice signal in the noise environment to a voice signal in a quiet environment, so that voice recorded in unseen scene noise is also recognized accurately, further improving the accuracy of voice recognition.
And S204, collecting voice signals input by the user.
Specifically, a voice signal of the user may be collected through a voice input device, such as a microphone, and then the collected voice signal may be sent to the server for voice recognition. The trained voice model can be stored in a voice recognition cloud, and the collected voice signal is sent to the cloud for voice recognition.
S205, determine whether the speech signal contains noise.
Specifically, after receiving the voice signal input by the user, the server estimates the signal-to-noise ratio of the voice signal so as to classify it. For example, when the signal-to-noise ratio of the voice signal is less than a certain threshold, the voice signal is judged to contain noise; when the signal-to-noise ratio is greater than or equal to the threshold, the voice signal is judged not to contain noise.
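A minimal sketch of such a signal-to-noise-ratio check is shown below. The patent does not say how the SNR is estimated; here the quietest frames of the utterance are taken as the noise floor, and the frame length and threshold value are arbitrary placeholders.

```python
import numpy as np

def estimate_snr_db(signal, frame_len=400, noise_fraction=0.1):
    """Crude SNR estimate: treat the quietest frames as the noise floor."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sort(np.mean(frames ** 2, axis=1))
    k = max(1, int(n_frames * noise_fraction))
    noise_power = np.mean(energies[:k])   # quietest frames approximate the noise
    signal_power = np.mean(energies)      # overall average frame power
    return 10 * np.log10(signal_power / max(noise_power, 1e-12))

def contains_noise(signal, threshold_db=20.0):
    """Classify the utterance as noise-containing when the SNR is below the threshold."""
    return estimate_snr_db(signal) < threshold_db
```

This keeps the classification cheap: the server only decides whether to route the utterance through the denoising voice model, not where the noise is.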
S206, if the voice signal contains noise, denoising the voice signal according to the voice model.
Specifically, if it is determined that the speech signal contains noise, it may be determined that the speech signal is recorded in a noisy environment, and at this time, the speech signal needs to be denoised according to a speech model pre-stored in the server, that is, the speech signal recorded by the user is converted into a speech signal in a quiet environment through a recurrent neural network.
In an embodiment of the present invention, the collected noise-containing voice signal is converted into a voice signal that contains no noise according to the mapping, stored in the voice model, from noise voice samples to pure voice samples.
And S207, performing voice recognition on the voice signal subjected to the denoising processing according to the acoustic model.
Specifically, after denoising is performed on a voice signal input by a user, voice recognition is performed through a decoder of the server, that is, the decoder decodes the denoised voice signal according to an acoustic model prestored in the server, converts the voice signal into text information, and then feeds back a recognition result to the user. Wherein the acoustic model is obtained by training a large number of clean speech samples.
S208, if the voice signal does not contain noise, performing voice recognition on the voice signal according to the acoustic model.
Specifically, if the voice signal is judged not to contain noise, it can be determined that the voice signal is recorded in a quiet environment, at this time, the voice signal does not need to be denoised by a voice model, but the voice signal is directly decoded by a decoder of the server according to an acoustic model, converted into text information, and then the recognition result is fed back to the user.
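Steps S205 to S208 can be summarized as a small dispatch routine. The callables below (`is_noisy`, `denoise`, `decode`) are placeholders standing in for the SNR check, the voice model, and the decoder with its acoustic model; none of these names appear in the patent.

```python
def recognize(signal, is_noisy, denoise, decode):
    """Recognition flow of S205-S208: denoise only when the SNR check flags
    the utterance as noisy, then decode with the acoustic model."""
    if is_noisy(signal):
        signal = denoise(signal)  # map to a quiet-environment signal first
    return decode(signal)         # decoder + acoustic model -> text
```

Keeping the denoising step conditional means a quiet-environment utterance is never distorted by an unnecessary pass through the voice model.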
According to the voice signal processing method, in the voice recognition process, a voice signal recorded in a noise environment is preprocessed and converted into a voice signal in a quiet environment before voice recognition, while a voice signal recorded in a quiet environment is recognized directly. In this way, the accuracy of voice recognition in a quiet environment is preserved, the accuracy of voice recognition in a noise environment is greatly improved, and the accuracy, robustness and service experience of the voice recognition service are all improved.
Fig. 3 is a flow chart of a speech signal processing method according to another embodiment of the present invention.
As shown in fig. 3, the processing method of the voice signal includes the following steps:
s301, collecting a noise sample signal.
Specifically, scene noise that may occur during voice recognition is collected as the noise sample signal. The scene noise may be collected in a number of different scenes: for example, vehicle-mounted noise while an automobile is being driven, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals collected in different environments can be processed, and the higher the accuracy of voice recognition.
And S302, processing the pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise.
The pure voice sample signal is a voice sample signal recorded in a quiet environment, i.e., a voice signal that contains no noise. That is, the collected noise sample signal is used to add noise to the voice sample signal recorded in the quiet environment, so as to obtain a voice sample signal in a noise environment, i.e., the noise voice sample signal.
It should be understood that any existing method may be used to add noise to the pure voice sample signal, and the details are not repeated here.
S303, training a voice model according to the noise voice sample signal and the pure voice sample signal.
In one embodiment of the present invention, training the voice model according to the noise voice sample signal and the pure voice sample signal further comprises: extracting acoustic features of the noise voice sample signal and the pure voice sample signal, and establishing a mapping between the acoustic features of the noise voice sample signal and the acoustic features of the pure voice sample signal.
Specifically, the speech model can be obtained by extracting acoustic features of the noise speech sample signal and the clean speech sample signal, and establishing a mapping from the acoustic features of the noise speech sample signal to the acoustic features of the clean speech sample signal through a recurrent neural network.
In the voice recognition process, because the voice model is trained on noise voice sample signals, a voice signal recorded in a noise environment can be mapped to a voice signal as if it had been recorded in a quiet environment, so the signal can be recognized accurately and the accuracy of voice recognition is improved. Moreover, the recurrent neural network is highly robust: even for scene noise that did not appear in training, it can still establish a good mapping from the voice signal in the noise environment to a voice signal in a quiet environment, so that voice recorded in unseen scene noise is also recognized accurately, further improving the accuracy of voice recognition.
S304, obtaining a voice training sample signal.
Specifically, even if a voice signal collected in a noise environment is preprocessed in the voice recognition process, that is, denoised according to the voice model, the preprocessed voice signal may still contain residual noise. Therefore, in this embodiment, the acoustic model used for voice recognition is retrained according to the recurrent neural network, so that the retrained acoustic model better matches the preprocessed voice signal, further improving the accuracy of voice recognition.
The voice training sample signal is voice training data used for retraining the acoustic model, and the voice training sample signal is a voice signal in a noise environment, namely noise voice training data.
S305, denoising the voice training sample signal according to the voice model, and training an acoustic model according to the denoised voice training sample signal.
Specifically, the acoustic features of the voice training samples are first extracted; the acoustic features are then mapped through the recurrent neural network according to the voice model, and the existing acoustic model is retrained on the processed acoustic features, so that an acoustic model is obtained that better matches acoustic features processed by the voice model.
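Under the assumption that the trained voice model is available as a feature-mapping function, preparing the retraining data reduces to passing every noisy training utterance through it, as sketched below; `denoise` is a hypothetical placeholder for that mapping.

```python
import numpy as np

def prepare_retraining_features(noisy_train_utts, denoise):
    """Denoise every noisy training utterance with the trained voice model,
    so the acoustic model is retrained on the same (imperfectly denoised)
    features it will encounter at recognition time."""
    return [denoise(utt) for utt in noisy_train_utts]
```

The point of the design is matched conditions: training and test features pass through the identical front end, so any residual distortion the voice model introduces is seen by the acoustic model during retraining.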
And S306, collecting the voice signal input by the user.
Specifically, a voice signal of the user may be collected through a voice input device, such as a microphone, and then the collected voice signal may be sent to the server for voice recognition. The trained voice model can be stored in a voice recognition cloud, and the collected voice signal is sent to the cloud for voice recognition.
S307, it is determined whether the speech signal contains noise.
Specifically, after receiving the voice signal input by the user, the server estimates the signal-to-noise ratio of the voice signal so as to classify it. For example, when the signal-to-noise ratio of the voice signal is less than a certain threshold, the voice signal is judged to contain noise; when the signal-to-noise ratio is greater than or equal to the threshold, the voice signal is judged not to contain noise.
S308, if the voice signal contains noise, denoising the voice signal according to the voice model.
Specifically, if it is determined that the speech signal contains noise, it may be determined that the speech signal is recorded in a noisy environment, and at this time, the speech signal needs to be denoised according to a speech model pre-stored in the server, that is, the speech signal recorded by the user is converted into a speech signal in a quiet environment through a recurrent neural network.
In an embodiment of the present invention, the collected noise-containing voice signal is converted into a voice signal that contains no noise according to the mapping, stored in the voice model, from noise voice samples to pure voice samples.
And S309, performing voice recognition on the voice signal subjected to the denoising processing according to the acoustic model.
Specifically, after denoising is performed on a voice signal input by a user, voice recognition is performed through a decoder of the server, that is, the decoder decodes the denoised voice signal according to an acoustic model prestored in the server, converts the voice signal into text information, and then feeds back a recognition result to the user. Wherein the acoustic model is obtained by training a large number of clean speech samples.
According to the voice signal processing method, the existing acoustic model is retrained through the voice training sample, so that the retrained acoustic model is matched with the preprocessed voice signal better, the accuracy of voice recognition is further improved, and the experience of voice recognition service is improved.
In order to implement the above embodiments, the present invention further provides a processing apparatus for a speech signal.
Fig. 4 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention.
As shown in fig. 4, the speech signal processing apparatus includes: a first acquisition module 10, a first processing module 20 and a first training module 30.
The first acquisition module 10 is configured to collect a noise sample signal. Specifically, the first acquisition module 10 collects scene noise that may occur during voice recognition as the noise sample signal. The scene noise may be collected in a number of different scenes: for example, vehicle-mounted noise while an automobile is being driven, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals the first acquisition module 10 collects, the more accurately voice signals collected in different environments can be processed, and the higher the accuracy of voice recognition.
The first processing module 20 is configured to process a pre-stored clean speech sample signal according to the noise sample signal, so as to obtain a noise speech sample signal with noise. The clean speech sample signal is a speech sample signal in a quiet environment, i.e., a speech signal containing no noise signal. That is, the first processing module 20 performs noise processing on the speech sample signal in the quiet environment through the collected noise sample signal to obtain the speech sample signal in the noise environment, i.e., the noise speech sample signal.
The first training module 30 is used for training a speech model according to the noise speech sample signal and the clean speech sample signal. The first training module 30 extracts acoustic features of the noise speech sample signal and the clean speech sample signal, and establishes a mapping relationship between the acoustic features of the noise speech sample signal and the acoustic features of the clean speech sample signal. Specifically, the first training module 30 may obtain the speech model by extracting acoustic features of the noise speech sample signal and the clean speech sample signal, and establishing a mapping from the acoustic features of the noise speech sample signal to the acoustic features of the clean speech sample signal through a recurrent neural network.
According to the voice signal processing device, the noise voice samples are generated according to the noise samples and the pure voice samples of different scenes, and the voice model is trained according to the noise voice samples and the pure voice samples, so that voice signals in various noise environments can be converted into voice signals in a quiet environment through the voice model, the accuracy of voice recognition in the noise environment is greatly improved, and the robustness of voice recognition service and the experience of the voice recognition service are improved.
Fig. 5 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention.
As shown in fig. 5, the speech signal processing apparatus includes: a first acquisition module 10, a first processing module 20, a first training module 30, a second acquisition module 40, a second processing module 50, and a speech recognition module 60.
The second acquisition module 40 is configured to collect a voice signal input by a user. Specifically, the second acquisition module 40 may capture the user's voice signal through a voice input device such as a microphone and then send the captured signal to a server for speech recognition. The first training module 30 may store the trained speech model on a speech recognition cloud server, and the second acquisition module 40 then transmits the captured voice signal to the cloud for recognition.
The second processing module 50 is configured to denoise the voice signal according to the speech model when the voice signal contains noise. Specifically, after receiving the voice signal captured by the second acquisition module 40, the second processing module 50 estimates the signal-to-noise ratio of the user's voice signal in order to classify it: when the signal-to-noise ratio is below a preset threshold, the voice signal is judged to contain noise; when it is above the threshold, the voice signal is judged to be noise-free. If the voice signal is judged to contain noise, the second processing module 50 concludes that it was recorded in a noisy environment and denoises it according to the speech model pre-stored on the server, i.e., the recurrent neural network converts the recorded voice signal into a voice signal as if recorded in a quiet environment.
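The SNR-based classification can be sketched as follows. The patent does not disclose the estimator or the threshold value; this frame-energy heuristic (quietest frames taken as the noise floor) and the 20 dB default are illustrative assumptions:

```python
import math

def estimate_snr_db(signal, frame_len=400):
    """Crude SNR estimate from frame energies: the quietest tenth of the
    frames is taken as the noise floor and the loudest tenth as speech."""
    n_frames = max(1, len(signal) // frame_len)
    energies = sorted(
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    )
    k = max(1, n_frames // 10)
    noise_power = sum(energies[:k]) / k          # quietest frames
    speech_power = sum(energies[-k:]) / k        # loudest frames
    return 10 * math.log10(max(speech_power - noise_power, 1e-12)
                           / max(noise_power, 1e-12))

def contains_noise(signal, snr_threshold_db=20.0):
    """Classify an utterance as noisy when its estimated SNR falls below the threshold."""
    return bool(estimate_snr_db(signal) < snr_threshold_db)
```

A production system would use a more robust estimator (e.g. one based on a voice-activity detector), but the decision structure — threshold the estimated SNR, then branch — is the same.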
The speech recognition module 60 is configured to perform speech recognition on the denoised voice signal according to an acoustic model; it is further configured to perform speech recognition directly on the voice signal, according to the acoustic model, when the signal does not contain noise. Specifically, after the second processing module 50 has denoised the voice signal, the speech recognition module 60 performs recognition through a decoder on the server: the decoder decodes the denoised voice signal according to an acoustic model pre-stored on the server, converts it into text, and feeds the recognition result back to the user. The acoustic model is obtained by training on a large number of clean speech samples.
Specifically, if the voice signal is judged to contain no noise, the second processing module 50 concludes that it was recorded in a quiet environment. In that case no denoising by the speech model is needed: the speech recognition module 60 directly decodes the voice signal according to the acoustic model through the server's decoder, converts it into text, and feeds the recognition result back to the user.
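The two-branch flow described above (denoise first when the SNR test says the recording is noisy, otherwise decode directly) can be sketched as a small dispatcher; `estimate_snr_db`, `denoise`, and `decode` stand in for the SNR estimator, the speech model, and the acoustic-model decoder, none of which are concretely specified in the source:

```python
def recognize(signal, estimate_snr_db, denoise, decode, snr_threshold_db=20.0):
    """Route an utterance through the two recognition branches: a low estimated
    SNR means the recording is noisy, so the speech model converts it to a
    quiet-environment signal before the acoustic-model decoder runs."""
    if estimate_snr_db(signal) < snr_threshold_db:
        signal = denoise(signal)   # speech model: noisy -> quiet-environment signal
    return decode(signal)          # decoder + acoustic model: signal -> text
```

A deployment would plug the trained recurrent network in as `denoise` and the server-side decoder as `decode`.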
With this speech signal processing apparatus, a voice signal recorded in a noisy environment is preprocessed during recognition: it is first converted into a voice signal as if recorded in a quiet environment and only then recognized, while a voice signal recorded in a quiet environment is recognized directly. This preserves recognition accuracy in quiet environments while greatly improving it in noisy ones, improving the accuracy and robustness of the speech recognition service and the user experience.
Fig. 6 is a schematic structural diagram of a speech signal processing apparatus according to another embodiment of the present invention.
As shown in fig. 6, the speech signal processing apparatus includes: a first acquisition module 10, a first processing module 20, a first training module 30, a second acquisition module 40, a second processing module 50, a speech recognition module 60, an acquisition module 70, a third processing module 80, and a second training module 90.
The obtaining module 70 is configured to obtain a speech training sample signal. The speech training sample signal is training data used to retrain the acoustic model; it is a voice signal recorded in a noisy environment, i.e., noisy speech training data.
The third processing module 80 is configured to denoise the speech training sample signal according to the speech model, and the second training module 90 is configured to train the acoustic model on the denoised speech training sample signal. Specifically, the third processing module 80 extracts the acoustic features of the speech training samples and maps them through the recurrent neural network according to the speech model; the second training module 90 then retrains the existing acoustic model on these processed acoustic features, yielding an acoustic model that better matches features that have passed through the speech model.
With this speech signal processing apparatus, the existing acoustic model is retrained on the speech training samples, so that the retrained acoustic model better matches the preprocessed voice signals, further improving recognition accuracy and the speech recognition service experience.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A method for processing a speech signal, comprising the steps of:
acquiring a noise sample signal, wherein the noise sample signal comprises different scene noise;
processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise;
training a voice model according to the noise voice sample signal and the pure voice sample signal, wherein the voice model comprises a mapping relation of acoustic characteristics of the noise voice sample signal and acoustic characteristics of the pure voice sample signal, which is established through a recurrent neural network;
collecting voice signals input by a user;
when the voice signal contains noise, denoising the voice signal according to the voice model;
and performing voice recognition on the denoised voice signal according to an acoustic model, wherein the acoustic model is obtained by training on pure voice samples.
2. The method for processing a speech signal according to claim 1, wherein training a voice model according to the noise voice sample signal and the pure voice sample signal further comprises:
and extracting the acoustic characteristics of the noise voice sample signal and the pure voice sample signal, and establishing a mapping relation between the acoustic characteristics of the noise voice sample signal and the acoustic characteristics of the pure voice sample signal.
3. The method for processing a speech signal according to claim 1, further comprising:
and when the voice signal does not contain noise, performing voice recognition on the voice signal according to the acoustic model.
4. The method for processing a speech signal according to claim 3, further comprising:
acquiring a voice training sample signal;
denoising the voice training sample signal according to the voice model, and training the acoustic model according to the denoised voice training sample signal.
5. An apparatus for processing a speech signal, comprising:
a first acquisition module, configured to acquire a noise sample signal, where the noise sample signal includes different scene noises;
the first processing module is used for processing a pre-stored pure voice sample signal according to the noise sample signal to obtain a noise voice sample signal with noise;
the first training module is used for training a voice model according to the noise voice sample signal and the pure voice sample signal, wherein the voice model comprises a mapping relation of the acoustic characteristics of the noise voice sample signal and the acoustic characteristics of the pure voice sample signal, which is established through a recurrent neural network;
the second acquisition module is used for acquiring voice signals input by a user;
the second processing module is used for carrying out denoising processing on the voice signal according to the voice model when the voice signal contains noise;
and the voice recognition module is used for performing voice recognition on the denoised voice signal according to an acoustic model, wherein the acoustic model is obtained by training on pure voice samples.
6. The apparatus for processing the speech signal according to claim 5, wherein the first training module is further configured to:
and extracting the acoustic characteristics of the noise voice sample signal and the pure voice sample signal, and establishing a mapping relation between the acoustic characteristics of the noise voice sample signal and the acoustic characteristics of the pure voice sample signal.
7. The apparatus for processing the speech signal according to claim 5, wherein the speech recognition module is further configured to perform speech recognition on the speech signal according to the acoustic model when the speech signal does not contain noise.
8. The apparatus for processing a speech signal according to claim 7, further comprising:
the acquisition module is used for acquiring a voice training sample signal;
the third processing module is used for carrying out denoising processing on the voice training sample signal according to the voice model;
and the second training module is used for training the acoustic model according to the voice training sample signal after denoising processing.
CN201510866175.5A 2015-11-27 2015-11-27 Voice signal processing method and device Active CN105448303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510866175.5A CN105448303B (en) 2015-11-27 2015-11-27 Voice signal processing method and device

Publications (2)

Publication Number Publication Date
CN105448303A CN105448303A (en) 2016-03-30
CN105448303B true CN105448303B (en) 2020-02-04

Family

ID=55558409

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106409289B (en) * 2016-09-23 2019-06-28 合肥美的智能科技有限公司 Environment self-adaption method, speech recognition equipment and the household electrical appliance of speech recognition
CN106328126B (en) * 2016-10-20 2019-08-16 北京云知声信息技术有限公司 Far field voice recognition processing method and device
CN106557164A (en) * 2016-11-18 2017-04-05 北京光年无限科技有限公司 It is applied to the multi-modal output intent and device of intelligent robot
CN106888392A (en) * 2017-02-14 2017-06-23 广东九联科技股份有限公司 A kind of Set Top Box automatic translation system and method
CN109427340A (en) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 A kind of sound enhancement method, device and electronic equipment
CN108022596A (en) * 2017-11-28 2018-05-11 湖南海翼电子商务股份有限公司 Audio signal processing method and vehicle electronic device
CN108335694B (en) 2018-02-01 2021-10-15 北京百度网讯科技有限公司 Far-field environment noise processing method, device, equipment and storage medium
CN108428446B (en) * 2018-03-06 2020-12-25 北京百度网讯科技有限公司 Speech recognition method and device
CN110503967B (en) * 2018-05-17 2021-11-19 中国移动通信有限公司研究院 Voice enhancement method, device, medium and equipment
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
CN109378010A (en) * 2018-10-29 2019-02-22 珠海格力电器股份有限公司 Neural network model training method, voice denoising method and device
CN109616100B (en) * 2019-01-03 2022-06-24 百度在线网络技术(北京)有限公司 Method and device for generating voice recognition model
CN111862945A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110570845B (en) * 2019-08-15 2021-10-22 武汉理工大学 Voice recognition method based on domain invariant features
CN111243573B (en) * 2019-12-31 2022-11-01 深圳市瑞讯云技术有限公司 Voice training method and device
CN111081223B (en) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Voice recognition method, device, equipment and storage medium
CN110875050B (en) * 2020-01-17 2020-05-08 深圳亿智时代科技有限公司 Voice data collection method, device, equipment and medium for real scene
CN111354374A (en) * 2020-03-13 2020-06-30 北京声智科技有限公司 Voice processing method, model training method and electronic equipment
CN112201227B (en) * 2020-09-28 2024-06-28 海尔优家智能科技(北京)有限公司 Speech sample generation method and device, storage medium and electronic device
CN112259113B (en) * 2020-09-30 2024-07-30 清华大学苏州汽车研究院(相城) Preprocessing system for improving accuracy of in-vehicle voice recognition and control method thereof
CN113053404A (en) * 2021-03-22 2021-06-29 三一重机有限公司 Method and device for interaction between inside and outside of cab

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633842B1 (en) * 1999-10-22 2003-10-14 Texas Instruments Incorporated Speech recognition front-end feature extraction for noisy speech
JP4590692B2 (en) * 2000-06-28 2010-12-01 パナソニック株式会社 Acoustic model creation apparatus and method
US6876966B1 (en) * 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction
US7363221B2 (en) * 2003-08-19 2008-04-22 Microsoft Corporation Method of noise reduction using instantaneous signal-to-noise ratio as the principal quantity for optimal estimation
US7680656B2 (en) * 2005-06-28 2010-03-16 Microsoft Corporation Multi-sensory speech enhancement using a speech-state model
CN101154383B (en) * 2006-09-29 2010-10-06 株式会社东芝 Method and device for noise suppression, phonetic feature extraction, speech recognition and training voice model
KR101253102B1 (en) * 2009-09-30 2013-04-10 한국전자통신연구원 Apparatus for filtering noise of model based distortion compensational type for voice recognition and method thereof
CN103000174B (en) * 2012-11-26 2015-06-24 河海大学 Feature compensation method based on rapid noise estimation in speech recognition system
CN104485103B (en) * 2014-11-21 2017-09-01 东南大学 A kind of multi-environment model isolated word recognition method based on vector Taylor series
CN104900232A (en) * 2015-04-20 2015-09-09 东南大学 Isolation word identification method based on double-layer GMM structure and VTS feature compensation



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant