CN105448303A - Voice signal processing method and apparatus - Google Patents

Voice signal processing method and apparatus

Info

Publication number
CN105448303A
CN105448303A (application CN201510866175.5A)
Authority
CN
China
Prior art keywords
signal
noise
voice signal
speech
voice
Prior art date
Legal status
Granted
Application number
CN201510866175.5A
Other languages
Chinese (zh)
Other versions
CN105448303B (en)
Inventor
时雪煜
李先刚
邹赛赛
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510866175.5A
Publication of CN105448303A
Application granted
Publication of CN105448303B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 15/00: Speech recognition
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a voice signal processing method and apparatus. The method comprises: collecting noise sample signals; processing a pre-stored clean speech sample signal according to the noise sample signals to obtain a noisy speech sample signal; and training a speech model according to the noisy speech sample signal and the clean speech sample signal. With the trained speech model, the method greatly improves the accuracy of speech recognition in noisy environments and enhances both the robustness of the speech recognition service and the user experience.

Description

Voice signal processing method and apparatus
Technical field
The present invention relates to the field of speech recognition technology, and in particular to a voice signal processing method and a voice signal processing apparatus.
Background
Speech recognition refers to the automatic conversion of human speech into the corresponding text by a machine. In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, the performance of recognition systems has improved substantially.
In the related art, speech recognition obtains an acoustic model and a language model by training on a large number of clean speech samples. The larger the training set, the better the resulting acoustic model, and the higher the accuracy of speech recognition.
However, with the development of the mobile Internet, voice input has become increasingly common and its user base ever broader, and the environments in which different users speak vary greatly, particularly noisy environments such as in-vehicle noise while driving or crowd noise in restaurants and other crowded places. Existing speech recognition training lacks noisy speech samples, and the acoustic features of noisy speech differ greatly from those of clean speech. Consequently, the acoustic model of the related art achieves high recognition accuracy in quiet environments, but its accuracy drops sharply in noisy ones.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a voice signal processing method that greatly improves the accuracy of speech recognition in noisy environments and enhances both the robustness of the speech recognition service and the user experience.
A second object of the present invention is to propose a voice signal processing apparatus.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a voice signal processing method comprising the following steps: collecting a noise sample signal; processing a pre-stored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal; and training a speech model according to the noisy speech sample signal and the clean speech sample signal.
With the voice signal processing method of the embodiment of the present invention, noisy speech samples are generated from noise samples of different scenes together with clean speech samples, and a speech model is trained on the noisy and clean samples. The speech model can then convert a voice signal captured in any noisy environment into the corresponding signal in a quiet environment, greatly improving the accuracy of speech recognition in noisy environments and enhancing the robustness of the speech recognition service and the user experience.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes a voice signal processing apparatus comprising: a first collection module for collecting a noise sample signal; a first processing module for processing a pre-stored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal; and a first training module for training a speech model according to the noisy speech sample signal and the clean speech sample signal.
With the voice signal processing apparatus of the embodiment of the present invention, noisy speech samples are generated from noise samples of different scenes together with clean speech samples, and a speech model is trained on the noisy and clean samples. The speech model can then convert a voice signal captured in any noisy environment into the corresponding signal in a quiet environment, greatly improving the accuracy of speech recognition in noisy environments and enhancing the robustness of the speech recognition service and the user experience.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from the description, or will be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a voice signal processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a voice signal processing method according to a specific embodiment of the present invention;
Fig. 3 is a flowchart of a voice signal processing method according to another specific embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a voice signal processing apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a voice signal processing apparatus according to a specific embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a voice signal processing apparatus according to another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, intended to explain the present invention, and are not to be construed as limiting it.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, "multiple" means two or more unless specifically limited otherwise.
Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention also includes implementations in which functions are performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
A voice signal processing method and apparatus according to embodiments of the present invention are described below with reference to the accompanying drawings.
A voice signal processing method comprises the following steps: a) collecting a noise sample signal; b) processing a pre-stored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal; c) training a speech model according to the noisy speech sample signal and the clean speech sample signal.
Fig. 1 is a flowchart of a voice signal processing method according to an embodiment of the present invention.
As shown in Fig. 1, the voice signal processing method comprises the following steps:
S101: collect a noise sample signal.
Specifically, scene noise that may occur during speech recognition is collected as the noise sample signal. The scene noise may be collected in multiple different scenes, for example, in-vehicle noise during driving, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals captured in those environments can be processed, and the higher the accuracy of speech recognition.
S102: process the pre-stored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal.
Here, the clean speech sample signal is a speech sample signal recorded in a quiet environment, that is, a voice signal containing no noise. The collected noise sample signal is used to add noise to this quiet-environment speech sample signal, yielding a speech sample signal under noisy conditions, namely the noisy speech sample signal.
It should be understood that any existing method may be used to add noise to the clean speech sample signal; to avoid redundancy, the details are not repeated here.
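The noise-adding step, which the text leaves to existing methods, is commonly done by scaling the noise recording to a chosen signal-to-noise ratio before mixing. The sketch below is only an illustration of that idea, not the patent's own procedure; the 16 kHz rate, the test tone, and the 10 dB target are arbitrary assumptions.

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that adding it to `clean` yields the target SNR in dB,
    then return the noisy speech sample by sample."""
    if len(noise) < len(clean):
        # Loop a short noise recording to cover the whole utterance.
        noise = (noise * (len(clean) // len(noise) + 1))[:len(clean)]
    power = lambda x: sum(s * s for s in x) / len(x)
    p_clean, p_noise = power(clean), power(noise[:len(clean)])
    # SNR_dB = 10*log10(p_clean / (g**2 * p_noise)); solve for the gain g.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]

# Example: a synthetic 1 s "speech" tone mixed with random noise at 10 dB SNR.
random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1, 1) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Mixing the same clean utterance at several SNRs and with several scene noises is one way to multiply a clean corpus into the noisy training set described above.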
S103: train a speech model according to the noisy speech sample signal and the clean speech sample signal.
In one embodiment of the present invention, training the speech model according to the noisy speech sample signal and the clean speech sample signal further comprises: extracting the acoustic features of the noisy speech sample signal and of the clean speech sample signal, and establishing a mapping between the acoustic features of the noisy speech sample signal and those of the clean speech sample signal.
Specifically, the speech model may be obtained by extracting the acoustic features of the noisy and clean speech sample signals and using a recurrent neural network to learn the mapping from the noisy-speech features to the clean-speech features.
During speech recognition, because the speech model is trained on noisy speech sample signals, it can map a voice signal captured in a noisy environment to the corresponding signal in a quiet environment, so that the noisy signal is recognized accurately and recognition accuracy is improved. Moreover, because recurrent neural networks are highly robust, the model can establish such a mapping even for scene noise that was not included in training, so that voice signals recorded under unseen noise can also be recognized accurately.
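As a rough sketch of the mapping just described, the forward pass of a simple recurrent network over a sequence of noisy feature frames looks like the following. The layer sizes, tanh nonlinearity, and random weights are illustrative assumptions; a real system would learn the weights by regressing the output toward the clean-speech features of the paired sample.

```python
import numpy as np

def rnn_denoise(noisy_feats, Wx, Wh, Wy, bh, by):
    """One forward pass of a simple recurrent network that maps a sequence of
    noisy acoustic feature frames (T x D_in) to estimated clean frames
    (T x D_out)."""
    T, _ = noisy_feats.shape
    h = np.zeros(Wh.shape[0])
    out = []
    for t in range(T):
        # The hidden state depends on the current frame and the previous
        # state, letting the network exploit the temporal context of speech.
        h = np.tanh(noisy_feats[t] @ Wx + h @ Wh + bh)
        out.append(h @ Wy + by)
    return np.stack(out)

# Illustrative sizes: 40-dim filterbank-like features, 64 hidden units.
rng = np.random.default_rng(0)
D_in, H, D_out, T = 40, 64, 40, 100
Wx = rng.normal(0, 0.1, (D_in, H))
Wh = rng.normal(0, 0.1, (H, H))
Wy = rng.normal(0, 0.1, (H, D_out))
bh, by = np.zeros(H), np.zeros(D_out)
clean_est = rnn_denoise(rng.normal(size=(T, D_in)), Wx, Wh, Wy, bh, by)
```

With trained weights, `clean_est` would be the quiet-environment feature sequence that the recognizer then decodes.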
With the voice signal processing method of the embodiment of the present invention, noisy speech samples are generated from noise samples of different scenes together with clean speech samples, and a speech model is trained on the noisy and clean samples. The speech model can then convert a voice signal captured in any noisy environment into the corresponding signal in a quiet environment, greatly improving the accuracy of speech recognition in noisy environments and enhancing the robustness of the speech recognition service and the user experience.
Fig. 2 is a flowchart of a voice signal processing method according to a specific embodiment of the present invention.
As shown in Fig. 2, the voice signal processing method comprises the following steps:
S201: collect a noise sample signal.
Specifically, scene noise that may occur during speech recognition is collected as the noise sample signal. The scene noise may be collected in multiple different scenes, for example, in-vehicle noise during driving, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals captured in those environments can be processed, and the higher the accuracy of speech recognition.
S202: process the pre-stored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal.
Here, the clean speech sample signal is a speech sample signal recorded in a quiet environment, that is, a voice signal containing no noise. The collected noise sample signal is used to add noise to this quiet-environment speech sample signal, yielding a speech sample signal under noisy conditions, namely the noisy speech sample signal.
It should be understood that any existing method may be used to add noise to the clean speech sample signal; to avoid redundancy, the details are not repeated here.
S203: train a speech model according to the noisy speech sample signal and the clean speech sample signal.
In one embodiment of the present invention, training the speech model according to the noisy speech sample signal and the clean speech sample signal further comprises: extracting the acoustic features of the noisy speech sample signal and of the clean speech sample signal, and establishing a mapping between the acoustic features of the noisy speech sample signal and those of the clean speech sample signal.
Specifically, the speech model may be obtained by extracting the acoustic features of the noisy and clean speech sample signals and using a recurrent neural network to learn the mapping from the noisy-speech features to the clean-speech features.
During speech recognition, because the speech model is trained on noisy speech sample signals, it can map a voice signal captured in a noisy environment to the corresponding signal in a quiet environment, so that the noisy signal is recognized accurately and recognition accuracy is improved. Moreover, because recurrent neural networks are highly robust, the model can establish such a mapping even for scene noise that was not included in training, so that voice signals recorded under unseen noise can also be recognized accurately.
S204: collect a voice signal input by a user.
Specifically, the user's voice signal may be captured by a voice input device such as a microphone and then sent to the server for speech recognition. The trained speech model may be stored in a speech recognition cloud, and the collected voice signal sent to the cloud for recognition.
S205: determine whether the voice signal contains noise.
Specifically, after receiving the voice signal input by the user, the server performs signal-to-noise ratio (SNR) estimation on it in order to classify it. For example, when the SNR of the input voice signal is below a certain threshold, the signal is judged to contain noise; when the SNR is above the threshold, it is judged to be noise-free.
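The text does not specify how the SNR estimate is computed. One crude, purely illustrative approach is to compare the energies of the quietest and loudest frames; a real service would use a proper voice-activity detector, and the 15 dB threshold below is an assumption, not a value from the patent.

```python
import math
import random

def estimate_snr_db(samples, frame_len=160, noise_fraction=0.2):
    """Crude SNR estimate: treat the quietest fraction of frames as noise and
    the loudest fraction as speech plus noise, then compare average energies."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = sorted(sum(s * s for s in f) / frame_len for f in frames)
    k = max(1, int(len(energies) * noise_fraction))
    noise_e = sum(energies[:k]) / k
    speech_e = sum(energies[-k:]) / k
    return 10 * math.log10(max(speech_e - noise_e, 1e-12) / max(noise_e, 1e-12))

def is_noisy(samples, threshold_db=15.0):
    """Judge the recording as containing noise when the estimated SNR falls
    below the threshold (the 15 dB value is an assumption)."""
    return estimate_snr_db(samples) < threshold_db

# Synthetic check: a tone with silent pauses vs. the same signal plus noise.
random.seed(0)
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
quiet = [s if (i // 160) % 2 == 0 else 0.0 for i, s in enumerate(tone)]
noisy = [s + random.uniform(-1, 1) for s in quiet]
```

Recordings classified as noisy would then take the denoising path of step S206; the rest go straight to decoding.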
S206: if the voice signal contains noise, denoise it according to the speech model.
Specifically, if the voice signal is judged to contain noise, it can be concluded that the signal was recorded in a noisy environment. The signal is then denoised according to the speech model stored in advance on the server, that is, the recurrent neural network converts the user's input into the corresponding voice signal of a quiet environment.
In one embodiment of the present invention, the collected noise-containing voice signal is converted into a noise-free voice signal according to the mapping between noisy speech and clean speech for that noise environment preserved in the speech model.
S207: perform speech recognition on the denoised voice signal according to the acoustic model.
Specifically, after the user's input voice signal has been denoised, speech recognition is performed by the server's decoder: the decoder decodes the denoised voice signal according to the acoustic model pre-stored on the server, converts it into text, and feeds the recognition result back to the user. The acoustic model is obtained by training on a large number of clean speech samples.
S208: if the voice signal does not contain noise, perform speech recognition on it directly according to the acoustic model.
Specifically, if the voice signal is judged to be noise-free, it can be concluded that the signal was recorded in a quiet environment. No denoising by the speech model is needed; the server's decoder directly decodes the voice signal according to the acoustic model, converts it into text, and feeds the recognition result back to the user.
With the voice signal processing method of the embodiment of the present invention, a voice signal recorded in a noisy environment is preprocessed during recognition, being first converted into the corresponding quiet-environment signal before recognition, while a voice signal recorded in a quiet environment is recognized directly. This not only preserves recognition accuracy in quiet environments but also greatly improves recognition accuracy in noisy environments, thereby improving the accuracy, robustness, and user experience of the speech recognition service.
Fig. 3 is a flowchart of a voice signal processing method according to another specific embodiment of the present invention.
As shown in Fig. 3, the voice signal processing method comprises the following steps:
S301: collect a noise sample signal.
Specifically, scene noise that may occur during speech recognition is collected as the noise sample signal. The scene noise may be collected in multiple different scenes, for example, in-vehicle noise during driving, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals are collected, the more accurately voice signals captured in those environments can be processed, and the higher the accuracy of speech recognition.
S302: process the pre-stored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal.
Here, the clean speech sample signal is a speech sample signal recorded in a quiet environment, that is, a voice signal containing no noise. The collected noise sample signal is used to add noise to this quiet-environment speech sample signal, yielding a speech sample signal under noisy conditions, namely the noisy speech sample signal.
It should be understood that any existing method may be used to add noise to the clean speech sample signal; to avoid redundancy, the details are not repeated here.
S303: train a speech model according to the noisy speech sample signal and the clean speech sample signal.
In one embodiment of the present invention, training the speech model according to the noisy speech sample signal and the clean speech sample signal further comprises: extracting the acoustic features of the noisy speech sample signal and of the clean speech sample signal, and establishing a mapping between the acoustic features of the noisy speech sample signal and those of the clean speech sample signal.
Specifically, the speech model may be obtained by extracting the acoustic features of the noisy and clean speech sample signals and using a recurrent neural network to learn the mapping from the noisy-speech features to the clean-speech features.
During speech recognition, because the speech model is trained on noisy speech sample signals, it can map a voice signal captured in a noisy environment to the corresponding signal in a quiet environment, so that the noisy signal is recognized accurately and recognition accuracy is improved. Moreover, because recurrent neural networks are highly robust, the model can establish such a mapping even for scene noise that was not included in training, so that voice signals recorded under unseen noise can also be recognized accurately.
S304: obtain a voice training sample signal.
Specifically, even if a voice signal collected in a noisy environment is preprocessed, that is, denoised according to the speech model, the preprocessed signal may still contain residual noise. In this embodiment, therefore, the acoustic model used for speech recognition is retrained according to the recurrent neural network, so that the retrained acoustic model better matches the preprocessed voice signals and the accuracy of speech recognition is further improved.
Here, the voice training sample signal is the training data used when retraining the acoustic model; it is a voice signal recorded in a noisy environment, that is, noisy speech training data.
S305: denoise the voice training sample signal according to the speech model, and train the acoustic model according to the denoised voice training sample signal.
Specifically, the acoustic features of the voice training samples are first extracted and then mapped by the recurrent neural network according to the speech model; the processed acoustic features are used to retrain the existing acoustic model, yielding an acoustic model that better matches the features produced by the speech model.
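The retraining in S304 and S305 is, in shape, ordinary supervised fine-tuning with the fixed speech model inserted as a front end, so the acoustic model is trained on exactly the features it will see at recognition time. The sketch below is a deliberately toy version: a softmax frame classifier stands in for the real acoustic model, and the "denoising" front end is a stub scaling function; everything here is an illustrative assumption.

```python
import numpy as np

def retrain_acoustic_model(W, noisy_feats, labels, denoise, lr=0.1, epochs=5):
    """Fine-tune a toy linear (softmax) acoustic model W on features that are
    first passed through the fixed speech-model front end `denoise`, mirroring
    step S305."""
    X = denoise(noisy_feats)                    # fixed speech-model front end
    for _ in range(epochs):
        logits = X @ W
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        onehot = np.eye(W.shape[1])[labels]
        # One full-batch gradient step of softmax regression.
        W -= lr * X.T @ (probs - onehot) / len(X)
    return W

rng = np.random.default_rng(1)
X_noisy = rng.normal(size=(200, 8))             # toy "noisy" feature frames
y = (X_noisy[:, 0] > 0).astype(int)             # toy frame labels
W1 = retrain_acoustic_model(np.zeros((8, 2)), X_noisy, y,
                            denoise=lambda x: x * 0.5)  # stub front end
```

The design point is only that train-time and test-time features pass through the same front end; the model class and optimizer are incidental.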
S306: collect a voice signal input by a user.
Specifically, the user's voice signal may be captured by a voice input device such as a microphone and then sent to the server for speech recognition. The trained speech model may be stored in a speech recognition cloud, and the collected voice signal sent to the cloud for recognition.
S307: determine whether the voice signal contains noise.
Specifically, after receiving the voice signal input by the user, the server performs SNR estimation on it in order to classify it. For example, when the SNR of the input voice signal is below a certain threshold, the signal is judged to contain noise; when the SNR is above the threshold, it is judged to be noise-free.
S308: if the voice signal contains noise, denoise it according to the speech model.
Specifically, if the voice signal is judged to contain noise, it can be concluded that the signal was recorded in a noisy environment. The signal is then denoised according to the speech model stored in advance on the server, that is, the recurrent neural network converts the user's input into the corresponding voice signal of a quiet environment.
In one embodiment of the present invention, the collected noise-containing voice signal is converted into a noise-free voice signal according to the mapping between noisy speech and clean speech for that noise environment preserved in the speech model.
S309: perform speech recognition on the denoised voice signal according to the acoustic model.
Specifically, after the user's input voice signal has been denoised, speech recognition is performed by the server's decoder: the decoder decodes the denoised voice signal according to the acoustic model pre-stored on the server, converts it into text, and feeds the recognition result back to the user. The acoustic model is obtained by training on a large number of clean speech samples.
With the voice signal processing method of the embodiment of the present invention, the existing acoustic model is retrained on voice training samples so that the retrained model better matches the preprocessed voice signals, further improving the accuracy of speech recognition and the experience of the speech recognition service.
To implement the above embodiments, the present invention further proposes a voice signal processing apparatus.
Fig. 4 is a schematic structural diagram of a voice signal processing apparatus according to an embodiment of the present invention.
As shown in Fig. 4, the voice signal processing apparatus comprises a first collection module 10, a first processing module 20, and a first training module 30.
The first collection module 10 is configured to collect a noise sample signal. Specifically, the first collection module 10 collects scene noise that may occur during speech recognition as the noise sample signal. The scene noise may be collected in multiple different scenes, for example, in-vehicle noise during driving, crowd noise in a restaurant, or crowd noise in other crowded places. The more noise sample signals the first collection module 10 collects, the more accurately voice signals captured in those environments can be processed, and the higher the accuracy of speech recognition.
The first processing module 20 is configured to process the pre-stored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal. Here, the clean speech sample signal is a speech sample signal recorded in a quiet environment, that is, a voice signal containing no noise. The first processing module 20 uses the collected noise sample signal to add noise to the quiet-environment speech sample signal, yielding a speech sample signal under noisy conditions, namely the noisy speech sample signal.
The first training module 30 is configured to train a speech model according to the noisy speech sample signal and the clean speech sample signal. The first training module 30 extracts the acoustic features of the noisy speech sample signal and of the clean speech sample signal and establishes a mapping between them. Specifically, the first training module 30 may obtain the speech model by extracting these acoustic features and using a recurrent neural network to learn the mapping from the noisy-speech features to the clean-speech features.
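The module decomposition of Fig. 4 maps naturally onto three small classes. This is only an organizational sketch of the apparatus described above; the class names, stub data, and naive addition are all invented for illustration and stand in for the real collection, noise-adding, and training logic.

```python
class NoiseCollector:
    """First collection module 10: gathers noise sample signals from
    different scenes (stubbed with fixed data here)."""
    def collect(self):
        return {"vehicle": [0.01, -0.02, 0.03],
                "restaurant": [0.05, 0.04, -0.06]}

class NoiseMixer:
    """First processing module 20: adds collected noise to the pre-stored
    clean speech sample (naive addition stands in for any existing method)."""
    def process(self, clean, noise):
        return [c + n for c, n in zip(clean, noise)]

class SpeechModelTrainer:
    """First training module 30: pairs noisy and clean samples for training;
    returns the pairs in place of an actual trained recurrent network."""
    def train(self, noisy, clean):
        return list(zip(noisy, clean))

collector, mixer, trainer = NoiseCollector(), NoiseMixer(), SpeechModelTrainer()
clean_sample = [0.1, 0.2, 0.3]
noisy_sample = mixer.process(clean_sample, collector.collect()["vehicle"])
pairs = trainer.train(noisy_sample, clean_sample)
```

Keeping the three stages behind separate interfaces mirrors the claim structure: each module can be replaced (a new noise corpus, a different mixing rule, a different network) without touching the others.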
With the voice signal processing apparatus of the embodiment of the present invention, noisy speech samples are generated from noise samples of different scenes together with clean speech samples, and a speech model is trained on the noisy and clean samples. The speech model can then convert a voice signal captured in any noisy environment into the corresponding signal in a quiet environment, greatly improving the accuracy of speech recognition in noisy environments and enhancing the robustness of the speech recognition service and the user experience.
Fig. 5 is a schematic structural diagram of a voice signal processing apparatus according to a specific embodiment of the present invention.
As shown in Fig. 5, the voice signal processing apparatus comprises: a first acquisition module 10, a first processing module 20, a first training module 30, a second acquisition module 40, a second processing module 50, and a speech recognition module 60.
The second acquisition module 40 collects the voice signal entered by the user. Specifically, it may collect the user's voice signal through a voice input device such as a microphone, and then send the collected signal to the server side for speech recognition. The first training module 30 may store the trained speech model in a speech recognition cloud, in which case the second acquisition module 40 sends the collected voice signal to the cloud for recognition.
The second processing module 50 denoises the voice signal according to the speech model when the voice signal contains noise. Specifically, after receiving the voice signal collected by the second acquisition module 40, the second processing module 50 estimates the signal-to-noise ratio (SNR) of the user's voice signal in order to classify it. For example, when the SNR of the user's voice signal is below a certain threshold, the voice signal is judged to contain noise; when the SNR is above that threshold, it is judged not to contain noise. If the voice signal is judged to contain noise, the second processing module 50 determines that it was entered in a noisy environment and denoises it according to the speech model prestored on the server side; that is, the recurrent neural network converts the user's voice signal into the equivalent voice signal under quiet conditions.
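The patent does not state how the SNR estimate is computed. One common heuristic, sketched below, treats the quietest frames of the recording as noise-only and compares their energy to the overall energy; the function names, frame length, and threshold value are all assumptions.

```python
import numpy as np

def estimate_snr_db(signal, frame_len=400, noise_pct=10):
    """Crude SNR estimate: treat the quietest frames as noise-only and
    compare their energy to the overall energy (a heuristic assumption;
    the patent does not specify the estimator)."""
    signal = np.asarray(signal, dtype=float)
    n = (len(signal) // frame_len) * frame_len
    frames = signal[:n].reshape(-1, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    noise_floor = np.percentile(energy, noise_pct) + 1e-12
    total_power = np.mean(energy) + 1e-12
    return 10.0 * np.log10(total_power / noise_floor)

def is_noisy(signal, threshold_db=15.0):
    """Classify the recording: below the threshold it is judged to contain
    noise and should be routed through the speech model for denoising."""
    return estimate_snr_db(signal) < threshold_db
```

A production system would more likely use a voice activity detector or a trained classifier for this decision; the threshold split above only illustrates the two branches the description defines.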
The speech recognition module 60 performs speech recognition on the denoised voice signal according to an acoustic model. It is also used to recognize the voice signal directly when the signal contains no noise. Specifically, after the second processing module 50 denoises the voice signal, the speech recognition module 60 performs recognition through the server-side decoder: the decoder decodes the denoised voice signal according to the acoustic model prestored on the server, converts the voice signal into text, and feeds the recognition result back to the user. The acoustic model is obtained by training on a large amount of clean speech samples.
Specifically, if the voice signal is judged not to contain noise, the second processing module 50 determines that it was entered in a quiet environment. In that case the second processing module 50 does not need to denoise the signal with the speech model; instead, the speech recognition module 60 directly decodes the voice signal with the acoustic model through the server-side decoder, converts it into text, and feeds the recognition result back to the user.
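The branch described above (denoise only when the recording is judged noisy, then decode) reduces to a small dispatcher. In this sketch the denoiser and decoder are placeholder callables standing in for the patent's speech model and server-side decoder, and the threshold is an assumed value.

```python
def recognize(signal, snr_db, denoise, decode, threshold_db=15.0):
    """Route a user recording through the recognition front end.

    snr_db:   SNR estimate for the recording (computed upstream)
    denoise:  the trained speech model's noisy-to-clean mapping (placeholder)
    decode:   acoustic-model decoding to text (placeholder)
    """
    if snr_db < threshold_db:
        # Judged to have been entered in noise: convert to the
        # quiet-environment equivalent before decoding.
        signal = denoise(signal)
    return decode(signal)
```

The point of the design is that quiet recordings skip the front end entirely, so accuracy in quiet environments is unaffected by the denoiser.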
With the voice signal processing apparatus of this embodiment, voice signals entered in noisy environments are preprocessed during speech recognition: they are first converted into their quiet-environment equivalents and then recognized, while voice signals entered in quiet environments are recognized directly. This preserves recognition accuracy in quiet environments and greatly improves it in noisy ones, thereby improving the accuracy, robustness, and user experience of the speech recognition service.
Fig. 6 is a schematic structural diagram of a voice signal processing apparatus according to another specific embodiment of the present invention.
As shown in Fig. 6, the voice signal processing apparatus comprises: a first acquisition module 10, a first processing module 20, a first training module 30, a second acquisition module 40, a second processing module 50, a speech recognition module 60, an acquisition module 70, a third processing module 80, and a second training module 90.
The acquisition module 70 obtains voice training sample signals, i.e., the training data used when retraining the acoustic model. These are voice signals recorded under noisy conditions, namely noisy speech training data.
The third processing module 80 denoises the voice training sample signals according to the speech model, and the second training module 90 trains the acoustic model on the denoised voice training sample signals. Specifically, the third processing module 80 first extracts the acoustic features of the voice training samples, then maps those features through the recurrent neural network of the speech model; the second training module 90 retrains the existing acoustic model on the processed acoustic features, thus obtaining an acoustic model that better matches the features produced by the speech model.
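The retraining flow of modules 70, 80, and 90 reduces to: run every noisy training utterance's features through the speech model, then fine-tune the existing acoustic model on the resulting features so both stages see the same feature distribution. The sketch below treats both models as injected callables, since the patent does not disclose their internals; all names are illustrative.

```python
def retrain_acoustic_model(noisy_feature_seqs, denoise, fine_tune):
    """noisy_feature_seqs: acoustic features of noisy training utterances.
    denoise:   the trained speech model's feature mapping (placeholder)
    fine_tune: one retraining pass over the existing acoustic model,
               returning the updated model (placeholder)
    """
    denoised = [denoise(feats) for feats in noisy_feature_seqs]
    # The acoustic model is retrained on features that have passed through
    # the same front end used at recognition time, so the two stages match.
    return fine_tune(denoised)
```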
With the voice signal processing apparatus of this embodiment, the existing acoustic model is retrained on the voice training samples, so that the retrained acoustic model better matches the preprocessed voice signals. This further improves the accuracy of speech recognition and the experience of the speech recognition service.
It should be appreciated that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, a hardware implementation may, as in another embodiment, use any of the following technologies known in the art, alone or in combination: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and so on.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", and "coupled" should be interpreted broadly: for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct or indirect through an intermediary; and it may be an internal communication between two elements or an interaction between two elements. For those of ordinary skill in the art, the specific meanings of these terms in the present invention can be understood according to the particular circumstances.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the described specific features, structures, materials, or characteristics may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict, those skilled in the art may combine the different embodiments or examples described in this specification and the features thereof.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (10)

1. A voice signal processing method, characterized by comprising the following steps:
acquiring a noise sample signal;
processing a prestored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal;
training a speech model according to the noisy speech sample signal and the clean speech sample signal.
2. The voice signal processing method according to claim 1, characterized in that training the speech model according to the noisy speech sample signal and the clean speech sample signal further comprises:
extracting acoustic features of the noisy speech sample signal and the clean speech sample signal, and establishing a mapping relationship between the acoustic features of the noisy speech sample signal and the acoustic features of the clean speech sample signal.
3. The voice signal processing method according to claim 1 or 2, characterized by further comprising:
collecting a voice signal entered by a user;
when the voice signal contains noise, denoising the voice signal according to the speech model;
performing speech recognition on the denoised voice signal according to an acoustic model.
4. The voice signal processing method according to claim 3, characterized by further comprising:
when the voice signal does not contain noise, performing speech recognition on the voice signal according to the acoustic model.
5. The voice signal processing method according to claim 4, characterized by further comprising:
obtaining a voice training sample signal;
denoising the voice training sample signal according to the speech model, and training the acoustic model according to the denoised voice training sample signal.
6. A voice signal processing apparatus, characterized by comprising:
a first acquisition module for acquiring a noise sample signal;
a first processing module for processing a prestored clean speech sample signal according to the noise sample signal to obtain a noisy speech sample signal;
a first training module for training a speech model according to the noisy speech sample signal and the clean speech sample signal.
7. The voice signal processing apparatus according to claim 6, characterized in that the first training module is further configured to:
extract acoustic features of the noisy speech sample signal and the clean speech sample signal, and establish a mapping relationship between the acoustic features of the noisy speech sample signal and the acoustic features of the clean speech sample signal.
8. The voice signal processing apparatus according to claim 6 or 7, characterized by further comprising:
a second acquisition module for collecting a voice signal entered by a user;
a second processing module for denoising the voice signal according to the speech model when the voice signal contains noise;
a speech recognition module for performing speech recognition on the denoised voice signal according to an acoustic model.
9. The voice signal processing apparatus according to claim 8, characterized in that the speech recognition module is further configured to perform speech recognition on the voice signal according to the acoustic model when the voice signal does not contain noise.
10. The voice signal processing apparatus according to claim 9, characterized by further comprising:
an acquisition module for obtaining a voice training sample signal;
a third processing module for denoising the voice training sample signal according to the speech model;
a second training module for training the acoustic model according to the denoised voice training sample signal.
CN201510866175.5A 2015-11-27 2015-11-27 Voice signal processing method and device Active CN105448303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510866175.5A CN105448303B (en) 2015-11-27 2015-11-27 Voice signal processing method and device


Publications (2)

Publication Number Publication Date
CN105448303A true CN105448303A (en) 2016-03-30
CN105448303B CN105448303B (en) 2020-02-04

Family

ID=55558409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510866175.5A Active CN105448303B (en) 2015-11-27 2015-11-27 Voice signal processing method and device

Country Status (1)

Country Link
CN (1) CN105448303B (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1331467A (en) * 2000-06-28 2002-01-16 松下电器产业株式会社 Method and device for producing acoustics model
EP1199708A2 (en) * 2000-10-16 2002-04-24 Microsoft Corporation Noise robust pattern recognition
US6633842B1 (en) * 1999-10-22 2003-10-14 Texas Instruments Incorporated Speech recognition front-end feature extraction for noisy speech
CN1584984A (en) * 2003-08-19 2005-02-23 微软公司 Method of noise reduction using instantaneous signal-to-noise ratio as the principal quantity for optimal estimation
US20060293887A1 (en) * 2005-06-28 2006-12-28 Microsoft Corporation Multi-sensory speech enhancement using a speech-state model
CN101154383A (en) * 2006-09-29 2008-04-02 株式会社东芝 Method and device for noise suppression, phonetic feature extraction, speech recognition and training voice model
US20110077939A1 (en) * 2009-09-30 2011-03-31 Electronics And Telecommunications Research Institute Model-based distortion compensating noise reduction apparatus and method for speech recognition
CN103000174A (en) * 2012-11-26 2013-03-27 河海大学 Feature compensation method based on rapid noise estimation in speech recognition system
CN104485103A (en) * 2014-11-21 2015-04-01 东南大学 Vector Taylor series-based multi-environment model isolated word identifying method
CN104900232A (en) * 2015-04-20 2015-09-09 东南大学 Isolation word identification method based on double-layer GMM structure and VTS feature compensation


Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106409289B (en) * 2016-09-23 2019-06-28 合肥美的智能科技有限公司 Environment self-adaption method, speech recognition equipment and the household electrical appliance of speech recognition
WO2018054361A1 (en) * 2016-09-23 2018-03-29 合肥华凌股份有限公司 Environment self-adaptive method of speech recognition, speech recognition device, and household appliance
CN106409289A (en) * 2016-09-23 2017-02-15 合肥华凌股份有限公司 Environment self-adaptive method of speech recognition, speech recognition device and household appliance
CN106328126A (en) * 2016-10-20 2017-01-11 北京云知声信息技术有限公司 Far-field speech recognition processing method and device
CN106557164A (en) * 2016-11-18 2017-04-05 北京光年无限科技有限公司 It is applied to the multi-modal output intent and device of intelligent robot
CN106888392A (en) * 2017-02-14 2017-06-23 广东九联科技股份有限公司 A kind of Set Top Box automatic translation system and method
CN109427340A (en) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 A kind of sound enhancement method, device and electronic equipment
CN108022596A (en) * 2017-11-28 2018-05-11 湖南海翼电子商务股份有限公司 Audio signal processing method and vehicle electronic device
US11087741B2 (en) 2018-02-01 2021-08-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, device and storage medium for processing far-field environmental noise
CN108335694A (en) * 2018-02-01 2018-07-27 北京百度网讯科技有限公司 Far field ambient noise processing method, device, equipment and storage medium
CN108428446B (en) * 2018-03-06 2020-12-25 北京百度网讯科技有限公司 Speech recognition method and device
CN108428446A (en) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Audio recognition method and device
US10978047B2 (en) 2018-03-06 2021-04-13 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recognizing speech
CN110503967A (en) * 2018-05-17 2019-11-26 中国移动通信有限公司研究院 A kind of sound enhancement method, device, medium and equipment
CN110503967B (en) * 2018-05-17 2021-11-19 中国移动通信有限公司研究院 Voice enhancement method, device, medium and equipment
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
CN109378010A (en) * 2018-10-29 2019-02-22 珠海格力电器股份有限公司 Training method, the speech de-noising method and device of neural network model
CN109616100A (en) * 2019-01-03 2019-04-12 百度在线网络技术(北京)有限公司 The generation method and its device of speech recognition modeling
CN111862945A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110570845A (en) * 2019-08-15 2019-12-13 武汉理工大学 Voice recognition method based on domain invariant features
CN110570845B (en) * 2019-08-15 2021-10-22 武汉理工大学 Voice recognition method based on domain invariant features
CN111243573A (en) * 2019-12-31 2020-06-05 深圳市瑞讯云技术有限公司 Voice training method and device
CN111081223A (en) * 2019-12-31 2020-04-28 广州市百果园信息技术有限公司 Voice recognition method, device, equipment and storage medium
CN111081223B (en) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Voice recognition method, device, equipment and storage medium
CN110875050A (en) * 2020-01-17 2020-03-10 深圳亿智时代科技有限公司 Voice data collection method, device, equipment and medium for real scene
CN111354374A (en) * 2020-03-13 2020-06-30 北京声智科技有限公司 Voice processing method, model training method and electronic equipment
CN112201227A (en) * 2020-09-28 2021-01-08 海尔优家智能科技(北京)有限公司 Voice sample generation method and device, storage medium and electronic device
CN112259113A (en) * 2020-09-30 2021-01-22 清华大学苏州汽车研究院(相城) Preprocessing system for improving accuracy rate of speech recognition in vehicle and control method thereof
CN113053404A (en) * 2021-03-22 2021-06-29 三一重机有限公司 Method and device for interaction between inside and outside of cab

Also Published As

Publication number Publication date
CN105448303B (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN105448303A (en) Voice signal processing method and apparatus
CN107068161B (en) Speech noise reduction method and device based on artificial intelligence and computer equipment
JP6393730B2 (en) Voice identification method and apparatus
CN108665895B (en) Method, device and system for processing information
WO2018068396A1 (en) Voice quality evaluation method and apparatus
JP6099556B2 (en) Voice identification method and apparatus
EP2643981B1 (en) A device comprising a plurality of audio sensors and a method of operating the same
CN103578468B (en) The method of adjustment and electronic equipment of a kind of confidence coefficient threshold of voice recognition
US9570072B2 (en) System and method for noise reduction in processing speech signals by targeting speech and disregarding noise
CN110047481B (en) Method and apparatus for speech recognition
CN106328151B (en) ring noise eliminating system and application method thereof
CN109036412A (en) voice awakening method and system
CN108335694B (en) Far-field environment noise processing method, device, equipment and storage medium
CN103229238A (en) System and method for producing an audio signal
CN105788603A (en) Audio identification method and system based on empirical mode decomposition
CN105206271A (en) Intelligent equipment voice wake-up method and system for realizing method
EP3671743B1 (en) Voice activity detection method
KR20160032138A (en) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
CN205508398U (en) Intelligent robot with high in clouds interactive function
CN103971681A (en) Voice recognition method and system
CN104811559A (en) Noise reduction method, communication method and mobile terminal
CN111415653B (en) Method and device for recognizing speech
CN110992967A (en) Voice signal processing method and device, hearing aid and storage medium
CN107274895B (en) Voice recognition device and method
CN110428835A (en) A kind of adjusting method of speech ciphering equipment, device, storage medium and speech ciphering equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant