CN114664288A - Voice recognition method, device, equipment and storage medium - Google Patents

Voice recognition method, device, equipment and storage medium

Info

Publication number
CN114664288A
Authority
CN
China
Prior art keywords
voice
domain
data
network
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011524726.7A
Other languages
Chinese (zh)
Inventor
肖龙帅
杨占磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011524726.7A priority Critical patent/CN114664288A/en
Publication of CN114664288A publication Critical patent/CN114664288A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 - Speech recognition
                    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 - Training
                    • G10L 15/08 - Speech classification or search
                        • G10L 15/16 - Speech classification or search using artificial neural networks
                    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
                    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L 2015/223 - Execution procedure of a spoken command
                    • G10L 15/26 - Speech to text systems
                    • G10L 15/28 - Constructional details of speech recognition systems
                        • G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
                • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
                        • G10L 21/0208 - Noise filtering
                            • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
                            • G10L 2021/02082 - Noise filtering the noise being echo, reverberation of the speech
                • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
                        • G10L 25/24 - Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application provides a voice recognition method, apparatus, device and storage medium, relating to the technical field of artificial intelligence and in particular to the field of voice recognition. The method comprises the following steps: firstly, voice data collected in a non-echo scene is used as voice sample data of a source domain and voice data collected in an echo scene is used as voice sample data of a target domain; the voice recognition network is trained with the voice sample data of the source domain, and at the same time the feature extraction network and the domain classification network, which are in an adversarial relationship, are trained with the voice sample data of the source domain and of the target domain, so that the features extracted by the feature extraction network are domain-invariant features that the voice recognition network can recognize; finally, a voice recognition model that is robust in echo scenes is obtained.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
With the continuous development of computer science, especially Artificial Intelligence (AI) technologies, speech recognition technology has begun to move from the laboratory to the market and is applied in more and more fields, such as industrial control, smart homes, intelligent toys, and voice control of terminal devices. Speech recognition technology makes information easier to acquire and process, improves users' working efficiency, and brings convenience to people's lives.
However, the echo phenomenon degrades the accuracy of speech recognition. As shown in fig. 1, in a closed space 1 such as a room, the speaker 21 of a speech recognition device 20 plays audio; the audio signal is reflected by the walls and other surfaces along multiple paths and is picked up again by the microphone 22 of the device, and the echo signal received by the speech recognition module 23 drowns out the target user's speech that the microphone is meant to pick up, so that the device cannot accurately recognize the user's speech.
Therefore, it is highly desirable to develop a speech recognition device that is not affected by the echo phenomenon.
Summary of the application
Embodiments of the present application provide a speech recognition method, apparatus, device, and storage medium, which enhance robustness of a speech recognition model in an echo scene, so that a speech recognition device having the speech recognition model is not affected by an echo phenomenon.
In a first aspect, the present application provides a voice recognition method. The method includes: first obtaining voice sample data of a source domain, including a source domain voice sample, a text label of the source domain voice sample and a domain label of the source domain, and voice sample data of a target domain, including a target domain voice sample and a domain label of the target domain, wherein the source domain voice sample does not include echo data and the target domain voice sample includes echo data; then extracting features of the source domain voice sample based on a feature extraction network to obtain a first voice feature, and extracting features of the target domain voice sample to obtain a second voice feature; inputting the first voice feature as a sample feature into a voice recognition network to obtain a voice recognition result; inputting the first voice feature and the second voice feature as sample features into a domain classification network to obtain a domain classification result; jointly training the feature extraction network, the voice recognition network and the domain classification network according to the voice recognition result and the text label, and according to the domain classification result and the domain label, to obtain the trained feature extraction network and the trained voice recognition network as a voice recognition model; and finally inputting voice data to be recognized into the trained voice recognition model to obtain a voice recognition result.
In this voice recognition method, the acoustic model is trained in an adversarial multi-task manner, and the network structure includes a feature extraction network, a voice recognition network and a domain classification network. First, voice data collected in a non-echo scene is used as voice sample data of the source domain and voice data collected in an echo scene is used as voice sample data of the target domain. The voice recognition network is trained with the voice sample data of the source domain, and at the same time the feature extraction network and the domain classification network, which are in an adversarial relationship, are trained with the voice sample data of the source domain and of the target domain, so that the features extracted by the feature extraction network are domain-invariant features that the voice recognition network can recognize. A voice recognition model that is robust in echo scenes is finally obtained.
In one possible implementation, the domain classification network includes a gradient inversion layer and a domain classification layer, and the gradient inversion layer places the feature extraction network and the domain classification layer in an adversarial relationship. The source domain voice samples with the domain label of the source domain, and the target domain voice samples with the domain label of the target domain, are respectively input into the feature extraction network, and the feature extraction network and the domain classification layer are trained.
In one possible implementation, the training the feature extraction network and the domain classification layer includes: in the forward propagation training process, the voice features extracted by the feature extraction network are input into the domain classification layer through a gradient inversion layer, and the domain classification layer updates the parameters of the domain classification layer according to the domain classification result and the domain label; and in the back propagation training process, the gradient inversion layer inverts the gradient of the domain classification layer and then transmits the inverted gradient to the feature extraction network so as to update the parameters of the feature extraction network.
In another possible implementation, the jointly training the feature extraction network, the speech recognition network, and the domain classification network includes: carrying out weighted summation on the loss function of the voice recognition network and the loss function of the domain classification network to obtain a total loss function; and performing joint training on the feature extraction network, the voice recognition network and the domain classification network by minimizing a total loss function.
In another possible implementation, the loss function of the domain classification network is a cross-entropy loss function, or a KL divergence loss function; or the domain label comprises a soft label, and the loss function of the domain classification network is a uniform distribution function obtained based on the soft label.
In another possible implementation, the obtaining voice sample data of the source domain and voice sample data of the target domain includes: collecting voice data, and judging whether the voice data comprises far-end voice data; if so, taking the voice data as a voice sample of a target domain, and labeling a domain label of the target domain; if not, the voice data is used as a source domain voice sample, and the text label and the domain label of the source domain are labeled.
In another possible implementation, the determining whether the voice data includes far-end voice data includes: judging whether the far-end voice energy in the voice data is larger than a first preset threshold value or not; if yes, the remote voice data is included; if not, the remote voice data is not included.
In another possible implementation, before extracting the features of the source domain voice sample based on the feature extraction network to obtain the first voice feature, the method further includes: judging whether the voice sample data of the source domain and the voice sample data of the target domain are greater than or equal to a preset number, and whether the difference between the confidence of the voice recognition results output by the voice recognition network for the source domain voice samples and the confidence of the voice recognition results output by the voice recognition network for the target domain voice samples is greater than or equal to a second preset threshold; and if so, performing the step of extracting the features of the source domain voice sample based on the feature extraction network to obtain the first voice feature.
In another possible implementation, the target domain includes a plurality of target domains determined based on signal-to-echo energy ratios.
In another possible implementation, the echo data includes at least: one or more of audio data processed by the echo cancellation module, far-end voice data, echo estimation data of the echo cancellation module, residual echo data of the echo suppression module, and echo path data estimated by the echo cancellation module.
In another possible implementation, the training process of the speech recognition model is completed in a cloud server, and the text labels of the source domain speech samples are obtained based on manual labeling.
In another possible implementation, the training process of the speech recognition model is completed at the terminal device; the obtaining of the voice sample data of the source domain and the voice sample data of the target domain includes: the terminal equipment collects voice data and judges whether the voice data comprises far-end voice data, if so, the voice data is used as a target domain voice sample and is labeled with a domain label of a target domain, and if not, the voice data is used as a source domain voice sample and is labeled with a domain label of a source domain; and the text label of the source domain voice sample is obtained based on the recognition of a preset acoustic model in the terminal equipment.
In this way, on the one hand, iterative optimization of the model for problems of different degrees is completed automatically, without collecting data offline and then pushing an updated model to the user equipment; on the other hand, the data is collected on the device side and the user is not required to label it, so no offline manual labeling or uploading of user data to the cloud is involved, ensuring that the user's personal data never leaves the user's device and protecting the user's privacy; in addition, personalized optimization is performed for the different device states and usage scenarios of different users, improving the voice recognition performance of the user equipment in echo scenes.
In another possible implementation, the voice recognition result is the voice content of the voice data to be recognized, or whether the voice data to be recognized contains a wakeup word, or whether the voiceprint feature of the voice data to be recognized matches a preset voiceprint feature.
In a second aspect, the present application further provides a voice recognition apparatus, including: an acquisition module, configured to acquire voice sample data of a source domain and voice sample data of a target domain, where the voice sample data of the source domain includes a source domain voice sample, a text label of the source domain voice sample and a domain label of the source domain, and the voice sample data of the target domain includes a target domain voice sample and a domain label of the target domain, wherein part of the distribution characteristics of the source domain can be transferred to the target domain through transfer learning, the source domain voice samples do not include echo data, and the target domain voice samples include echo data; an extraction module, configured to extract features of the source domain voice sample to obtain a first voice feature and extract features of the target domain voice sample to obtain a second voice feature, based on a feature extraction network; a training module, configured to input the first voice feature as a sample feature into a voice recognition network to obtain a voice recognition result, input the first voice feature and the second voice feature as sample features into a domain classification network to obtain a domain classification result, and jointly train the feature extraction network, the voice recognition network and the domain classification network according to the voice recognition result and the text label, and according to the domain classification result and the domain label, to obtain the trained feature extraction network and the trained voice recognition network as a voice recognition model; and a recognition module, configured to input voice data to be recognized into the trained voice recognition model to obtain a voice recognition result.
In one possible implementation, the domain classification network includes a gradient inversion layer and a domain classification layer, and the gradient inversion layer places the feature extraction network and the domain classification layer in an adversarial relationship. The source domain voice samples with the domain label of the source domain, and the target domain voice samples with the domain label of the target domain, are respectively input into the feature extraction network, and the feature extraction network and the domain classification layer are trained.
In another possible implementation, the training the feature extraction network and domain classification layer includes: in the forward propagation training process, the voice features extracted by the feature extraction network are input into the domain classification layer through a gradient inversion layer, and the domain classification layer updates the parameters of the domain classification layer according to the domain classification result and the domain label; and in the back propagation training process, the gradient inversion layer inverts the gradient of the domain classification layer and then transmits the inverted gradient to the feature extraction network so as to update the parameters of the feature extraction network.
In another possible implementation, the training module is specifically configured to: carrying out weighted summation on the loss function of the voice recognition network and the loss function of the domain classification network to obtain a total loss function; and performing joint training on the feature extraction network, the voice recognition network and the domain classification network by minimizing a total loss function.
In another possible implementation, the loss function of the domain classification network is a cross-entropy loss function, or a KL divergence loss function; or the domain label comprises a soft label, and the loss function of the domain classification network is a uniform distribution function obtained based on the soft label.
In another possible implementation, the obtaining module is specifically configured to: collecting voice data, and judging whether the voice data comprises remote voice data; if so, taking the voice data as a target domain voice sample, and labeling a domain label of the target domain; if not, the voice data is used as a source domain voice sample, and the text label and the domain label of the source domain are labeled.
In another possible implementation, the determining whether the voice data includes far-end voice data includes: judging whether the far-end voice energy in the voice data is larger than a first preset threshold value or not; if yes, the remote voice data is included; if not, the remote voice data is not included.
In another possible implementation, the apparatus further includes: a judging module, configured to judge whether the voice sample data of the source domain and the voice sample data of the target domain are greater than or equal to a preset number, and whether the difference between the confidence of the voice recognition results output by the voice recognition network for the source domain voice samples and the confidence of the voice recognition results output by the voice recognition network for the target domain voice samples is greater than or equal to a second preset threshold; and if so, the extraction module performs the step of extracting the features of the source domain voice sample based on the feature extraction network to obtain the first voice feature.
In another possible implementation, the target domain includes a plurality of target domains determined based on signal-to-echo energy ratios.
In another possible implementation, the echo data includes at least: one or more of audio data processed by the echo cancellation module, far-end voice data, echo estimation data of the echo cancellation module, residual echo data of the echo suppression module, and echo path data estimated by the echo cancellation module.
In another possible implementation, the training process of the speech recognition model is completed in a cloud server, and the text labels of the source domain speech samples are obtained based on manual labeling.
In another possible implementation, the training process of the speech recognition model is completed at the terminal device; the acquiring of the voice sample data of the source domain and the voice sample data of the target domain comprises: the terminal equipment collects voice data and judges whether the voice data comprises far-end voice data, if so, the voice data is used as a target domain voice sample and is labeled with a domain label of a target domain, and if not, the voice data is used as a source domain voice sample and is labeled with a domain label of a source domain; and the text label of the source domain voice sample is obtained based on the recognition of a preset acoustic model in the terminal equipment.
In another possible implementation, the voice recognition result is the voice content of the voice data to be recognized, or whether the voice data to be recognized contains a wakeup word, or whether the voiceprint feature of the voice data to be recognized matches a preset voiceprint feature.
In a third aspect, the present application further provides a speech recognition device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method described in the first aspect or any one of the possible implementation manners of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having a computer program stored thereon, which, when executed on a computer, causes the computer to perform the method of the first aspect or any one of the possible implementation manners of the first aspect.
Drawings
FIG. 1 is a schematic diagram of echo phenomenon;
FIG. 2a is a histogram of word error rate of speech recognition as a function of echo energy ratio in an echo scenario;
FIG. 2b is a schematic diagram of a first embodiment of a speech recognition apparatus;
FIG. 2c is a schematic diagram illustrating the echo cancellation in the second scheme;
fig. 2d is a schematic diagram of residual echo estimation in a third scheme;
FIG. 2e is a schematic structural diagram of a speech recognition device in the fourth scheme;
FIG. 3 is a schematic structural diagram of a speech recognition model in a training process according to an embodiment of the present application;
FIG. 4 is a flowchart of a speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a domain classification network in an embodiment of the present application;
fig. 6 is a schematic diagram of a training process of a speech recognition model in a cloud server in an embodiment of the present application;
FIG. 7 is a diagram illustrating a training process of a speech recognition model in a speech recognition apparatus according to an embodiment of the present application;
FIG. 8 is an architecture diagram of a speech recognition system provided by an embodiment of the present application;
fig. 9 is an architecture diagram of a voice wake-up system according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
The technical solution of the present application is further described in detail by the accompanying drawings and examples.
The voice recognition interaction scene under the echo scene comprises: non-wake-up interruption, specific person wake-up, multi-language voice recognition, and the like.
For example, when a voice robot or a smart speaker is playing music or telling a story, the user wants the device to recognize voice commands such as "stop" or "next" so as to control the device directly, that is, interruption without a wake-up word.
For another example, when a voice robot or a smart speaker is playing music or telling a story, the user may wish to interrupt the playback with a wake-up word, such as "wake-up word one", that is, wake-up interruption.
For another example, when a voice robot or a smart speaker is playing music or telling a story, only a specific person is allowed to wake up and interrupt the device, that is, specific-person wake-up.
For another example, while a voice robot or a smart speaker plays background music, the device can accurately recognize what the user says whether the user speaks Chinese or another language (such as English), that is, multi-language speech recognition.
The echo phenomenon seriously affects the speech recognition accuracy of the speech recognition apparatus. As shown in fig. 2a, as the echo energy ratio of the voice data decreases (a smaller echo energy ratio represents a stronger echo), the Word Error Rate (WER) of the voice recognition significantly increases.
Therefore, in order to realize the voice recognition function in the echo scene, the problem that the echo phenomenon affects the accuracy of voice recognition needs to be solved.
The first solution is to add an echo cancellation module, as shown in fig. 2b, add an echo cancellation module 24 in the speech recognition device 2, so that the audio data picked up by the microphone is processed by the echo cancellation module 24, and then enters the speech recognition module 23 for speech recognition.
Although the echo cancellation module solves the echo problem in most speech recognition interaction scenes, in intelligent voice interaction scenes the usability of the speech recognition module can still be significantly reduced: challenges such as the nonlinearity of low-cost loudspeakers, the strong echo levels caused by far-field voice interaction, and the correlation between channels in multi-channel setups lead to near-end speech distortion or a significant residue of far-end speech. In addition, as user equipment ages, the loudspeakers of different user devices develop nonlinear problems of different degrees, and it is difficult for a unified offline algorithm to optimize the performance of the related speech algorithms for these different degrees of nonlinearity.
The second scheme is an echo cancellation scheme based on signal processing. The echo path is modeled linearly and estimated by adaptive filtering, yielding the user speech after echo control; the residual echo is then processed by a Wiener filtering method to further suppress it.
Specifically, as shown in fig. 2c, a residual echo suppression module 25 is added after the echo cancellation module 24. The microphone signal containing the echo is modeled as

d = s + y,

where d is the microphone signal, s is the near-end user speech and y is the echo signal. The signal after processing by the echo cancellation algorithm is

e = d - y_hat = s + (y - y_hat),

where y_hat is the echo estimated by the adaptive filter and y - y_hat is the residual echo. The residual echo suppression module based on Wiener filtering computes a spectral suppression gain from an estimate of the residual echo and applies it to e to suppress the echo residue.
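For illustration, a minimal sketch of such a Wiener-style residual echo suppressor is given below; the specific gain formula, the use of the echo estimate as a proxy for the residual echo spectrum, and the FFT size are assumptions based on common practice rather than formulas taken from the application.

```python
import numpy as np

def suppress_residual_echo(e_frame, echo_est_frame, n_fft=512, floor=0.05):
    """Apply a Wiener-style spectral gain to one frame of the AEC output.

    e_frame        : AEC output frame e = d - y_hat (near-end speech + residual echo)
    echo_est_frame : echo estimate y_hat from the echo cancellation module,
                     used here (assumption) as a rough proxy for the residual echo
    """
    E = np.fft.rfft(e_frame, n_fft)            # spectrum of the AEC output
    R = np.fft.rfft(echo_est_frame, n_fft)     # assumed residual echo spectrum

    p_e = np.abs(E) ** 2                       # power of the AEC output
    p_r = np.abs(R) ** 2                       # assumed residual echo power

    # Wiener-style gain: keep near-end speech, attenuate the residual echo
    gain = np.maximum(1.0 - p_r / (p_e + 1e-12), floor)

    s_hat = np.fft.irfft(gain * E, n_fft)      # enhanced near-end speech
    return s_hat[: len(e_frame)]
```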
although echoes of partial scenes are effectively inhibited by the scheme, the processing effect on nonlinear distortion and strong echo scenes is poor, so that the strong echoes are high-frequency scenes due to the fact that the loudspeaker has obvious nonlinear distortion and the loudspeaker is close to the microphone for voice interaction products such as intelligent sound boxes, robots and mobile phones, and finally the processing scheme is difficult to meet interaction requirements in the voice interaction products.
In addition, the scheme can bring serious nonlinear distortion, and voice interaction functions, including voice recognition, awakening, speaker recognition, language recognition and the like, are very sensitive to the nonlinear distortion, so that the voice interaction performance is influenced.
The third scheme is to learn the residual echo through a neural network from the deep learning point of view, as shown in fig. 2d, where the input of the network is the output of the echo cancellation module, and the output of the network is the estimation of the residual echo.
The adoption of the neural network mode can effectively inhibit the residual echo by utilizing the strong learning capability of the neural network.
On one hand, the neural network learning needs a large amount of paired labeled data, the acquisition cost is high, and the residual echo output by the echo cancellation module is weaker than the voice of a near-end user under normal conditions, so that the residual echo is difficult to effectively estimate; on the other hand, the learning target of the scheme is irrelevant to the voice recognition, and the accuracy of the voice recognition cannot be guaranteed.
A fourth solution is to directly feed the original microphone signal, the residual echo signal, the echo estimation signal, the reference signal, etc. to the speech recognition model, as shown in fig. 2 e. And training the voice recognition model through the labeled training data, thereby improving the voice recognition effect in an echo scene.
In this scheme, on the one hand, the neural network can improve the processing effect in echo scenes; on the other hand, because the speech recognition model is trained in a multi-condition manner on speech recognition data, the problem of the first three schemes, in which the speech recognition effect cannot be improved, can be avoided. However, training the speech recognition model requires a large amount of labeled data and therefore a large training cost, and multi-condition training does not necessarily guarantee the robustness of the speech recognition model to residual echo, so the improvement of the speech recognition effect in echo scenes is limited.
Fig. 3 is a schematic structural diagram of a speech recognition model in a training process according to an embodiment of the present application. As shown in fig. 3, the structure includes a first branch consisting of the feature extraction network 30 and the speech recognition network 31, and a second branch consisting of the feature extraction network 30 and the domain discrimination network 32.
Firstly, voice data in a non-echo scene is acquired as voice sample data of a source domain, and voice data in an echo scene is acquired as voice sample data of a target domain.
Then, the voice sample data of the source domain is input into the first branch to train the voice recognition network and the feature extraction network, adjusting the parameters of the feature extraction network and the voice recognition network; the voice sample data of the source domain and of the target domain is input into the second branch to train the domain discrimination network and the feature extraction network, adjusting the parameters of the discrimination network and the feature extraction network, so that the features extracted by the feature extraction network are domain-invariant features that the voice recognition network can recognize.
Finally, the trained feature extraction network and the trained voice recognition network are used as the voice recognition model to perform voice recognition on the voice data to be recognized, yielding a voice recognition model with better robustness in echo scenes.
It will be appreciated that a "model" as referred to herein may learn from training data the associations between respective inputs and outputs, such that after training is complete, for a given input, a corresponding output may be generated. The "model" may also be referred to as a "neural network", "learning model" or "learning network", etc.
Fig. 4 is a flowchart of a speech recognition method according to an embodiment of the present application. As shown in fig. 4, the speech recognition method includes the steps of:
s401, acquiring voice sample data of a source domain and voice sample data of a target domain.
The voice sample data of the source domain comprises a source domain voice sample, a text label of the source domain voice sample and a domain label of the source domain, and the voice sample data of the target domain comprises a target domain voice sample and a domain label of the target domain; the source domain voice samples do not include echo data and the target domain voice samples include echo data. For example, voice data collected in a non-echo scene may be used as source domain voice samples, and voice data collected in an echo scene may be used as target domain voice samples. The present application thus provides a source-domain/target-domain modeling approach for which training data can be conveniently obtained.
It should be explained that an echo scene is a scene in which an echo phenomenon is produced: in a closed or partially closed space (e.g. a room), the audio signal played by the speaker of the speech recognition device encounters obstacles and is received again, via various paths, by the sound pickup assembly of the speech recognition device. A non-echo scene is a scene in which no echo phenomenon is produced, for example when the speaker of the speech recognition device is not playing audio or the space is not enclosed.
The voice samples of the source domain and the voice samples of the target domain may be acquired through a pickup component configured in the voice recognition device, for example, a microphone or an array microphone, that is, the present application does not limit the acquisition of the voice samples to a single channel or multiple channels. Or may be acquired by data transmission.
The voice recognition device can be a device provided with a pickup assembly, such as an intelligent sound box, a smart phone, a tablet computer, a notebook computer, a palm computer, a personal digital assistant and an intelligent wearable device.
Because the source domain voice samples contain no echo data, there is no problem of echo affecting the recognition accuracy of a voice recognition model, and the samples can be automatically labeled after being recognized by an existing voice recognition model; the text labels of the source domain voice samples are therefore easy to obtain, and the first branch can be trained in a supervised manner. The target domain voice samples contain echo data, so echo does affect the recognition accuracy of a voice recognition model and the samples cannot be automatically labeled by an existing voice recognition model; labeling all of them manually would involve an enormous workload, so the second branch is trained in an unsupervised manner.
For example, the labeling of the text label of the source domain speech sample may be automatically labeled after being recognized by a preset acoustic model in the speech recognition device, may also be automatically labeled after being recognized by any device with a speech recognition function, or training data of other acoustic models that have been manually labeled, which is not limited in the present application.
In one example, the source domain voice samples and the target domain voice samples are input into the feature extraction network in batches, and the domain label of the source domain and the domain label of the target domain are automatically assigned at input time. Alternatively, whether echo data exists is judged at input time and the domain label is assigned automatically accordingly: samples containing echo data are automatically labeled with the domain label of the target domain, and samples containing no echo data are automatically labeled with the domain label of the source domain.
S402, extracting the characteristics of the source domain voice sample to obtain a first voice characteristic based on the characteristic extraction network, and extracting the characteristics of the target domain voice sample to obtain a second voice characteristic.
The first speech feature and the second speech feature are acoustic features, typically spectral features of the speech data, such as Mel Frequency Cepstral Coefficient (MFCC) features or FBank (filter bank) features.
In this embodiment, the first speech feature and the second speech feature extracted by the feature extraction network are FBank features. FBank feature extraction is a front-end processing algorithm that processes audio in a manner similar to the human ear and can improve speech recognition performance. The extraction process mainly comprises Fourier transform, energy spectrum computation, Mel filtering, and taking the logarithm; for brevity, the details are omitted here, and reference may be made to the FBank feature extraction process in the prior art.
In general, the purpose of extracting FBank features is to reduce the dimensionality of the audio data. For example, for an audio segment one second long, the array obtained at a sampling rate of 16000 samples per second is very long; by extracting FBank features, each frame of audio can be reduced to, for example, an 80-dimensional feature vector.
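As a rough illustration of this front end, the following sketch computes log Mel filter-bank (FBank) features with librosa; the 16 kHz sampling rate, the 25 ms window with 10 ms shift, and the 80 Mel bands are illustrative assumptions rather than values mandated by the application.

```python
import numpy as np
import librosa

def extract_fbank(wav_path, sr=16000, n_mels=80):
    """Compute log Mel filter-bank (FBank) features: STFT -> power spectrum
    -> Mel filtering -> log, yielding one n_mels-dim vector per 10 ms frame."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mel_spec = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=400,          # 25 ms window at 16 kHz
        hop_length=160,     # 10 ms frame shift
        n_mels=n_mels,
    )
    fbank = np.log(mel_spec + 1e-10)   # log compression
    return fbank.T                     # shape: (num_frames, n_mels)
```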
The feature extraction network may specifically be a deep Neural network such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
S403, inputting the first voice characteristic as a sample characteristic into a voice recognition network to obtain a voice recognition result; and inputting the first voice characteristic and the second voice characteristic serving as sample characteristics into a domain classification network to obtain a domain classification result.
In one example, the speech recognition network and the domain classification network may be Neural Networks such as Long Short Term Memory (LSTM), Deep Neural Networks (DNN), and Convolutional Neural Networks (CNN).
For example, the speech recognition network is a DNN with 6 hidden layers of 1024 units each and 1936 output nodes, and the domain classification network is a fully connected network with one hidden layer.
The speech recognition network may be an existing speech recognition network trained in a non-echo scenario, or an untrained speech recognition network, which is not limited in this application. The voice recognition network outputs a voice recognition result according to the first voice characteristic, and the domain classification network outputs a domain classification result according to the first voice characteristic and the second voice characteristic.
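As an illustration of this example configuration, a minimal PyTorch sketch of the two networks is given below; the input dimensionality, the hidden size of the domain classification layer, and the two-domain setup are illustrative assumptions beyond the stated 6 hidden layers of 1024 units and 1936 output nodes.

```python
import torch
import torch.nn as nn

class SpeechRecognitionNet(nn.Module):
    """DNN acoustic model: 6 hidden layers of 1024 units, 1936 output nodes.
    Its input is the feature produced by the feature extraction network."""
    def __init__(self, feat_dim=80, hidden=1024, n_out=1936, n_layers=6):
        super().__init__()
        layers, d = [], feat_dim
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(d, n_out)

    def forward(self, x):                 # x: (batch, feat_dim)
        return self.head(self.body(x))    # logits over modeling units

class DomainClassifier(nn.Module):
    """Fully connected domain classification layer with one hidden layer."""
    def __init__(self, feat_dim=80, hidden=256, n_domains=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_domains),
        )

    def forward(self, x):
        return self.net(x)                 # domain logits (source vs target)
```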
S404, performing combined training on the feature extraction network, the voice recognition network and the domain classification network according to the voice recognition result and the text label and according to the domain classification result and the domain label to obtain a trained feature extraction network and a trained voice recognition network as voice recognition models.
In one example, as shown in fig. 5, the domain classification network 32 includes a gradient inversion layer 321 and a domain classification layer 322. The training process includes forward propagation, in which the speech features extracted by the feature extraction network are input into the domain classification layer through the gradient inversion layer and the domain classification layer outputs the domain classification result; the signal propagates from the feature extraction network to the domain classification layer. It also includes back propagation, in which the error between the domain classification result output by the domain classification layer and the domain label is propagated back to the feature extraction network; the signal propagates from the domain classification layer to the feature extraction network.
During forward propagation, the gradient inversion layer passes the first speech feature and/or the second speech feature directly to the domain classification layer without any processing. During back propagation, the gradient inversion layer inverts the gradient of the domain classification layer and passes the inverted gradient to the feature extraction network, so the update direction of the feature extraction network is opposite to that of the domain classification network. In other words, the training goal of the domain classification network is to identify, as accurately as possible, the domain to which a speech sample belongs, while the training goal of the feature extraction network is to extract speech features from which the domain classification network cannot recognize the domain of the sample. The gradient inversion layer thus places the feature extraction network and the domain classification network in an adversarial relationship, so that the feature extraction network is finally trained to extract domain-invariant features and the speech recognition model remains robust in echo scenes.
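A minimal PyTorch sketch of such a gradient inversion layer is given below; the scaling coefficient lambd is an assumed hyperparameter, not a value specified in the application.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the
    backward pass, so the feature extraction network is updated in the
    direction opposite to the domain classification layer."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)                       # pass features through unchanged

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None     # invert (and scale) the gradient

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```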
Meanwhile, in the first branch, the speech recognition network and the feature extraction network are trained according to the speech recognition results and the text labels of the source domain speech samples, so that the extracted speech features are recognizable by the speech recognition network.
Of course, the source domain speech samples are aligned with their text labels to facilitate the training of the first branch, so that the training uses the speech recognition result of each source domain speech sample together with the text label corresponding to that sample.
In summary, according to the speech recognition result and the text label, and according to the domain classification result and the domain label, the feature extraction network, the speech recognition network and the domain classification network are jointly trained, and finally the trained feature extraction network and the trained speech recognition network are obtained as the speech recognition model.
Specifically, the loss function of the voice recognition network and the loss function of the domain classification network are weighted and summed to obtain a total loss function, and the feature extraction network, the voice recognition network and the domain classification network are jointly trained by minimizing the total loss function. And summing the total loss functions of multiple times of training to obtain a total cost function.
The cost function is:

E(θ_f, θ_y, θ_d) = (1/N) * Σ_{i=1..N} [ L_y^i(θ_f, θ_y) - λ * L_d^i(θ_f, θ_d) ]

wherein E(θ_f, θ_y, θ_d) represents the cost function, θ_f represents the parameters of the feature extraction network, θ_y represents the parameters of the speech recognition network, θ_d represents the parameters of the domain classification network, N represents the number of samples per training data block, L_y represents the loss function of the speech recognition network, L_d represents the loss function of the domain classification network, and λ is the weight of the domain classification loss. The aim is to minimize the total cost function, that is, to minimize the loss of the speech recognition network while maximizing the loss of the domain classification network, so that the features produced by the feature extraction network are maximally useful for speech recognition while the domain classification network cannot distinguish them. The optimization of the three sets of parameters can be decomposed into

θ̂_f = argmin over θ_f of E(θ_f, θ̂_y, θ̂_d)

θ̂_y = argmin over θ_y of E(θ̂_f, θ_y, θ̂_d)

θ̂_d = argmax over θ_d of E(θ̂_f, θ̂_y, θ_d)

The gradients are then computed and propagated back through the networks according to

θ_f <- θ_f - α * ( ∂L_y/∂θ_f - λ * ∂L_d/∂θ_f )

θ_y <- θ_y - α * ∂L_y/∂θ_y

θ_d <- θ_d - α * ∂L_d/∂θ_d

where α is the learning rate; the negative sign in front of the domain gradient in the first update formula represents the operation of the gradient inversion layer.
In an example, the loss function of the domain classification network and the loss function of the speech recognition network may be a cross entropy loss function, or a KL divergence loss function, or the domain label is a soft label, and the loss function of the domain classification network is a uniformly distributed function obtained based on the soft label.
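Putting these formulas together, a minimal PyTorch-style sketch of one joint training step is given below; it assumes cross-entropy losses for both branches, a fixed weight lam for the domain classification loss, and the grad_reverse helper sketched earlier; the batch composition and parameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def train_step(feat_net, asr_net, dom_net, optimizer,
               src_x, src_text_labels, src_dom_labels,
               tgt_x, tgt_dom_labels, lam=0.1):
    """One joint training step: ASR loss on source-domain data, domain loss on
    source + target data, with the domain gradient inverted (via grad_reverse)
    before it reaches the feature extraction network."""
    optimizer.zero_grad()

    src_feat = feat_net(src_x)        # first speech features (source domain)
    tgt_feat = feat_net(tgt_x)        # second speech features (target domain)

    # Speech recognition branch: source domain only, supervised by text labels.
    asr_loss = F.cross_entropy(asr_net(src_feat), src_text_labels)

    # Domain classification branch: both domains, through the gradient inversion layer.
    all_feat = torch.cat([src_feat, tgt_feat], dim=0)
    all_dom = torch.cat([src_dom_labels, tgt_dom_labels], dim=0)
    dom_loss = F.cross_entropy(dom_net(grad_reverse(all_feat, 1.0)), all_dom)

    # Weighted sum of the two losses; minimizing it trains the ASR branch while
    # the inverted gradient drives the feature extractor to fool the domain classifier.
    total_loss = asr_loss + lam * dom_loss
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```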
Finally, in step S405, the speech data to be recognized is input into the speech recognition model after training, and a speech recognition result is obtained.
In some embodiments, the training process of the speech recognition model is performed in the cloud server. As shown in fig. 6, the uploaded speech data is received and it is judged whether far-end speech exists in the speech data (the far-end speech being the audio played by the speaker of the speech recognition device). If not, the speech data is used as a source domain speech sample and is annotated with the domain label of the source domain and a text label to form the speech sample data of the source domain. If so, the speech data, and/or the speech data after echo cancellation processing, and/or the speech data after echo cancellation processing followed by residual echo suppression processing, is used as a target domain speech sample and is annotated with the domain label of the target domain to form the speech sample data of the target domain. The speech sample data of the source domain and the speech sample data of the target domain are input in batches to train the feature extraction network, the domain classification network and the speech recognition network, and the trained feature extraction network and the trained speech recognition network are used as the speech recognition model to recognize speech data in echo or non-echo scenes. In this method and device, the far-end speech and the echo estimated by the algorithm can be obtained, and additionally using the characteristics of these data when training the speech recognition model can further improve the recognition accuracy in the presence of residual distortion.
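A condensed sketch of this cloud-side data preparation is given below; the dictionary-based sample format and the helper functions (detect_far_end, aec, res_suppress, asr_transcribe) are hypothetical placeholders for the corresponding modules, not APIs defined in the application.

```python
def prepare_training_sample(mic_audio, far_end_audio,
                            detect_far_end, aec, res_suppress, asr_transcribe):
    """Route one uploaded recording to the source or target domain.

    No far-end speech -> source domain: add a text label plus the source domain label.
    Far-end speech    -> target domain: keep the raw audio and/or its AEC /
    residual-suppressed versions, with only the target domain label."""
    if not detect_far_end(far_end_audio):
        return {
            "domain": "source",
            "audio": mic_audio,
            "text": asr_transcribe(mic_audio),   # text label (manual or ASR-based)
        }
    aec_out = aec(mic_audio, far_end_audio)
    return {
        "domain": "target",
        "audio": [mic_audio, aec_out, res_suppress(aec_out)],  # echo-bearing variants
        "text": None,                            # target domain carries no text label
    }
```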
In addition, the voice recognition method only models the residual distortion of the echo cancellation, and is not limited to a specific algorithm, and the echo cancellation algorithm is linear processing, so that the voice recognition model trained by the method has good generalization capability on the echo cancellation algorithm.
In other embodiments, the training process of the speech recognition model may be performed in the speech recognition device. As shown in fig. 7, the speech recognition device collects speech data through the sound pickup assembly and judges whether far-end speech exists in the speech data (the far-end speech being the audio played by the speaker of the speech recognition device). If not, the speech data is used as a source domain speech sample; if so, the speech data, and/or the speech data after echo cancellation processing, and/or the speech data after echo cancellation processing followed by residual echo suppression processing, is used as a target domain speech sample. It is then judged whether the speech recognition device needs end-side self-learning (that is, whether the speech recognition model needs training). If so, the source domain speech samples are recognized by a preset acoustic model to obtain the corresponding text labels and are automatically annotated with the domain label of the source domain, forming the speech sample data of the source domain; the target domain speech samples are automatically annotated with the domain label of the target domain, forming the speech sample data of the target domain. The speech sample data of the source domain and the speech sample data of the target domain are input in batches to train the feature extraction network, the domain classification network and the speech recognition network, and the trained feature extraction network and the trained speech recognition network are used as the speech recognition model to recognize speech data in echo or non-echo scenes.
With such end-side self-learning, on the one hand, there is no need to collect data offline, update the model and then push it to the speech recognition device; instead, data is collected on the device side according to this design and the user is not required to label it, so iterative optimization of the algorithm can be completed automatically for problems of different degrees, personalized optimization can be performed for the different device states of different users, and the robustness of the device's speech recognition in echo scenes is improved. On the other hand, because no offline manual labeling or uploading of user data to the cloud is involved, the training process can be completed automatically and periodically on the user equipment, ensuring that the user's personal data never leaves the user equipment and meeting the user's privacy requirements.
It can be understood that there are various implementations of judging whether far-end speech exists in the speech data, such as an energy-based method. Since the far-end speech is a pure echo reference signal, only silence and non-silence need to be distinguished. An optional criterion is Pr > delta, in which case it can be assumed that a far-end speech signal is present, where Pr is a short-term estimate of the far-end speech energy, typically chosen as Pr(t) = (1 - alpha) * Pr(t-1) + alpha * |x(t)|^2, where Pr(t) is the energy estimate at time t, Pr(t-1) is the estimate at time t-1, alpha is a constant, typically 0.95, x(t) is the instantaneous amplitude of the far-end speech at time t, and delta is a threshold (i.e. the first preset threshold), for example 1e-5.
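The energy criterion above can be sketched as follows; processing the far-end reference signal sample by sample is an illustrative choice.

```python
def has_far_end_speech(far_end, alpha=0.95, delta=1e-5):
    """Recursive short-term energy estimate of the far-end (reference) signal:
    Pr(t) = (1 - alpha) * Pr(t-1) + alpha * |x(t)|^2.
    Returns True as soon as the energy exceeds the first preset threshold delta."""
    pr = 0.0
    for x_t in far_end:
        pr = (1.0 - alpha) * pr + alpha * (x_t ** 2)
        if pr > delta:
            return True          # far-end speech present -> echo scene
    return False                 # only silence -> non-echo scene
```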
In one example, the way to judge whether the speech recognition device needs end-side self-learning is as follows: judge whether the voice sample data of the source domain and the voice sample data of the target domain are greater than or equal to a preset number, and whether the difference between the confidence of the voice recognition results output by the voice recognition network for the source domain voice samples and the confidence of the voice recognition results output for the target domain voice samples is greater than or equal to a second preset threshold; if so, it is judged that the speech recognition device needs end-side self-learning.
For example, end-side self-learning is triggered at a fixed interval, such as once a week, once enough training samples have been collected, for example about 1000 samples (i.e. the preset number), roughly 1 hour of data, for the echo scene and the non-echo scene respectively, and the source domain data and the target domain data show a significantly different performance on the existing recognition system. This difference is measured by the confidence of the speech recognition system: the confidences of the source domain data and the target domain data differ significantly when |Cs - Ct| >= delta, where Cs is the average confidence of the source domain data, Ct is the average confidence of the target domain data, and delta is a threshold (i.e. the second preset threshold), for example 0.3. Cs and Ct are standard statistics of the existing speech recognition system, used as indices for measuring speech recognition quality.
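A minimal sketch of this trigger check is given below; the function name and the way confidences are averaged are illustrative assumptions.

```python
def need_end_side_self_learning(src_samples, tgt_samples,
                                src_confidences, tgt_confidences,
                                min_samples=1000, delta=0.3):
    """Decide whether the device should start end-side self-learning: enough
    samples collected in both domains AND the average recognition confidence
    differs significantly between the source and target domains."""
    if len(src_samples) < min_samples or len(tgt_samples) < min_samples:
        return False
    c_s = sum(src_confidences) / len(src_confidences)   # average source-domain confidence
    c_t = sum(tgt_confidences) / len(tgt_confidences)   # average target-domain confidence
    return abs(c_s - c_t) >= delta                      # second preset threshold
```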
In some examples, the above-mentioned target domain is a plurality of target domains determined based on the signal-to-echo energy ratio, for example: target domain one, signal-to-echo ratio -20 dB; target domain two, -15 dB; target domain three, -10 dB; target domain four, -5 dB; target domain five, 0 dB; target domain six, 5 dB; target domain seven, 10 dB; and so on, with the corresponding domain classification network performing a multi-domain classification task. On the one hand, modelling the target domain as multiple domains according to the echo residual level further improves, compared with two-domain modelling, the invariance of the features extracted by the feature extraction network, which further improves speech recognition performance and widens the applicability of the method; on the other hand, because the echo cancellation residual and the output of the echo suppression module are not modelled separately, the target domain can be designed as the output of the echo suppression module, so the echo control module can be designed flexibly.
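For illustration, one way to assign a measured signal-to-echo energy ratio to one of the enumerated target domains is a nearest-level mapping, sketched below; the nearest-level rule itself is an assumption of this example, since the application only enumerates the discrete levels.

```python
def target_domain_index(signal_to_echo_db,
                        levels=(-20, -15, -10, -5, 0, 5, 10)):
    """Assign a sample to the target domain whose nominal signal-to-echo
    energy ratio (in dB) is closest to the measured value."""
    return min(range(len(levels)),
               key=lambda i: abs(levels[i] - signal_to_echo_db)) + 1

print(target_domain_index(-12))   # 3 -> the -10 dB target domain
print(target_domain_index(9))     # 7 -> the 10 dB target domain
```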
In addition, because the method does not make use of channel information, the voice recognition method provided by the application can be applied to single-channel echo cancellation algorithms, and also to stereo and multi-channel echo cancellation algorithms, where channel correlation introduces additional echo-cancellation distortion.
Besides the far-end voice data, the echo data can further comprise one or more of the audio data processed by the echo cancellation module, the echo estimation data of the echo cancellation module, the residual echo data of the echo suppression module, and the echo path data estimated by the echo cancellation module, so that the performance of the algorithm is further improved, more scenes can be covered, and the robustness of the trained voice recognition model is improved.
In some embodiments, the trained speech recognition model may be applied to a speech recognition system, and the speech recognition result output is the speech content (i.e. the corresponding text content) of the speech data to be recognized. As shown in fig. 8, the speech recognition system includes a front-end processing module, a decoder (comprising the speech recognition model, a pronunciation dictionary and a language model), and a text output module. The speech to be recognized (which in this application can be far-field or near-field speech) is input, the front-end processing module performs preprocessing such as noise reduction and feature extraction and passes the feature data to the decoder, the decoder computes the text sequence that best matches the features according to resources such as the trained speech recognition model, the pronunciation dictionary and the language model, and the text sequence is output by the text output module.
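As a purely illustrative sketch of the front-end feature extraction step (the actual front-end, decoder and resources are not specified in code form by this application), the following snippet frames a waveform and emits one log-energy feature per frame; the frame length, hop and feature type are assumptions of the example.

```python
import numpy as np

def toy_front_end(waveform, frame_len=400, hop=160):
    """Toy stand-in for the front-end: split the signal into 25 ms frames
    (at 16 kHz) with a 10 ms hop and emit one log-energy value per frame."""
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    return np.array([np.log(np.sum(f ** 2) + 1e-10) for f in frames])

features = toy_front_end(np.random.randn(16000))   # 1 s of audio at 16 kHz
print(features.shape)                              # (98,) one value per frame
```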
In other embodiments, the trained speech recognition model can be applied to a voice wake-up system, a voiceprint recognition system, and other systems involving speech recognition. The application of the speech recognition model to a voice wake-up system is described below as an example.
Fig. 9 is an architecture diagram of the voice wake-up system according to the present embodiment. As shown in fig. 9, the voice wake-up system includes a front-end processing module, the voice recognition model, a post-processing module, and a wake-up result module. The voice input module converts the acoustic signal of the input speech into an electrical signal and passes it to the front-end processing module; the front-end processing module preprocesses the electrical signal, for example by echo cancellation and denoising, and passes the preprocessed data to the voice recognition model. The voice recognition model, which can adapt to complex acoustic environments (such as echo environments) and to data processed by various front-end modules (such as echo cancellation), recognizes the preprocessed data and converts it into the modelling units defined by the wake-up word, and passes the result to the post-processing module, which converts the output probabilities of the voice recognition model into a wake-up confidence score; finally, the wake-up result module outputs a wake-up result according to the confidence score, and whether the device is woken up is controlled according to the wake-up result.
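For illustration only, the following sketch shows one way the post-processing module might turn per-frame output probabilities of the wake-up word's modelling units into a wake-up confidence score and a decision; the geometric-mean fusion and the 0.8 threshold are assumptions of this example, not prescribed by the application.

```python
import numpy as np

def wakeup_decision(unit_posteriors, threshold=0.8):
    """Toy post-processing: fuse the per-frame posteriors of the wake-up
    word's modelling units into one confidence score (geometric mean of
    each unit's best frame) and compare it with a threshold."""
    peaks = unit_posteriors.max(axis=0)                      # best frame per unit
    confidence = float(np.exp(np.mean(np.log(peaks + 1e-10))))
    return confidence, bool(confidence >= threshold)

posteriors = np.random.rand(50, 4)      # 50 frames x 4 wake-word units
score, wake = wakeup_decision(posteriors)
print(score, wake)
```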
An embodiment of the present application further provides a speech recognition apparatus. As shown in fig. 10, the speech recognition apparatus 100 at least includes:
the acquiring module 101 is configured to acquire voice sample data of a source domain and voice sample data of a target domain, where the voice sample data of the source domain includes a source domain voice sample, a text label of the source domain voice sample, and a domain label of the source domain, and the voice sample data of the target domain includes a target domain voice sample and a domain label of the target domain; wherein the partial distribution characteristics of the source domain can be migrated and learned to a target domain, the source domain voice samples do not include echo data, and the target domain voice samples include echo data;
the extraction module 103 is configured to extract, based on the feature extraction network, the features of the source domain voice sample to obtain a first voice feature, and the features of the target domain voice sample to obtain a second voice feature;
the training module 104 is configured to input the first speech feature as a sample feature into a speech recognition network to obtain a speech recognition result; inputting the first voice characteristic and the second voice characteristic into a domain classification network as sample characteristics to obtain a domain classification result; performing combined training on the feature extraction network, the voice recognition network and the domain classification network according to the voice recognition result and the text label and the domain classification result and the domain label to obtain a trained feature extraction network and a trained voice recognition network as voice recognition models;
and the recognition module 105 is configured to input the voice data to be recognized into the trained voice recognition model to obtain a voice recognition result.
In one possible implementation, the domain classification network includes a gradient inversion layer and a domain classification layer; the gradient inversion layer places the feature extraction network and the domain classification layer in an adversarial relationship, the source domain voice samples with the domain label of the source domain and the target domain voice samples with the domain label of the target domain are respectively input into the feature extraction network, and the feature extraction network and the domain classification layer are trained.
In another possible implementation, the training the feature extraction network and the domain classification layer includes: in the forward propagation training process, the voice features extracted by the feature extraction network are input into the domain classification layer through a gradient inversion layer, and the domain classification layer updates the parameters of the domain classification layer according to the domain classification result and the domain label; and in the back propagation training process, the gradient inversion layer inverts the gradient of the domain classification layer and then transmits the inverted gradient to the feature extraction network so as to update the parameters of the feature extraction network.
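As an illustrative sketch of the gradient inversion layer and domain classification layer described above, assuming a PyTorch implementation (the application does not prescribe a framework), the snippet below passes features through an identity forward function whose backward pass negates and scales the gradient; the layer sizes and the lambda coefficient are assumptions of the example.

```python
import torch
from torch import nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in
    the backward pass, so the feature extraction network is pushed to make
    the domains indistinguishable while the domain classifier separates them."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class DomainClassifier(nn.Module):
    def __init__(self, feat_dim=256, num_domains=2, lam=1.0):
        super().__init__()
        self.lam = lam
        self.layers = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, num_domains))

    def forward(self, features):
        return self.layers(GradientReversal.apply(features, self.lam))


# Toy usage: gradients reaching "features" are reversed during backward.
features = torch.randn(8, 256, requires_grad=True)   # feature extractor output
logits = DomainClassifier()(features)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()
```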
In another possible implementation, the training module 104 is specifically configured to: carrying out weighted summation on the loss function of the voice recognition network and the loss function of the domain classification network to obtain a total loss function; and performing joint training on the feature extraction network, the voice recognition network and the domain classification network by minimizing a total loss function.
In another possible implementation, the loss function of the domain classification network is a cross-entropy loss function, or a KL divergence loss function; or the domain label comprises a soft label, and the loss function of the domain classification network is a uniform distribution function obtained based on the soft label.
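The weighted summation of the two losses can be sketched as follows, again assuming PyTorch; the framewise cross-entropy stands in for whatever recognition loss the speech recognition network actually uses (for example, a sequence loss), and the 0.1 weight is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def total_loss(asr_logits, text_targets, domain_logits, domain_targets,
               domain_weight=0.1):
    """Weighted sum of the recognition loss and the domain-classification
    loss; minimising this single value jointly trains the feature
    extraction, speech recognition and domain classification networks."""
    asr_loss = F.cross_entropy(asr_logits, text_targets)          # recognition loss
    domain_loss = F.cross_entropy(domain_logits, domain_targets)  # domain loss
    return asr_loss + domain_weight * domain_loss

# Toy usage with framewise targets.
asr_logits = torch.randn(8, 30)         # 8 frames, 30 output units
text_targets = torch.randint(0, 30, (8,))
domain_logits = torch.randn(8, 2)       # 8 samples, 2 domains
domain_targets = torch.randint(0, 2, (8,))
print(total_loss(asr_logits, text_targets, domain_logits, domain_targets))
```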
In another possible implementation, the obtaining module 101 is specifically configured to: collect voice data, and judge whether the voice data includes far-end voice data; if so, take the voice data as a target domain voice sample and label it with the domain label of the target domain; if not, take the voice data as a source domain voice sample and label it with the text label and the domain label of the source domain.
In another possible implementation, the determining whether the voice data includes far-end voice data includes: judging whether the far-end voice energy in the voice data is greater than a first preset threshold; if so, the voice data includes far-end voice data; if not, it does not.
In another possible implementation, the apparatus further includes: the judging module 102 is configured to judge whether the voice sample data of the source domain and the voice sample data of the target domain are greater than or equal to a preset number, and whether the difference between the confidence of the voice recognition result output by the voice recognition network for the source domain voice samples and the confidence of the result output for the target domain voice samples is greater than or equal to a second preset threshold; if so, the step of extracting, based on the feature extraction network, the features of the source domain voice sample to obtain the first voice feature is performed.
In another possible implementation, the target domain includes a plurality of target domains determined based on signal-to-echo energy ratios.
In another possible implementation, the echo data includes at least: one or more of audio data processed by the echo cancellation module, far-end voice data, echo estimation data of the echo cancellation module, residual echo data of the echo suppression module, and echo path data estimated by the echo cancellation module.
In another possible implementation, the training process of the speech recognition model is completed in a cloud server, and the text labels of the source domain speech samples are obtained based on manual labeling.
In another possible implementation, the training process of the speech recognition model is completed at the terminal device; the acquiring of the voice sample data of the source domain and the voice sample data of the target domain comprises: the terminal equipment collects voice data and judges whether the voice data comprises far-end voice data, if so, the voice data is used as a target domain voice sample and is labeled with a domain label of a target domain, and if not, the voice data is used as a source domain voice sample and is labeled with a domain label of a source domain; and the text label of the source domain voice sample is obtained based on the recognition of a preset acoustic model in the terminal equipment.
In another possible implementation, the voice recognition result is the voice content of the voice data to be recognized, or whether the voice data to be recognized contains a wakeup word, or whether the voiceprint feature of the voice data to be recognized matches a preset voiceprint feature.
The speech recognition apparatus 100 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module in the speech recognition apparatus 100 are respectively for implementing corresponding processes of each method in fig. 3 to 9, and are not described herein again for brevity.
It should be noted that the above-described embodiments are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform any of the methods described above.
The present application also provides a computer program or computer program product comprising instructions which, when executed, cause a computer to perform any of the methods described above.
The application also provides a voice recognition device, which comprises a memory and a processor, wherein the memory stores executable codes, and the processor executes the executable codes to realize any method.
Fig. 11 is a schematic structural diagram of a speech recognition device provided in the present application.
As shown in fig. 11, the speech recognition device 1100 includes a processor 1101, a memory 1102, a bus 1103, a microphone 1104, a speaker 1105, and a communication interface 1106. The processor 1101, the memory 1102, the microphone 1104, the speaker 1105, and the communication interface 1106 communicate via the bus 1103, or may communicate via other means such as wireless transmission. The microphone 1104 may pick up voice data, the speaker 1105 may play voice data, the communication interface may be used for communication with other communication devices, the memory 1102 may store executable program code, and the processor 1101 may call the program code stored in the memory 1102 to perform the voice recognition method in the aforementioned method embodiments.
It should be understood that in the embodiments of the present application, the processor 1101 may be a central processing unit CPU, and the processor 1101 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 1102 may include both read-only memory and random access memory, and provides instructions and data to the processor 1101. Memory 1102 may also include non-volatile random access memory. For example, memory 1102 may also store a training data set.
The memory 1102 may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and direct bus RAM (DR RAM).
The bus 1103 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of description, however, the various buses are all labelled as bus 1103 in the figure.
It should be understood that the speech recognition device 1100 according to the embodiment of the present application may correspond to the speech recognition apparatus in the embodiment of the present application, and may correspond to the subject executing the methods shown in fig. 3 to 9; the above and other operations and/or functions of each component in the speech recognition device 1100 are respectively for implementing the corresponding processes of the methods in fig. 3 to 9, and are not described herein again for brevity.
It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments, objects, technical solutions and advantages of the present application are described in further detail, it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present application, and are not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present application should be included in the scope of the present application.

Claims (21)

1. A speech recognition method, comprising:
acquiring voice sample data of a source domain and voice sample data of a target domain, wherein the voice sample data of the source domain comprises a voice sample of the source domain, a text label of the voice sample of the source domain and a domain label of the source domain, and the voice sample data of the target domain comprises a voice sample of the target domain and a domain label of the target domain; wherein the partial distribution characteristics of the source domain can be migrated and learned to a target domain, the source domain voice samples do not include echo data, and the target domain voice samples include echo data;
extracting the characteristics of the source domain voice sample to obtain first voice characteristics and extracting the characteristics of the target domain voice sample to obtain second voice characteristics on the basis of the characteristic extraction network;
inputting the first voice characteristic as a sample characteristic into a voice recognition network to obtain a voice recognition result; inputting the first voice characteristic and the second voice characteristic into a domain classification network as sample characteristics to obtain a domain classification result;
performing combined training on the feature extraction network, the voice recognition network and the domain classification network according to the voice recognition result and the text label and the domain classification result and the domain label to obtain a trained feature extraction network and a trained voice recognition network as voice recognition models;
and inputting the voice data to be recognized into the trained voice recognition model to obtain a voice recognition result.
2. The method of claim 1, wherein the domain classification network comprises a gradient inversion layer and a domain classification layer;
and the gradient inversion layer enables the feature extraction network and the domain classification layer to form a confrontation relation, domain labels of a source domain voice sample and a source domain, domain labels of a target domain voice sample and a target domain are respectively input into the feature extraction network, and the feature extraction network and the domain classification layer are trained.
3. The method of claim 2, wherein training the feature extraction network and domain classification layer comprises:
in the forward propagation training process, the voice features extracted by the feature extraction network are input into the domain classification layer through a gradient inversion layer, and the domain classification layer updates the parameters of the domain classification layer according to the domain classification result and the domain label;
and in the back propagation training process, the gradient inversion layer inverts the gradient of the domain classification layer and then transmits the inverted gradient to the feature extraction network so as to update the parameters of the feature extraction network.
4. The method according to any one of claims 1-3, wherein the jointly training the feature extraction network, the speech recognition network, and the domain classification network comprises:
carrying out weighted summation on the loss function of the voice recognition network and the loss function of the domain classification network to obtain a total loss function;
and performing joint training on the feature extraction network, the voice recognition network and the domain classification network by minimizing a total loss function.
5. The method according to claim 4, wherein the loss function of the domain classification network is a cross entropy loss function, or a KL divergence loss function;
or the domain label comprises a soft label, and the loss function of the domain classification network is a uniform distribution function obtained based on the soft label.
6. The method according to any one of claims 1-5, wherein said obtaining voice sample data of a source domain and voice sample data of a target domain comprises:
collecting voice data, and judging whether the voice data comprises remote voice data;
if so, taking the voice data as a target domain voice sample, and labeling a domain label of the target domain;
if not, the voice data is used as a source domain voice sample, and the text label and the domain label of the source domain are labeled.
7. The method of claim 6, wherein the determining whether the voice data comprises far-end voice data comprises:
judging whether the far-end voice energy in the voice data is larger than a first preset threshold value or not;
if yes, the remote voice data is included;
if not, the remote voice data is not included.
8. The method according to any one of claims 1-7, wherein the extracting the feature of the source domain voice sample based on the feature extraction network to obtain the first voice feature further comprises:
judging whether the voice sample data of the source domain and the voice sample data of the target domain are greater than or equal to a preset number, and judging whether the difference between the confidence coefficient of the voice recognition result output by the voice recognition network for the voice sample of the source domain and the confidence coefficient of the voice recognition result output by the voice recognition network for the voice sample of the target domain is greater than or equal to a second preset threshold;
and if so, executing the feature extraction network based on the features to extract the features of the source domain voice sample to obtain a first voice feature.
9. The method of any of claims 1-8, wherein the target domain comprises a plurality of target domains determined based on signal-to-echo energy ratios.
10. The method according to any of claims 1-9, wherein the echo data comprises at least: one or more of audio data processed by the echo cancellation module, far-end voice data, echo estimation data of the echo cancellation module, residual echo data of the echo suppression module, and echo path data estimated by the echo cancellation module.
11. The method according to any one of claims 1 to 10, wherein the training process of the speech recognition model is completed in a cloud server, and the text labels of the source domain speech samples are obtained based on manual labeling.
12. The method according to any one of claims 1-10, wherein the training process of the speech recognition model is performed at a terminal device;
the obtaining of the voice sample data of the source domain and the voice sample data of the target domain includes:
the terminal equipment acquires voice data and judges whether the voice data comprise far-end voice data, if so, the voice data are used as a target domain voice sample and are labeled with a domain label of a target domain, and if not, the voice data are used as a source domain voice sample and are labeled with a domain label of a source domain;
and the text label of the source domain voice sample is obtained based on the recognition of a preset acoustic model in the terminal equipment.
13. The method according to any one of claims 1 to 12, wherein the voice recognition result is a voice content of the voice data to be recognized, or whether the voice data to be recognized contains a wakeup word, or whether a voiceprint feature of the voice data to be recognized matches a preset voiceprint feature.
14. A speech recognition apparatus, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring voice sample data of a source domain and voice sample data of a target domain, the voice sample data of the source domain comprises a voice sample of the source domain, a text label of the voice sample of the source domain and a domain label of the source domain, and the voice sample data of the target domain comprises a voice sample of the target domain and a domain label of the target domain; wherein, the partial distribution characteristics of the source domain can be migrated and learned to a target domain, the source domain voice sample does not comprise echo data, and the target domain voice sample comprises echo data;
the extraction module is used for extracting the characteristics of the source domain voice sample to obtain first voice characteristics and extracting the characteristics of the target domain voice sample to obtain second voice characteristics based on the characteristic extraction network;
the training module is used for inputting the first voice characteristic serving as a sample characteristic into a voice recognition network to obtain a voice recognition result; inputting the first voice characteristic and the second voice characteristic into a domain classification network as sample characteristics to obtain a domain classification result; performing combined training on the feature extraction network, the voice recognition network and the domain classification network according to the voice recognition result and the text label and the domain classification result and the domain label to obtain a trained feature extraction network and a trained voice recognition network as voice recognition models;
and the recognition module is used for inputting the voice data to be recognized into the trained voice recognition model to obtain a voice recognition result.
15. The apparatus of claim 14, wherein the domain classification network comprises a gradient inversion layer and a domain classification layer;
and the gradient inversion layer enables the feature extraction network and the domain classification layer to form a confrontation relation, domain labels of a source domain voice sample and a source domain, domain labels of a target domain voice sample and a target domain are respectively input into the feature extraction network, and the feature extraction network and the domain classification layer are trained.
16. The apparatus of claim 15, wherein the training the feature extraction network and domain classification layer comprises:
in the forward propagation training process, the voice features extracted by the feature extraction network are input into the domain classification layer through a gradient inversion layer, and the domain classification layer updates the parameters of the domain classification layer according to the domain classification result and the domain label;
and in the back propagation training process, the gradient inversion layer inverts the gradient of the domain classification layer and then transmits the inverted gradient to the feature extraction network so as to update the parameters of the feature extraction network.
17. The apparatus according to any one of claims 14-16, wherein the training module is specifically configured to:
carrying out weighted summation on the loss function of the voice recognition network and the loss function of the domain classification network to obtain a total loss function;
and performing joint training on the feature extraction network, the voice recognition network and the domain classification network by minimizing a total loss function.
18. The apparatus according to any one of claims 14 to 17, wherein the obtaining module is specifically configured to:
collecting voice data, and judging whether the voice data comprises remote voice data;
if so, taking the voice data as a voice sample of a target domain, and labeling a domain label of the target domain;
if not, the voice data is used as a source domain voice sample, and the text label and the domain label of the source domain are labeled.
19. The apparatus of any of claims 14-18, further comprising:
the judging module is used for judging whether the voice sample data of the source domain and the voice sample data of the target domain are greater than or equal to a preset number, and the difference between the confidence coefficient of the voice recognition result output by the voice recognition network for the voice sample of the source domain and the confidence coefficient of the voice recognition result output by the voice recognition network for the voice sample of the target domain is greater than or equal to a second preset threshold;
and if so, executing the feature extraction network based on the features to extract the features of the source domain voice sample to obtain a first voice feature.
20. A speech recognition device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor executes the executable code to implement the method of any one of claims 1-13.
21. A computer-readable storage medium, on which a computer program is stored, which, when the computer program is executed in a computer, causes the computer to carry out the method of any one of claims 1-13.
CN202011524726.7A 2020-12-22 2020-12-22 Voice recognition method, device, equipment and storage medium Pending CN114664288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011524726.7A CN114664288A (en) 2020-12-22 2020-12-22 Voice recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011524726.7A CN114664288A (en) 2020-12-22 2020-12-22 Voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114664288A true CN114664288A (en) 2022-06-24

Family

ID=82024803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011524726.7A Pending CN114664288A (en) 2020-12-22 2020-12-22 Voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114664288A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116072119A (en) * 2023-03-31 2023-05-05 北京华录高诚科技有限公司 Voice control system, method, electronic equipment and medium for emergency command

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110288978B (en) Speech recognition model training method and device
CN110600018B (en) Voice recognition method and device and neural network training method and device
US10515626B2 (en) Adaptive audio enhancement for multichannel speech recognition
Haeb-Umbach et al. Far-field automatic speech recognition
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
US9697826B2 (en) Processing multi-channel audio waveforms
WO2021135577A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
US11158333B2 (en) Multi-stream target-speech detection and channel fusion
JP2021516369A (en) Mixed speech recognition method, device and computer readable storage medium
CN107799126A (en) Sound end detecting method and device based on Supervised machine learning
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN111883135A (en) Voice transcription method and device and electronic equipment
CN110867178B (en) Multi-channel far-field speech recognition method
CN113870893A (en) Multi-channel double-speaker separation method and system
CN114664288A (en) Voice recognition method, device, equipment and storage medium
KR20090116055A (en) Method for estimating noise mask using hidden markov model and apparatus for performing the same
Raj et al. Frustratingly easy noise-aware training of acoustic models
Pan et al. Application of hidden Markov models in speech command recognition
US20240038217A1 (en) Preprocessing Model Building System for Speech Recognition Function and Preprocessing Model Building Method Therefor
Samanta et al. An energy-efficient voice activity detector using reconfigurable Gaussian base normalization deep neural network
CN116705013A (en) Voice wake-up word detection method and device, storage medium and electronic equipment
CN115691473A (en) Voice endpoint detection method and device and storage medium
CN115376494A (en) Voice detection method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination