CN112908305A

CN112908305A - Method and equipment for improving accuracy of voice recognition

Info

Publication number: CN112908305A
Application number: CN202110132053.9A
Authority: CN
Inventors: 范红亮; 蒋莹; 李轶杰; 梁家恩
Original assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date: 2021-01-30
Filing date: 2021-01-30
Publication date: 2021-06-04
Anticipated expiration: 2041-01-30
Also published as: CN112908305B

Abstract

The invention relates to a method and equipment for improving speech recognition accuracy, which are applied to an ASR system provided with an SDM and used for speech recognition, wherein the ASR system is provided with a decoding network used for decoding; the method comprises the following steps: acquiring original audio input into an ASR system and historical decoding information output by a decoding network through the SDM; processing the original audio through the SDM to obtain a plurality of signal characteristics of the original audio; the final characteristics of the original audio are obtained by SDM processing based on a plurality of signal characteristics and historical decoding information. The SDM is added in the decoding stage of the ASR system, the information of each dimension, including signal characteristics directly obtained from audio, context information obtained from historical decoding information and the like, is fully utilized, and an original acoustic model trained through mass data in the ASR system is combined, so that the scoring and identifying capability of the ASR system on input voice in any complex scene can be improved, and the recognition rate is improved.

Description

Method and equipment for improving accuracy of voice recognition

Technical Field

The invention relates to the technical field of voice recognition, in particular to a method and equipment for improving voice recognition accuracy.

Background

The performance of an ASR (Automatic Speech Recognition) system is strongly affected by environmental factors. Complex scenes, such as loud environmental noise or input that deviates substantially from the training data, pose a major challenge to the recognition engine. In particular, acoustic scoring becomes inaccurate in such scenes, which directly degrades the accuracy of the final recognition result.

One of the most common recognition errors of ASR systems in complex scenes is the insertion error caused by background noise (environmental noise, background human voices, etc.). Owing to the limitations of model structure and training data, speech and non-speech cannot be reliably distinguished in many complex scenes, so background non-speech is mistakenly recognized as speech and produces redundant recognition results, i.e., insertion errors.

In order to cope with high insertion errors in complex scenes, the common current practice is to place a VAD (Voice Activity Detection) module at the front end of the ASR engine to separate human voice from non-human voice, and then to send only the human-voice segments to the engine for recognition. However, this approach has evident disadvantages:

1. VAD is not a standard component of ASR systems; many ASR systems have no VAD module.

2. Even when VAD is used to extract the voice segments, those segments are not necessarily well suited to recognition: on the one hand, VAD does not judge human voice accurately; on the other hand, ASR recognition needs context information, and even non-speech audio is useful for recognition.

3. VAD cannot distinguish the target voice from background vocal interference (e.g., television sound in the background).

Thus, there is a need for a better solution to the problems of the prior art.

Disclosure of Invention

The invention provides a method and equipment for improving speech recognition accuracy, which can solve the technical problem of low recognition rate in the prior art.

The technical scheme for solving the technical problems is as follows:

the embodiment of the invention provides a method for improving the accuracy of speech recognition, which is applied to an ASR system provided with an SDM and used for speech recognition, wherein the ASR system is provided with a decoding network used for decoding; the method comprises the following steps:

acquiring original audio input into the ASR system and historical decoding information output by the decoding network through the SDM;

processing the original audio through the SDM to obtain a plurality of signal characteristics of the original audio;

and processing by the SDM based on the plurality of signal features and the historical decoding information to obtain final features of the original audio.

In a specific embodiment, the method further comprises the following steps:

and outputting the final features to the decoding network through the SDM, so that the decoding network decodes them to obtain the recognized text.

In a specific embodiment, the signal characteristics include: signal-to-noise ratio, energy, zero crossing rate.

In a specific embodiment, the historical decoding information includes context information.

In a particular embodiment, an acoustic model is also included in the ASR system;

the acoustic score of the decoding network comprises: the score of the acoustic model and the score of the SDM; wherein the score of the acoustic model and the score of the SDM each correspond to a respective weight.

In a specific embodiment, the score of the SDM includes: a first score derived from signal features of the original audio, and a second score of the original audio derived based on the historical decoding information; the first score and the second score each correspond to a respective weight.

The embodiment of the invention also provides equipment for improving the accuracy of speech recognition, which is applied to an ASR system provided with an SDM and used for speech recognition, wherein the ASR system is provided with a decoding network used for decoding; the apparatus comprises:

the acquisition module is used for acquiring original audio input into the ASR system and historical decoding information output by the decoding network through the SDM;

the first processing module is used for processing the original audio through the SDM to obtain a plurality of signal characteristics of the original audio;

and the second processing module is used for processing, by the SDM, based on the plurality of signal characteristics and the historical decoding information to obtain the final characteristics of the original audio.

In a specific embodiment, the apparatus further comprises:

and the identification module is used for outputting the final characteristics to the decoding network through the SDM so as to enable the decoding network to decode to obtain an identification text.

The invention has the beneficial effects that:

the SDM is added in the decoding stage of the ASR system, the information of each dimension, including signal characteristics directly obtained from audio, context information obtained from historical decoding information and the like, is fully utilized, and an original acoustic model trained through mass data in the ASR system is combined, so that the scoring and identifying capability of the ASR system on input voice in any complex scene can be improved, and the recognition rate is improved. The accuracy of acoustic scoring can be improved, and the overall performance of the ASR system is improved.

Drawings

FIG. 1 is a block diagram of a prior art ASR system according to an embodiment of the present invention;

FIG. 2 is a block diagram of an ASR system applied in a method for improving speech recognition accuracy according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a method for improving speech recognition accuracy according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus for improving accuracy of speech recognition according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus for improving accuracy of speech recognition according to an embodiment of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

The method for improving speech recognition accuracy provided by the embodiment of the invention is applied to an ASR system that is provided with an SDM (Speech Detection Module) and used for speech recognition, wherein the ASR system is provided with a decoding network for decoding.

FIG. 1 is a schematic diagram of a prior-art ASR system. A conventional ASR system comprises a training stage and a decoding stage. In the training stage, an acoustic model (AM) is trained on a speech database using techniques such as deep neural networks, and a language model (LM) is trained on a text database using techniques such as n-grams and deep neural networks. In the decoding stage, the acoustic model, the language model, and the pronunciation dictionary obtained in the training stage form a decoding network; after feature extraction is performed on the input audio, a decoding algorithm finds the optimal path in the decoding network, which yields the final recognition result.

However, the high insertion-error rate in complex scenes arises mainly because the acoustic model, when computing the acoustic score, cannot accurately distinguish human voice from non-human voice and misjudges noise as human voice. Because the acoustic score used for decoding comes directly from the acoustic model, the judgment of human voice versus non-human voice also depends directly on the performance of the acoustic model.

FIG. 2 is a schematic diagram of the framework of the ASR system in the present scheme. An SDM (Speech Detection Module) is added to the speech recognition engine to dynamically detect the occurrence of speech, assist the engine in distinguishing human voice from non-human voice, and compensate for the deficiencies of the acoustic model's scoring, thereby improving the recognition accuracy of the ASR system.

As shown in FIG. 3, the method comprises the steps of:

Step 101, obtaining original audio input into the ASR system and historical decoding information output by the decoding network through the SDM; specifically, the historical decoding information includes context information.

Step 102, processing the original audio through the SDM to obtain a plurality of signal characteristics of the original audio; specifically, the signal characteristics include: signal to noise ratio, energy, zero crossing rate, etc.

Step 103, processing by the SDM based on the plurality of signal features and the historical decoding information to obtain final features of the original audio.
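The signal characteristics named in step 102 can be illustrated with a minimal sketch. The patent does not specify how each quantity is computed, so the function name `frame_features`, the mean-square energy definition, and the `noise_floor_energy` parameter (a running estimate of non-speech energy) are all assumptions made for illustration only.

```python
import math


def frame_features(frame, noise_floor_energy):
    """Return (snr_db, energy, zero_crossing_rate) for one frame of samples.

    Hypothetical helper: `noise_floor_energy` is an assumed running estimate
    of the mean energy of non-speech frames. The source text does not say
    how the signal-to-noise ratio is estimated.
    """
    n = len(frame)
    if n == 0:
        raise ValueError("frame must contain at least one sample")

    # Mean-square energy of the frame.
    energy = sum(x * x for x in frame) / n

    # Fraction of adjacent sample pairs whose signs differ.
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    zcr = crossings / (n - 1) if n > 1 else 0.0

    # Crude SNR estimate in dB relative to the assumed noise floor.
    eps = 1e-12  # avoids log(0) and division by zero
    snr_db = 10.0 * math.log10((energy + eps) / (noise_floor_energy + eps))

    return snr_db, energy, zcr


# Example: an alternating frame has maximal zero crossing rate.
snr_db, energy, zcr = frame_features([1.0, -1.0, 1.0, -1.0], 0.01)
```

In a real engine these values would typically be computed per short frame (for example 10 to 30 ms) and normalized before being combined with other information; that choice is likewise outside what the source states.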

Specifically, in this scheme, the SDM added to the decoding stage of the ASR system has two inputs: the input audio and the historical decoding information. It has one output: the module's features for the current input speech.

From an input audio clip A, the SDM directly obtains a group of features Feat_A, which characterize the input audio along multiple dimensions such as signal-to-noise ratio, energy, and zero crossing rate.

The history information already obtained by the decoding network can also be used as input. Together with the feature Feat_A above, it forms the speech detection module's judgment of the current input audio, and the feature Feat_B is output as the module's current output feature. The information obtained from the decoding network is already-determined context information and therefore has high reference value: the human-voice and non-human-voice characteristics of the current scene are extracted from the segments that the history has already identified as human voice or non-human voice.
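One possible way to combine Feat_A with the decoding history into Feat_B is sketched below. This is not the patent's stated algorithm: the representation of history as `(feature_vector, is_speech)` pairs, the use of mean vectors as scene prototypes, and the distance-based `speech_likeness` value are all hypothetical choices made to show the idea of deriving scene-specific human-voice and non-human-voice characteristics from frames the decoder has already labelled.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def feat_b(feat_a, history):
    """Extend Feat_A with a history-derived speech-likeness value in [0, 1].

    `history` is an assumed representation: a list of (feat_a, is_speech)
    pairs for earlier frames that the decoding network already labelled.
    """
    speech = [f for f, is_speech in history if is_speech]
    non_speech = [f for f, is_speech in history if not is_speech]

    # Without examples of both classes the history gives no evidence.
    if not speech or not non_speech:
        return list(feat_a) + [0.5]

    d_speech = squared_distance(feat_a, mean_vector(speech))
    d_non_speech = squared_distance(feat_a, mean_vector(non_speech))
    total = d_speech + d_non_speech

    # Closer to the speech prototype gives a value nearer 1.
    speech_likeness = 0.5 if total == 0 else d_non_speech / total
    return list(feat_a) + [speech_likeness]


# Example history: (snr_db, energy, zcr) vectors with decoder labels.
history = [
    ([20.0, 1.0, 0.1], True),
    ([22.0, 1.2, 0.1], True),
    ([0.0, 0.01, 0.5], False),
    ([2.0, 0.02, 0.5], False),
]
current = feat_b([21.0, 1.1, 0.1], history)
```

Because the raw features have very different scales, a practical version would normalize them before measuring distance; the sketch omits this to stay short.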

Therefore, compared with the score of the acoustic model (AM) alone, the output feature Feat_B of the SDM describes the current input audio along multiple dimensions. By jointly using the features of the original audio and the context features obtained from acoustic decoding, each source compensates for the other's shortcomings, yielding a more accurate judgment of human voice versus non-human voice in complex scenes. This improves the engine's scoring and discrimination capability in complex scenes, reduces insertion errors, and raises the recognition rate.

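Further, after step 103, as shown in FIG. 2, the method further includes: outputting the final features to the decoding network through the SDM, so that the decoding network decodes them to obtain the recognized text.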

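Further, the ASR system also includes an acoustic model; the acoustic score of the decoding network comprises the score of the acoustic model and the score of the SDM, each of which corresponds to a respective weight.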

Specifically, the score of the SDM includes: a first score derived from signal features of the original audio, and a second score of the original audio derived based on the historical decoding information; the first score and the second score each correspond to a respective weight.

Specifically, the acoustic score of the decoding network can be represented by the following formula:

$$s_{AM}' = w_{AM}\, s_{AM} + w_{SDM}\left(w_{SDM\_Audio}\, s_{SDM\_Audio} + w_{SDM\_History\_Dec}\, s_{SDM\_History\_Dec}\right)$$

Specifically, after the SDM is added, the acoustic score $s_{AM}'$ used for decoding consists of two parts. One part is the score $s_{AM}$ that comes directly from the acoustic model, with weight $w_{AM}$ in the final score. The other part is the score of the SDM, with weight $w_{SDM}$ in the final score. The score of the SDM is in turn derived from the following two components:

One component is the score $s_{SDM\_Audio}$ derived directly from the original audio information, obtained from signal characteristics such as the audio signal-to-noise ratio, energy, and zero crossing rate; its weight is $w_{SDM\_Audio}$.

The other component is the score $s_{SDM\_History\_Dec}$ of the current audio, derived from the historical decoding information; its weight is $w_{SDM\_History\_Dec}$. Already-decoded history is usually more reliable and is strongly indicative of the current speech characteristics (audio is continuous in time and stable over short intervals). Two features of the current recognition scene can be derived from the historical decoding information: a speech feature, Feat_Speech, and a non-speech feature, Feat_NonSpeech. The current audio is judged to be speech or non-speech according to which of the two features it is closer to.
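The weighted combination described above can be written as a short function. The default weight values below are illustrative assumptions only; the patent states that each score has a respective weight but gives no numeric values, and the function name `fused_acoustic_score` is hypothetical.

```python
def fused_acoustic_score(
    s_am,
    s_sdm_audio,
    s_sdm_history_dec,
    w_am=0.7,
    w_sdm=0.3,
    w_sdm_audio=0.5,
    w_sdm_history_dec=0.5,
):
    """Compute s_AM' = w_AM*s_AM + w_SDM*(w_audio*s_audio + w_hist*s_hist).

    All default weights are assumed values chosen for illustration; in
    practice they would be tuned on development data.
    """
    # Score of the SDM: weighted sum of its audio and history components.
    s_sdm = w_sdm_audio * s_sdm_audio + w_sdm_history_dec * s_sdm_history_dec

    # Final acoustic score used by the decoding network.
    return w_am * s_am + w_sdm * s_sdm


# Example: the acoustic model is unsure (0.5), the raw signal looks like
# noise (0.2), but decoding history suggests speech (0.8).
score = fused_acoustic_score(0.5, 0.2, 0.8)
```

Setting `w_sdm` to 0 recovers the conventional system in which the acoustic score comes from the acoustic model alone, which makes the contribution of the SDM easy to compare in experiments.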

The scheme is characterized in that a voice detection module is newly added in the decoding stage of the ASR system, information of all dimensions including signal characteristics directly obtained from audio and context information obtained from historical decoding information is fully utilized, and an acoustic model trained through mass data is combined, so that the scoring and identifying capability of the ASR system on input voice in any complex scene can be improved, and the recognition rate is improved. The accuracy of acoustic scoring can be improved, and the overall performance of the ASR system engine is further improved. According to the scheme, through the application of the multidimensional characteristics, more comprehensive and reasonable scoring can be performed on the input audio in any complex scene, the insertion errors caused by complex environments are reduced, and the identification accuracy of the system is improved.

Embodiment 2

Embodiment 2 of the present invention also discloses a device for improving accuracy of speech recognition, as shown in fig. 4, which is applied to an ASR system for speech recognition provided with SDM, the ASR system being provided with a decoding network for decoding; the apparatus comprises:

an obtaining module 201, configured to obtain, by the SDM, an original audio input to the ASR system and historical decoding information output by the decoding network;

a first processing module 202, configured to process the original audio through the SDM to obtain a plurality of signal features of the original audio;

a second processing module 203, configured to perform processing by the SDM based on the plurality of signal features and the historical decoding information to obtain a final feature of the original audio.

In a specific embodiment, as shown in FIG. 5, the apparatus further includes:

an identifying module 204, configured to output the final feature to the decoding network through the SDM, so that the decoding network decodes the final feature to obtain the recognized text.

In a specific embodiment, the signal characteristics include: signal to noise ratio, energy, zero crossing rate, etc.

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for improving speech recognition accuracy is applied to an ASR system provided with an SDM and used for speech recognition, wherein the ASR system is provided with a decoding network used for decoding; the method comprises the following steps:

2. The method of claim 1, further comprising:

3. The method of claim 1, wherein the signal features comprise: signal-to-noise ratio, energy, zero crossing rate.

4. The method of claim 1, wherein the historical decoding information comprises context information.

5. The method according to claim 1, further comprising an acoustic model in the ASR system;

6. The method of claim 5, wherein scoring the SDM comprises: a first score derived from signal features of the original audio, a second score of the original audio derived based on the historical decoding information; the first score and the second score each correspond to a respective weight.

7. The device for improving the speech recognition accuracy is applied to an ASR system which is provided with an SDM and used for speech recognition, wherein the ASR system is provided with a decoding network used for decoding; the apparatus comprises:

8. The apparatus of claim 7, further comprising:

9. The apparatus of claim 7, wherein the signal features comprise: signal-to-noise ratio, energy, zero crossing rate.

10. The apparatus of claim 7, wherein the historical decoding information comprises context information.