CN113345428B - Speech recognition model matching method, device, equipment and storage medium - Google Patents
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/08—Speech classification or search
Abstract
The application provides a matching method, apparatus, device, and storage medium for a speech recognition model. The method includes: if the accuracy with which the speech recognition model recognizes the current speech (i.e., speech collected in the current acquisition environment) is lower than an accuracy threshold, producing sample speech that conforms to the current acquisition environment; obtaining speech recognition results for the sample speech based on the speech recognition model, and determining a correction coefficient according to the accuracy of those results; correcting the audio features of the current speech with the correction coefficient to obtain corrected audio features; and decoding the corrected audio features with the speech recognition model to obtain the speech recognition result of the current speech. When the accuracy of the speech recognition model drops, this scheme completes the matching merely by determining a correction coefficient from sample speech, without retraining the model, which markedly improves matching efficiency.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for matching a speech recognition model.
Background
Speech recognition based on a speech recognition model mainly comprises two parts: extracting the audio features of the speech, and decoding those features with the speech recognition model to obtain a recognition result (usually the words corresponding to the speech). In practice, it often happens that the audio acquisition environment does not match the one the model was trained for, making the model's output inaccurate. For example, when a speech recognition model trained on audio samples recorded in an indoor environment is used to recognize audio captured outdoors, its accuracy drops.
When this occurs, the speech recognition model needs to be matched to the environment to restore its accuracy. The existing matching approach is generally to produce audio conforming to the specific acquisition environment as training samples and to retrain the speech recognition model, thereby improving its accuracy in that environment.
Retraining the speech recognition model takes a long time each time it is done, so the existing matching schemes are inefficient.
Disclosure of Invention
In view of the foregoing drawbacks of the prior art, the present invention provides a method, apparatus, device, and storage medium for matching a speech recognition model, so as to provide an efficient matching scheme for a speech recognition model.
The first aspect of the present application provides a method for matching a speech recognition model, including:
if the accuracy rate is lower than the accuracy rate threshold value when the voice recognition model recognizes the current voice, making sample voice conforming to the current acquisition environment; wherein, the current voice refers to the voice collected in the current collection environment;
obtaining a voice recognition result of the sample voice based on the voice recognition model, and determining a correction coefficient according to the accuracy of the voice recognition result of the sample voice;
correcting the audio characteristics of the current voice by using the correction coefficient to obtain corrected audio characteristics;
and decoding the corrected audio features by using the voice recognition model to obtain a voice recognition result of the current voice.
Optionally, the obtaining the speech recognition result of the sample speech based on the speech recognition model, and determining the correction coefficient according to the accuracy of the speech recognition result of the sample speech, includes:
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio characteristics of the sample voice by using the candidate coefficient to obtain corrected sample audio characteristics corresponding to the candidate coefficient;
respectively decoding the corrected sample audio features corresponding to each candidate coefficient by using the voice recognition model to obtain a voice recognition result of the sample voice corresponding to each candidate coefficient;
and selecting the recognition result of the sample voice with the highest accuracy from the voice recognition results of a plurality of sample voices, and determining the candidate coefficient corresponding to the recognition result of the sample voice with the highest accuracy as a correction coefficient.
Optionally, the correcting the audio feature of the current voice by using the correction coefficient to obtain a corrected audio feature includes:
and multiplying the correction coefficient with the audio feature of the current voice, and taking the obtained product as the corrected audio feature.
Optionally, the making a sample voice conforming to the current collection environment includes:
obtaining pre-recorded initial speech;
and adding noise information conforming to the current acquisition environment into the initial voice to obtain sample voice conforming to the current acquisition environment.
A second aspect of the present application provides a device for matching a speech recognition model, including:
the manufacturing unit is used for manufacturing sample voice conforming to the current acquisition environment if the accuracy rate of the voice recognition model for recognizing the current voice is lower than the accuracy rate threshold value; wherein, the current voice refers to the voice collected in the current collection environment;
a determining unit, configured to obtain a speech recognition result of the sample speech based on the speech recognition model, and determine a correction coefficient according to an accuracy of the speech recognition result of the sample speech;
the correction unit is used for correcting the audio characteristics of the current voice by using the correction coefficient to obtain corrected audio characteristics;
and the decoding unit is used for decoding the corrected audio characteristics by utilizing the voice recognition model to obtain a voice recognition result of the current voice.
Optionally, the determining unit obtains a speech recognition result of the sample speech based on the speech recognition model, and determines the correction coefficient according to an accuracy of the speech recognition result of the sample speech, where the determining unit is specifically configured to:
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio characteristics of the sample voice by using the candidate coefficient to obtain corrected sample audio characteristics corresponding to the candidate coefficient;
respectively decoding the corrected sample audio features corresponding to each candidate coefficient by using the voice recognition model to obtain a voice recognition result of the sample voice corresponding to each candidate coefficient;
and selecting the recognition result of the sample voice with the highest accuracy from the voice recognition results of a plurality of sample voices, and determining the candidate coefficient corresponding to the recognition result of the sample voice with the highest accuracy as a correction coefficient.
Optionally, the correction unit corrects the audio feature of the current voice by using the correction coefficient, and is specifically configured to:
and multiplying the correction coefficient with the audio feature of the current voice, and taking the obtained product as the corrected audio feature.
Optionally, when the making unit makes the sample voice conforming to the current collection environment, the making unit is specifically configured to:
obtaining pre-recorded initial speech;
and adding noise information conforming to the current acquisition environment into the initial voice to obtain sample voice conforming to the current acquisition environment.
A third aspect of the present application provides an electronic device comprising a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is configured to execute the computer program, and in particular, is configured to implement a method for matching a speech recognition model provided in any one of the first aspects of the present application.
A fourth aspect of the present application provides a computer storage medium for storing a computer program, which when executed is specifically configured to implement the method for matching a speech recognition model provided in any one of the first aspects of the present application.
The application provides a matching method, apparatus, device, and storage medium for a speech recognition model. The method includes: if the accuracy with which the speech recognition model recognizes the current speech (i.e., speech collected in the current acquisition environment) is lower than an accuracy threshold, producing sample speech that conforms to the current acquisition environment; obtaining speech recognition results for the sample speech based on the speech recognition model, and determining a correction coefficient according to the accuracy of those results; correcting the audio features of the current speech with the correction coefficient to obtain corrected audio features; and decoding the corrected audio features with the speech recognition model to obtain the speech recognition result of the current speech. When the accuracy of the speech recognition model drops, this scheme completes the matching merely by determining a correction coefficient from sample speech, without retraining the model, which markedly improves matching efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for matching a speech recognition model according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a matching device of a speech recognition model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the field of speech recognition, if the conditions a speech recognition model was trained for are inconsistent with the actual speech acquisition conditions, it is difficult to obtain good recognition results. In this case, it is usually necessary to recreate sample data matching the actual acquisition conditions and then retrain the speech recognition model to restore recognition accuracy, a process that requires significant time and computing resources. This patent proposes a new method that instead corrects the extracted audio features, so that the existing model is matched to the acquisition environment without being retrained.
That is, in the prior art, recreating samples takes a great deal of time, and retraining consumes further time and computing resources.
In the present scheme, applying a correction coefficient determined from the characteristics of the actual environment is enough to match the model to that environment. Only a small amount of sample data conforming to the actual conditions needs to be produced, and the speech recognition model does not need to be retrained, so the time spent matching the model is greatly reduced and matching efficiency is improved.
Referring to fig. 1, the method for matching a speech recognition model may include the following steps:
s101, if the accuracy rate of the voice recognition model for recognizing the current voice is lower than an accuracy rate threshold value, making sample voice conforming to the current acquisition environment.
Wherein, the current voice refers to the voice collected under the current collection environment.
It should be noted that the number of current voices in step S101 is generally plural. Specifically, one or more users (or testers) can be recorded multiple times in the current acquisition environment, yielding multiple pieces of current speech produced in that environment. The number of produced sample voices can likewise be plural.
The accuracy of the speech recognition model may be expressed as the proportion of the current voices that are recognized correctly. For example, if only 14 of 20 current voices are recognized correctly, the accuracy of the speech recognition model when recognizing speech in the current acquisition environment is 70%.
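The accuracy computation described above can be sketched as follows. This is an illustrative Python fragment; the function name `recognition_accuracy` and the exact-match comparison are assumptions, since the patent does not specify how a result is judged correct:

```python
def recognition_accuracy(results, references):
    """Fraction of utterances whose recognition result matches the reference text."""
    correct = sum(1 for hyp, ref in zip(results, references) if hyp == ref)
    return correct / len(results)

# 14 of 20 utterances recognized correctly -> accuracy 0.7
hyps = ["ok"] * 14 + ["bad"] * 6
refs = ["ok"] * 20
print(recognition_accuracy(hyps, refs))  # 0.7
```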
The speech recognition model of the present invention may be a time-delay neural network (TDNN) model.
The speech recognition model can be obtained through training the following training model processing flow:
each sample audio is first subjected to audio framing, typically a segment of audio is made up of a plurality of consecutive sample points, each L consecutive sample points can be divided into one audio frame when audio framing is performed, and each time an audio frame is divided, K sample points are moved backward from the first sample point of the audio frame, and consecutive L sample points after being taken again from the moved sample point are taken as another audio frame, and so on, whereby a segment of sample audio can be divided into a plurality of audio frames.
Generally, L may be set to 512 or to 400, and K may be set to 160, or adjusted to other integer values according to the actual situation. When L is 512 and K is 160, the framing above corresponds to taking every 512 sampling points as one audio frame and advancing 160 sampling points between frames.
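As a minimal sketch of the framing step with L = 512 and K = 160 (the helper name `frame_audio` is hypothetical):

```python
def frame_audio(samples, frame_len=512, hop=160):
    """Split a 1-D sequence of sampling points into overlapping frames of
    frame_len samples, advancing hop samples between frame starts."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

# One second of 16 kHz audio yields 97 frames of 512 points each.
frames = frame_audio(list(range(16000)))
```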
After obtaining a plurality of audio frames, feature vectors of each audio frame may be extracted by the following feature extraction procedure:
for each audio frame, pre-emphasis is performed according to the following formula:
Y(t+1) = X(t+1) − b × X(t)
where X(t) is the value of the sampling point at time t, X(t+1) the value of the sampling point at time t+1, Y(t+1) the pre-emphasized value at time t+1, and b the pre-emphasis coefficient, which ranges from 0.95 to 1. The first sampling point of the audio is left unchanged.
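A minimal sketch of the pre-emphasis step, assuming b = 0.97 as one value within the stated 0.95–1 range (the function name is hypothetical):

```python
def pre_emphasize(x, b=0.97):
    """Apply Y(t) = X(t) - b * X(t-1); the first sampling point is unchanged."""
    return [x[0]] + [x[t] - b * x[t - 1] for t in range(1, len(x))]

# With b = 1 a constant signal is flattened to zero after the first point.
print(pre_emphasize([1.0, 1.0, 1.0], b=1.0))  # [1.0, 0.0, 0.0]
```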
And adding a Hamming window into the pre-emphasized audio frame, performing fast Fourier transform on the audio frame, and converting the audio from a time domain to a frequency domain to obtain a frequency spectrum of each audio frame.
Finally, the spectrum of each audio frame is converted to the Mel scale (the commonly used mapping is Mel(f) = 2595 × log10(1 + f/700)), a bank of 71 triangular filters equally spaced on the Mel scale is constructed, and the filters are converted back to the frequency domain. Passing the frequency-domain energies through the triangular filters yields a 71-dimensional feature vector for the audio frame. After the feature vectors of the audio frames are weighted and combined, the audio features of the sample speech are obtained.
The audio features of the sample speech may be filter-bank (FBank) features or Mel-frequency cepstral coefficient (MFCC) features.
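The windowing, FFT, and Mel filter-bank steps above can be sketched as follows. This is an illustrative NumPy implementation, not the patent's code: the 71-filter count follows the description, while the FFT size, sample rate, and helper names are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(frame, sample_rate=16000, n_filters=71):
    """Log Mel filter-bank (FBank) energies for one pre-emphasized frame."""
    windowed = frame * np.hamming(len(frame))          # Hamming window
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2      # power spectrum via FFT
    # n_filters triangular filters equally spaced on the Mel scale,
    # converted back to frequency-domain bin indices
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    banks = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            banks[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            banks[i - 1, k] = (right - k) / max(right - center, 1)
    return np.log(banks @ spectrum + 1e-10)
```

Applying `fbank` to each 512-point frame yields one 71-dimensional vector per frame, which are then combined into the utterance-level audio features.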
Finally, the time delay neural network can be trained by utilizing the audio characteristics of the sample voice, and the specific training process can refer to the related prior art, which is not described herein.
Optionally, making a sample voice that meets the current collection environment includes:
obtaining pre-recorded initial speech;
noise information conforming to the current acquisition environment is added into the initial voice, and sample voice conforming to the current acquisition environment is obtained.
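The noise-mixing step can be sketched as below. Mixing at a target signal-to-noise ratio is one common way to add noise information conforming to the acquisition environment; the patent does not fix a particular mixing rule, so the SNR parameter and function name are assumptions:

```python
import numpy as np

def make_sample_speech(clean, noise, snr_db=10.0):
    """Mix environment noise into a pre-recorded clean utterance at a target
    SNR, yielding sample speech that matches the current acquisition environment."""
    noise = np.resize(noise, clean.shape)              # tile/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```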
Specifically, in step S101, training data may be generated that corresponds to the actual conditions under which the current speech is collected, such as room size and noise type.
S102, obtaining a voice recognition result of the sample voice based on the voice recognition model, and determining a correction coefficient according to the accuracy of the voice recognition result of the sample voice.
The executing process of step S102 may include:
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio features of the sample speech with that candidate coefficient to obtain the corrected sample audio features corresponding to it;
decoding the corrected sample audio features corresponding to each candidate coefficient with the speech recognition model, respectively, to obtain the speech recognition result of the sample speech corresponding to each candidate coefficient;
and selecting, from the speech recognition results of the plurality of sample voices, the result with the highest accuracy, and determining the candidate coefficient corresponding to that result as the correction coefficient.
The audio features of the sample speech may also be extracted from the sample speech according to the feature extraction procedure in step S101, which is not described herein.
Specifically, one may traverse from 0.5 to 2.0 in steps of 0.1, i.e., take 0.5, 0.6, 0.7, and so on up to 2.0, thereby obtaining the different candidate coefficients A.
Then, for each candidate coefficient, the audio features of the sample speech are corrected with that coefficient to obtain corrected sample audio features. Writing the candidate coefficients as A1, A2, …, the audio feature of the sample speech as m, and the corrected sample audio feature as n, the correction can be described by the following formula:
n=A1×m
This formula gives the corrected sample audio features obtained with candidate coefficient A1; replacing A1 with A2, A3, … likewise gives the corrected sample audio features corresponding to each of the other candidate coefficients.
The corrected sample audio features corresponding to each candidate coefficient are then decoded with the speech recognition model to obtain recognition results, and the accuracy corresponding to each candidate coefficient is determined.
For example, suppose there are 20 sample voices. For candidate coefficient A1, correcting the audio features of the 20 sample voices with A1 gives 20 corrected audio features, which the speech recognition model decodes into 20 recognition results; the proportion of correct results among the 20 then gives the accuracy corresponding to A1. The accuracy of every other candidate coefficient is obtained in the same way.
Finally, the candidate coefficient with the highest accuracy is selected as the correction coefficient. For example, if the candidate coefficient 1.5 has the highest accuracy, the correction coefficient is determined to be 1.5, which may be written B = 1.5, where B denotes the correction coefficient.
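The grid search of step S102 can be sketched as follows; `model_decode` stands in for the frozen speech recognition model's decoding step, and all names are hypothetical:

```python
import numpy as np

def choose_correction_coefficient(model_decode, sample_feats, references,
                                  lo=0.5, hi=2.0, step=0.1):
    """Try each candidate coefficient A on the sample features, decode with the
    (unchanged) recognition model, and keep the A with the highest accuracy."""
    candidates = np.round(np.arange(lo, hi + step / 2, step), 1)  # 0.5, 0.6, ..., 2.0
    best_a, best_acc = None, -1.0
    for a in candidates:
        hyps = [model_decode(a * m) for m in sample_feats]
        acc = sum(h == r for h, r in zip(hyps, references)) / len(references)
        if acc > best_acc:
            best_a, best_acc = a, acc
    return best_a, best_acc
```

The model itself is never updated; only the scalar multiplier applied to its input features is searched, which is why this matching is cheap compared with retraining.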
S103, correcting the audio characteristics of the current voice by using the correction coefficient to obtain corrected audio characteristics.
Optionally, correcting the audio feature of the current speech by using the correction coefficient to obtain a corrected audio feature, including:
and multiplying the correction coefficient with the audio characteristic of the current voice, and taking the obtained product as the corrected audio characteristic.
Specifically, the audio feature of the current speech is represented by M, and the corrected audio feature is represented by N, and then step S103 may be represented by the following formula:
N=B×M
where B is the correction coefficient determined in step S102.
That is, in the present invention, after the correction coefficient has been determined, each time speech collected in the current acquisition environment is to be recognized with the speech recognition model, audio features are first extracted from the speech (as described in the feature extraction flow of step S101). The audio features are then corrected with the correction coefficient according to step S103 to obtain corrected audio features, and finally the corrected audio features are input into the speech recognition model, which decodes them to obtain the recognition result of the current speech. (The prior art, by contrast, directly decodes the extracted audio features with the speech recognition model.)
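The matched recognition flow just described (extract features, apply N = B × M, decode) can be sketched as; `extract_features` and `model_decode` are hypothetical stand-ins for the feature extractor and the frozen model:

```python
def recognize(current_audio, extract_features, model_decode, correction_b):
    """Recognition in the matched pipeline: extract features M, multiply by the
    correction coefficient B (N = B * M), then decode with the unchanged model."""
    m = extract_features(current_audio)
    n = correction_b * m
    return model_decode(n)

# Toy illustration: a "model" that echoes its input shows the scaling step.
result = recognize(None, lambda a: 2.0, lambda f: f, 1.5)  # 3.0
```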
S104, decoding the corrected audio features by using the voice recognition model to obtain a voice recognition result of the current voice.
In an experiment based on the above flow, the accuracy of the speech recognition model before being matched to a specific acquisition environment (such as an outdoor environment) was 88.9%; after matching according to the method provided by the invention, the accuracy of the speech recognition model reached 100%.
The application provides a matching method for a speech recognition model: if the accuracy with which the speech recognition model recognizes the current speech (i.e., speech collected in the current acquisition environment) is lower than an accuracy threshold, sample speech conforming to the current acquisition environment is produced; speech recognition results for the sample speech are obtained based on the speech recognition model, and a correction coefficient is determined according to their accuracy; the audio features of the current speech are corrected with the correction coefficient; and the corrected audio features are decoded with the speech recognition model to obtain the recognition result of the current speech. When the model's accuracy drops, matching is completed merely by determining a correction coefficient from sample speech, without retraining, which markedly improves matching efficiency.
Therefore, the method can match the speech recognition model to the current acquisition environment with only a small amount of sample speech and a small amount of computing resources.
Although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or electronic device. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In combination with the method for matching a speech recognition model provided in the embodiment of the present application, the embodiment of the present application further provides a device for matching a speech recognition model, referring to fig. 2, the device may include the following units:
the making unit 201 is configured to make a sample voice according with the current collection environment if the accuracy rate is lower than the accuracy rate threshold when the voice recognition model recognizes the current voice.
Wherein, the current voice refers to the voice collected under the current collection environment.
A determining unit 202, configured to obtain a speech recognition result of the sample speech based on the speech recognition model, and determine the correction coefficient according to an accuracy of the speech recognition result of the sample speech.
The correcting unit 203 is configured to correct the audio feature of the current speech by using the correction coefficient, so as to obtain a corrected audio feature.
The decoding unit 204 is configured to decode the corrected audio feature by using the speech recognition model, so as to obtain a speech recognition result of the current speech.
Optionally, when the determining unit 202 obtains a speech recognition result of the sample speech based on the speech recognition model and determines the correction coefficient according to the accuracy of the speech recognition result of the sample speech, the determining unit is specifically configured to:
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio features of the sample speech with that candidate coefficient to obtain the corrected sample audio features corresponding to it;
decoding the corrected sample audio features corresponding to each candidate coefficient with the speech recognition model, respectively, to obtain the speech recognition result of the sample speech corresponding to each candidate coefficient;
and selecting, from the speech recognition results of the plurality of sample voices, the result with the highest accuracy, and determining the candidate coefficient corresponding to that result as the correction coefficient.
Optionally, the correction unit 203 corrects the audio feature of the current voice by using the correction coefficient, and is specifically configured to:
and multiplying the correction coefficient with the audio characteristic of the current voice, and taking the obtained product as the corrected audio characteristic.
Optionally, when the making unit 201 makes a sample voice that meets the current collection environment, the making unit is specifically configured to:
obtaining pre-recorded initial speech;
noise information conforming to the current acquisition environment is added into the initial voice, and sample voice conforming to the current acquisition environment is obtained.
The specific working principle of the device for matching a speech recognition model provided in the embodiment of the present application may refer to relevant steps in the method for matching a speech recognition model provided in any embodiment of the present application, which is not described herein again.
The present application provides a matching device for a speech recognition model, in which, if the accuracy rate at which the speech recognition model recognizes the current speech (that is, the speech collected in the current collection environment) is lower than an accuracy rate threshold, the making unit 201 produces sample speech conforming to the current collection environment; the determining unit 202 obtains a speech recognition result of the sample speech based on the speech recognition model, and determines a correction coefficient according to the accuracy of that result; the correction unit 203 corrects the audio features of the current speech with the correction coefficient to obtain corrected audio features; and the decoding unit 204 decodes the corrected audio features with the speech recognition model to obtain a speech recognition result of the current speech. When the accuracy of the speech recognition model drops, this scheme completes the matching of the speech recognition model merely by determining a correction coefficient from sample speech, without retraining the model, which significantly improves the efficiency of matching the speech recognition model.
The units involved in the embodiments of the present disclosure may be implemented by means of software or by means of hardware. The name of a unit does not in any way constitute a limitation of the unit itself; for example, the first acquisition unit may also be described as "a unit that acquires at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
An embodiment of the present application further provides an electronic device, please refer to fig. 3, which includes a memory 301 and a processor 302.
Wherein the memory 301 is used for storing a computer program;
the processor 302 is configured to execute a computer program, and is specifically configured to implement a method for matching a speech recognition model provided in any embodiment of the present application.
The application also provides a computer storage medium for storing a computer program, which is specifically used for realizing the matching method of the speech recognition model provided by any embodiment of the application when the computer program is executed.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.
The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (8)
1. A method for matching a speech recognition model, comprising:
if the accuracy rate at which the voice recognition model recognizes the current voice is lower than an accuracy rate threshold, producing sample voice conforming to the current collection environment; wherein the current voice refers to the voice collected in the current collection environment;
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio characteristics of the sample voice by using the candidate coefficient to obtain corrected sample audio characteristics corresponding to the candidate coefficient;
respectively decoding the corrected sample audio features corresponding to each candidate coefficient by using the voice recognition model to obtain a voice recognition result of the sample voice corresponding to each candidate coefficient, and determining the accuracy corresponding to each candidate coefficient;
selecting the recognition result of the sample voice with the highest accuracy from the voice recognition results of the plurality of sample voices, and determining the candidate coefficient corresponding to the recognition result with the highest accuracy as a correction coefficient;
correcting the audio characteristics of the current voice by using the correction coefficient to obtain corrected audio characteristics;
and decoding the corrected audio features by using the voice recognition model to obtain a voice recognition result of the current voice.
2. The matching method according to claim 1, wherein said correcting the audio feature of the current voice with the correction coefficient to obtain a corrected audio feature comprises:
and multiplying the correction coefficient by the audio feature of the current voice, and taking the obtained product as the corrected audio feature.
3. The matching method according to claim 1, wherein said producing sample speech conforming to a current acquisition environment comprises:
obtaining pre-recorded initial speech;
and adding noise information conforming to the current acquisition environment into the initial voice to obtain sample voice conforming to the current acquisition environment.
4. A device for matching a speech recognition model, comprising:
the making unit is used for producing sample voice conforming to the current collection environment if the accuracy rate at which the voice recognition model recognizes the current voice is lower than the accuracy rate threshold; wherein the current voice refers to the voice collected in the current collection environment;
a determining unit, configured to obtain a speech recognition result of the sample speech based on the speech recognition model, and determine a correction coefficient according to an accuracy of the speech recognition result of the sample speech;
the correction unit is used for correcting the audio characteristics of the current voice by using the correction coefficient to obtain corrected audio characteristics;
the decoding unit is used for decoding the corrected audio characteristics by utilizing the voice recognition model to obtain a voice recognition result of the current voice;
wherein, when obtaining a speech recognition result of the sample speech based on the speech recognition model and determining the correction coefficient according to the accuracy of that result, the determining unit is specifically configured to:
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio characteristics of the sample voice by using the candidate coefficient to obtain corrected sample audio characteristics corresponding to the candidate coefficient;
respectively decoding the corrected sample audio features corresponding to each candidate coefficient by using the voice recognition model to obtain a voice recognition result of the sample voice corresponding to each candidate coefficient, and determining the accuracy corresponding to each candidate coefficient;
and selecting the recognition result of the sample voice with the highest accuracy from the voice recognition results of a plurality of sample voices, and determining the candidate coefficient corresponding to the recognition result of the sample voice with the highest accuracy as a correction coefficient.
5. The matching device according to claim 4, wherein, when correcting the audio feature of the current voice with the correction coefficient, the correction unit is specifically configured to:
multiply the correction coefficient by the audio feature of the current voice, and take the obtained product as the corrected audio feature.
6. The matching device according to claim 4, wherein, when producing the sample speech conforming to the current collection environment, the making unit is specifically configured to:
obtaining pre-recorded initial speech;
and adding noise information conforming to the current acquisition environment into the initial voice to obtain sample voice conforming to the current acquisition environment.
7. An electronic device comprising a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is configured to execute the computer program, in particular to implement a method for matching a speech recognition model according to any of claims 1 to 3.
8. A computer storage medium storing a computer program, which, when executed, is adapted to carry out a method of matching a speech recognition model according to any one of claims 1 to 3.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110627036.2A (CN113345428B) | 2021-06-04 | 2021-06-04 | Speech recognition model matching method, device, equipment and storage medium |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113345428A | 2021-09-03 |
| CN113345428B | 2023-08-04 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |