CN113345428B - Speech recognition model matching method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113345428B
Authority
CN
China
Prior art keywords
voice
sample
recognition model
current
speech
Prior art date
Legal status
Active
Application number
CN202110627036.2A
Other languages
Chinese (zh)
Other versions
CN113345428A (en)
Inventor
岑吴镕
李骊
Current Assignee
Beijing HJIMI Technology Co Ltd
Original Assignee
Beijing HJIMI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing HJIMI Technology Co Ltd filed Critical Beijing HJIMI Technology Co Ltd
Priority to CN202110627036.2A
Publication of CN113345428A
Application granted
Publication of CN113345428B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/08 Speech classification or search

Abstract

The application provides a method, device, equipment, and storage medium for matching a speech recognition model. The method includes: if the accuracy with which the speech recognition model recognizes current speech (i.e., speech collected in the current acquisition environment) is lower than an accuracy threshold, producing sample speech that matches the current acquisition environment; obtaining a speech recognition result for the sample speech based on the speech recognition model, and determining a correction coefficient according to the accuracy of that result; correcting the audio features of the current speech with the correction coefficient to obtain corrected audio features; and decoding the corrected audio features with the speech recognition model to obtain the speech recognition result for the current speech. When the accuracy of the speech recognition model drops, this scheme completes the matching merely by determining a correction coefficient from sample speech, without retraining the model, so matching efficiency is markedly improved.

Description

Speech recognition model matching method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for matching a speech recognition model.
Background
Speech recognition based on a speech recognition model mainly comprises two parts: extracting the audio features of the speech, and decoding those features with the model to obtain a speech recognition result (usually the text corresponding to the speech). In practice, the recognition result output by the model is often inaccurate because the audio acquisition environment does not match the one the model was trained in. For example, a speech recognition model trained on audio samples recorded indoors loses accuracy when used to recognize audio recorded outdoors.
When this occurs, the speech recognition model must be matched to the new environment to restore its accuracy. Existing matching methods generally produce audio that matches the specific acquisition environment as training samples and retrain the speech recognition model, thereby improving its accuracy in that environment.
Retraining the speech recognition model takes a long time each time it is performed, so existing matching schemes are inefficient.
Disclosure of Invention
In view of the foregoing drawbacks of the prior art, the present invention provides a method, apparatus, device, and storage medium for matching a speech recognition model, so as to provide an efficient matching scheme for a speech recognition model.
A first aspect of the present application provides a method for matching a speech recognition model, including:
if the accuracy with which the speech recognition model recognizes current speech is lower than an accuracy threshold, producing sample speech that matches the current acquisition environment; where current speech refers to speech collected in the current acquisition environment;
obtaining a speech recognition result for the sample speech based on the speech recognition model, and determining a correction coefficient according to the accuracy of that result;
correcting the audio features of the current speech with the correction coefficient to obtain corrected audio features;
and decoding the corrected audio features with the speech recognition model to obtain a speech recognition result for the current speech.
Optionally, obtaining the speech recognition result of the sample speech based on the speech recognition model and determining the correction coefficient according to the accuracy of that result includes:
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio features of the sample speech with that coefficient to obtain the corrected sample audio features corresponding to it;
decoding the corrected sample audio features corresponding to each candidate coefficient with the speech recognition model to obtain the speech recognition result of the sample speech corresponding to each candidate coefficient;
and selecting, from the speech recognition results of the plurality of sample speech utterances, the result with the highest accuracy, and determining the candidate coefficient corresponding to that result as the correction coefficient.
Optionally, correcting the audio features of the current speech with the correction coefficient to obtain corrected audio features includes:
multiplying the correction coefficient by the audio features of the current speech, and taking the resulting product as the corrected audio features.
Optionally, producing sample speech that matches the current acquisition environment includes:
obtaining pre-recorded initial speech;
and adding noise information that matches the current acquisition environment to the initial speech to obtain sample speech that matches the current acquisition environment.
A second aspect of the present application provides a device for matching a speech recognition model, including:
a production unit, configured to produce sample speech that matches the current acquisition environment if the accuracy with which the speech recognition model recognizes the current speech is lower than the accuracy threshold; where current speech refers to speech collected in the current acquisition environment;
a determining unit, configured to obtain a speech recognition result for the sample speech based on the speech recognition model, and determine a correction coefficient according to the accuracy of that result;
a correction unit, configured to correct the audio features of the current speech with the correction coefficient to obtain corrected audio features;
and a decoding unit, configured to decode the corrected audio features with the speech recognition model to obtain a speech recognition result for the current speech.
Optionally, when obtaining the speech recognition result of the sample speech based on the speech recognition model and determining the correction coefficient according to the accuracy of that result, the determining unit is specifically configured to:
obtain a plurality of candidate coefficients;
for each candidate coefficient, correct the audio features of the sample speech with that coefficient to obtain the corrected sample audio features corresponding to it;
decode the corrected sample audio features corresponding to each candidate coefficient with the speech recognition model to obtain the speech recognition result of the sample speech corresponding to each candidate coefficient;
and select, from the speech recognition results of the plurality of sample speech utterances, the result with the highest accuracy, and determine the candidate coefficient corresponding to that result as the correction coefficient.
Optionally, when correcting the audio features of the current speech with the correction coefficient, the correction unit is specifically configured to:
multiply the correction coefficient by the audio features of the current speech, and take the resulting product as the corrected audio features.
Optionally, when producing sample speech that matches the current acquisition environment, the production unit is specifically configured to:
obtain pre-recorded initial speech;
and add noise information that matches the current acquisition environment to the initial speech to obtain sample speech that matches the current acquisition environment.
A third aspect of the present application provides an electronic device comprising a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is configured to execute the computer program, and in particular, is configured to implement a method for matching a speech recognition model provided in any one of the first aspects of the present application.
A fourth aspect of the present application provides a computer storage medium for storing a computer program, which when executed is specifically configured to implement the method for matching a speech recognition model provided in any one of the first aspects of the present application.
The application provides a method, device, equipment, and storage medium for matching a speech recognition model. The method includes: if the accuracy with which the speech recognition model recognizes current speech (i.e., speech collected in the current acquisition environment) is lower than an accuracy threshold, producing sample speech that matches the current acquisition environment; obtaining a speech recognition result for the sample speech based on the speech recognition model, and determining a correction coefficient according to the accuracy of that result; correcting the audio features of the current speech with the correction coefficient to obtain corrected audio features; and decoding the corrected audio features with the speech recognition model to obtain the speech recognition result for the current speech. When the accuracy of the speech recognition model drops, this scheme completes the matching merely by determining a correction coefficient from sample speech, without retraining the model, so matching efficiency is markedly improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for matching a speech recognition model according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a matching device of a speech recognition model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the field of speech recognition, if the conditions a speech recognition model was trained for do not match the conditions under which speech is actually collected, it is difficult to obtain good recognition results. In that case it is usually necessary to recreate sample data matching the actual acquisition conditions and then retrain the speech recognition model to restore recognition accuracy, a process that requires significant time and computing resources. This patent proposes a new method that instead achieves the matching by correcting the audio features.
That is, in the prior art, recreating samples takes substantial time, and retraining takes further time and computing resources.
In this scheme, the model is matched to the actual environment by applying a correction coefficient derived from the characteristics of that environment. Only a small amount of sample data matching the actual conditions needs to be produced, and the speech recognition model does not have to be retrained, so the time required for matching is greatly reduced and matching efficiency is improved.
Referring to fig. 1, the method for matching a speech recognition model may include the following steps:
s101, if the accuracy rate of the voice recognition model for recognizing the current voice is lower than an accuracy rate threshold value, making sample voice conforming to the current acquisition environment.
Wherein, the current voice refers to the voice collected under the current collection environment.
It should be noted that the number of the current voices in step S101 is generally plural. Specifically, one or more users (or testers) can be respectively recorded for multiple times in the current collection environment, so that multiple pieces of current voices generated by the users (or testers) in the current collection environment can be obtained. The number of the produced sample voices can be plural.
The accuracy of the speech recognition model may be represented by the proportion of the plurality of current voices that the correct voice is recognized, for example, if the speech recognition result of only 12 current voices is correct in the 20 current voices, the accuracy of the speech recognition model may be considered to be 70% when the speech recognition model is used for recognizing the voices in the current acquisition environment.
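As a sketch, the utterance-level accuracy described above can be computed as the fraction of recognition results that exactly match their reference transcripts (the function name and the exact-match criterion are illustrative assumptions, not specified by the patent):

```python
def recognition_accuracy(results, references):
    """Fraction of utterances whose recognition result exactly
    matches the corresponding reference transcript."""
    assert len(results) == len(references)
    correct = sum(1 for hyp, ref in zip(results, references) if hyp == ref)
    return correct / len(references)
```

With 14 correct results out of 20 utterances this returns 0.7, matching the example above.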
The speech recognition model of the present invention may be a time-delay neural network (TDNN) model.
The speech recognition model can be obtained through the following training process:
each sample audio is first subjected to audio framing, typically a segment of audio is made up of a plurality of consecutive sample points, each L consecutive sample points can be divided into one audio frame when audio framing is performed, and each time an audio frame is divided, K sample points are moved backward from the first sample point of the audio frame, and consecutive L sample points after being taken again from the moved sample point are taken as another audio frame, and so on, whereby a segment of sample audio can be divided into a plurality of audio frames.
Generally, L may be set to 512, or may be set to 400, k may be set to 160, or may be adjusted to other integer values according to the actual situation. When L is 512 and k is 160, the above-mentioned audio frame division corresponds to taking every 512 sampling points as one audio frame, and shifting every 160 sampling points.
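The framing procedure above can be sketched as follows; this is a minimal illustration assuming the audio is already a flat sequence of sample points, and the function name is hypothetical:

```python
import numpy as np

def frame_audio(samples, frame_len=512, hop=160):
    """Split a sequence of sample points into overlapping frames:
    each frame is `frame_len` consecutive samples, and successive
    frames start `hop` samples apart (trailing samples that cannot
    fill a whole frame are dropped)."""
    n_frames = 1 + (len(samples) - frame_len) // hop
    return np.stack([samples[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```

For 1000 sample points with L = 512 and K = 160 this yields 4 frames, the second starting at sample 160.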
After obtaining a plurality of audio frames, feature vectors of each audio frame may be extracted by the following feature extraction procedure:
For each audio frame, pre-emphasis is applied according to the following formula:
Y(t+1) = X(t+1) − b × X(t)
where X(t) is the value of the sample point at time t, X(t+1) the value of the sample point at time t+1, Y(t+1) the pre-emphasized value of the sample point at time t+1, and b the pre-emphasis coefficient, which ranges from 0.95 to 1. The first sample point of the audio is left unchanged.
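A minimal sketch of this pre-emphasis step; the default b = 0.97 is a typical choice within the stated 0.95 to 1 range, not a value fixed by the patent:

```python
import numpy as np

def pre_emphasize(frame, b=0.97):
    """Apply Y[t] = X[t] - b * X[t-1] to a frame of sample points;
    the first sample point is left unchanged."""
    x = np.asarray(frame, dtype=float)
    y = x.copy()
    y[1:] = x[1:] - b * x[:-1]
    return y
```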
A Hamming window is then applied to each pre-emphasized audio frame, and a fast Fourier transform converts the audio from the time domain to the frequency domain, yielding the spectrum of each audio frame.
Finally, the spectrum of each audio frame is converted to the mel scale (the standard mel mapping is m = 2595 × log10(1 + f/700)), the mel range is divided evenly among 71 triangular filters, and the filters are converted back to the frequency domain. Passing the corresponding frequency-domain energies through the triangular filters yields a 71-dimensional feature vector for the audio frame. After the feature vectors of the audio frames are weighted and combined, the audio features of the sample speech are obtained.
The audio features of the sample speech may be filter-bank (FBank) features or mel-frequency cepstral coefficient (MFCC) features.
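The windowing, FFT, and 71-filter mel filter-bank steps might be sketched as follows. This is an illustrative FBank implementation, not the patent's exact one: the 16 kHz sample rate, the mel mapping m = 2595 × log10(1 + f/700), and the final log compression are common-practice assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank_features(frame, sample_rate=16000, n_filters=71):
    """FBank feature vector for one pre-emphasized frame:
    Hamming window -> FFT power spectrum -> 71 triangular filters
    spaced evenly on the mel scale -> log filter energies."""
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    # Filter edges equally spaced on the mel scale, mapped back to FFT bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        if center > left:   # rising edge of the triangle
            fbank[i, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:  # falling edge of the triangle
            fbank[i, center:right] = (right - np.arange(center, right)) / (right - center)
    return np.log(fbank @ power + 1e-10)
```

Applied to one 512-sample frame, this returns the 71-dimensional vector described above; stacking the per-frame vectors gives the utterance's audio features.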
Finally, the time delay neural network can be trained by utilizing the audio characteristics of the sample voice, and the specific training process can refer to the related prior art, which is not described herein.
Optionally, producing sample speech that matches the current acquisition environment includes:
obtaining pre-recorded initial speech;
and adding noise information that matches the current acquisition environment to the initial speech to obtain sample speech that matches the current acquisition environment.
Specifically, in step S101, sample data can be generated to match the actual conditions under which the current speech is collected, such as room size and noise type.
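A hedged sketch of producing sample speech by mixing environment noise into the pre-recorded initial speech; mixing at a target signal-to-noise ratio, and the function and parameter names, are illustrative choices not specified in the text:

```python
import numpy as np

def make_sample_speech(initial_speech, env_noise, snr_db=10.0):
    """Mix pre-recorded clean speech with noise recorded in (or modeled
    after) the current acquisition environment, at a chosen SNR in dB."""
    speech = np.asarray(initial_speech, dtype=float)
    # Tile or truncate the noise to the length of the speech.
    noise = np.resize(np.asarray(env_noise, dtype=float), len(speech))
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mix has (approximately) the requested SNR.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```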
S102. Obtain a speech recognition result for the sample speech based on the speech recognition model, and determine a correction coefficient according to the accuracy of that result.
Step S102 may be carried out as follows:
obtain a plurality of candidate coefficients;
for each candidate coefficient, correct the audio features of the sample speech with that coefficient to obtain the corrected sample audio features corresponding to it;
decode the corrected sample audio features corresponding to each candidate coefficient with the speech recognition model to obtain the speech recognition result of the sample speech corresponding to each candidate coefficient;
and select, from the speech recognition results of the plurality of sample speech utterances, the result with the highest accuracy, and determine the candidate coefficient corresponding to it as the correction coefficient.
The audio features of the sample speech can likewise be extracted according to the feature extraction procedure in step S101, which is not repeated here.
Specifically, the candidate coefficients A can be obtained by traversing from 0.5 to 2.0 in steps of 0.1, i.e. taking 0.5, 0.6, 0.7, and so on up to 2.0.
Then, for each candidate coefficient, the audio features of the sample speech are corrected with that coefficient to obtain corrected sample audio features. Denoting the candidate coefficients A1, A2, …, the audio features of the sample speech m, and the corrected sample audio features n, the correction can be described by the following formula:
n = A1 × m
This gives the corrected sample audio features for candidate coefficient A1; replacing A1 in the formula with A2, A3, and so on yields the corrected sample audio features corresponding to each candidate coefficient.
Finally, the corrected sample audio features corresponding to each candidate coefficient are decoded with the speech recognition model to obtain speech recognition results, and the accuracy corresponding to each candidate coefficient is determined.
Suppose there are 20 sample speech utterances. For candidate coefficient A1, correcting the audio features of the 20 utterances with A1 yields 20 corrected feature sets, which the speech recognition model decodes into 20 recognition results; the proportion of correct results among the 20 gives the accuracy corresponding to A1. The accuracies of the other candidate coefficients are obtained in the same way.
The candidate coefficient with the highest accuracy is then selected as the correction coefficient. For example, if candidate coefficient 1.5 has the highest accuracy, the correction coefficient is determined to be 1.5, written B = 1.5, where B denotes the correction coefficient.
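The candidate-coefficient search of step S102 can be sketched as a simple grid search; the `recognize` callback, standing in for feature-corrected decoding with the fixed model, is an assumption of this sketch:

```python
import numpy as np

def choose_correction_coefficient(sample_feats, references, recognize):
    """Try candidate coefficients 0.5, 0.6, ..., 2.0. For each, scale the
    sample audio features, decode them with the (fixed) recognition model
    via `recognize`, and keep the coefficient with the highest accuracy."""
    candidates = np.round(np.arange(0.5, 2.0 + 1e-9, 0.1), 1)
    best_coeff, best_acc = None, -1.0
    for a in candidates:
        results = [recognize(a * m) for m in sample_feats]
        acc = sum(r == ref for r, ref in zip(results, references)) / len(references)
        if acc > best_acc:
            best_coeff, best_acc = a, acc
    return best_coeff, best_acc
```

The model itself is never updated; only the scalar applied to the features changes, which is what makes this far cheaper than retraining.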
S103. Correct the audio features of the current speech with the correction coefficient to obtain corrected audio features.
Optionally, correcting the audio features of the current speech with the correction coefficient to obtain corrected audio features includes:
multiplying the correction coefficient by the audio features of the current speech, and taking the resulting product as the corrected audio features.
Specifically, the audio feature of the current speech is represented by M, and the corrected audio feature is represented by N, and then step S103 may be represented by the following formula:
N=B×M
where B is the correction coefficient determined in step S102.
That is, in the present invention, once the correction coefficient has been determined, each time the speech recognition model is used to recognize speech collected in the current acquisition environment, the audio features are first extracted from that speech (following the feature extraction procedure of step S101). The features are then corrected with the correction coefficient according to step S103, and the corrected audio features are input into the speech recognition model, which decodes them to obtain the speech recognition result for the current speech. (In the prior art, by contrast, the extracted audio features are decoded directly by the speech recognition model.)
S104, decoding the corrected audio features by using the voice recognition model to obtain a voice recognition result of the current voice.
In an experiment based on the above flow, the accuracy of the speech recognition model before matching to a specific acquisition environment (such as an outdoor environment) was 88.9%; after matching according to the method provided by the invention, the accuracy of the speech recognition model reached 100%.
The application provides a method for matching a speech recognition model: if the accuracy with which the speech recognition model recognizes current speech (i.e., speech collected in the current acquisition environment) is lower than an accuracy threshold, sample speech matching the current acquisition environment is produced; a speech recognition result for the sample speech is obtained based on the speech recognition model, and a correction coefficient is determined according to the accuracy of that result; the audio features of the current speech are corrected with the correction coefficient to obtain corrected audio features; and the corrected audio features are decoded with the speech recognition model to obtain the speech recognition result for the current speech. When the accuracy of the speech recognition model drops, this scheme completes the matching merely by determining a correction coefficient from sample speech, without retraining the model, so matching efficiency is markedly improved.
The method therefore matches the speech recognition model to the current acquisition environment using only a small amount of sample speech and few computing resources.
Although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or electronic device. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In combination with the method for matching a speech recognition model provided in the embodiments of the present application, the embodiments further provide a device for matching a speech recognition model. Referring to FIG. 2, the device may include the following units:
a production unit 201, configured to produce sample speech that matches the current acquisition environment if the accuracy with which the speech recognition model recognizes the current speech is lower than the accuracy threshold.
Here, current speech refers to speech collected in the current acquisition environment.
A determining unit 202, configured to obtain a speech recognition result for the sample speech based on the speech recognition model, and determine the correction coefficient according to the accuracy of that result.
A correction unit 203, configured to correct the audio features of the current speech with the correction coefficient to obtain corrected audio features.
A decoding unit 204, configured to decode the corrected audio features with the speech recognition model to obtain the speech recognition result for the current speech.
Optionally, when the determining unit 202 obtains a speech recognition result of the sample speech based on the speech recognition model and determines the correction coefficient according to the accuracy of the speech recognition result of the sample speech, the determining unit is specifically configured to:
obtaining a plurality of candidate coefficients;
for each alternative coefficient, correcting the audio characteristics of the sample voice by using the alternative coefficient to obtain corrected sample audio characteristics corresponding to the alternative coefficient;
respectively decoding the corrected sample audio features corresponding to each alternative coefficient by using a voice recognition model to obtain a voice recognition result of sample voice corresponding to each alternative coefficient;
and selecting the recognition result of the sample voice with the highest accuracy from the voice recognition results of the plurality of sample voices, and determining the candidate coefficient corresponding to the recognition result of the sample voice with the highest accuracy as the correction coefficient.
Optionally, when the correction unit 203 corrects the audio feature of the current voice using the correction coefficient, it is specifically configured to:
multiply the correction coefficient by the audio feature of the current voice, and take the resulting product as the corrected audio feature.
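As a minimal illustration (assuming the audio feature is a per-frame feature matrix, which the patent does not specify), the multiplicative correction is a plain element-wise scaling:

```python
import numpy as np

def correct_features(features, coefficient):
    """Scale a feature matrix (frames x feature dims) by the scalar
    correction coefficient (illustrative sketch)."""
    return coefficient * np.asarray(features, dtype=float)
```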
Optionally, when the making unit 201 produces a sample voice conforming to the current collection environment, it is specifically configured to:
obtaining pre-recorded initial speech;
adding noise information conforming to the current collection environment to the initial voice, so as to obtain a sample voice conforming to the current collection environment.
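One common way to realize this noise-addition step is to scale recorded environment noise to a chosen signal-to-noise ratio before mixing it into the clean recording. The patent does not prescribe a mixing rule, so the target-SNR formulation below is an assumption.

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix environment noise into a pre-recorded clean utterance at a target
    signal-to-noise ratio (illustrative sketch)."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Tile and trim the noise to the utterance length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```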
For the specific working principle of the matching device for a speech recognition model provided in this embodiment of the present application, reference may be made to the relevant steps of the matching method for a speech recognition model provided in any embodiment of the present application, and details are not repeated here.
The present application provides a matching device for a speech recognition model. If the accuracy of the speech recognition model in recognizing the current voice (that is, the voice collected in the current collection environment) is lower than the accuracy threshold, the making unit 201 produces a sample voice conforming to the current collection environment; the determining unit 202 obtains a speech recognition result of the sample speech based on the speech recognition model and determines a correction coefficient according to the accuracy of that result; the correction unit 203 corrects the audio feature of the current voice with the correction coefficient to obtain a corrected audio feature; and the decoding unit 204 decodes the corrected audio feature with the speech recognition model to obtain a speech recognition result of the current voice. When the accuracy of the speech recognition model drops, this scheme completes the matching of the model merely by determining a correction coefficient from the sample speech, without retraining the model, which significantly improves the efficiency of matching the speech recognition model.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
An embodiment of the present application further provides an electronic device, please refer to fig. 3, which includes a memory 301 and a processor 302.
Wherein the memory 301 is used for storing a computer program;
the processor 302 is configured to execute a computer program, and is specifically configured to implement a method for matching a speech recognition model provided in any embodiment of the present application.
The present application further provides a computer storage medium storing a computer program which, when executed, implements the matching method for a speech recognition model provided in any embodiment of the present application.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.
The previous description of the disclosed embodiments is provided to enable those skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method for matching a speech recognition model, comprising:
if the accuracy rate is lower than the accuracy rate threshold value when the voice recognition model recognizes the current voice, making sample voice conforming to the current acquisition environment; wherein, the current voice refers to the voice collected in the current collection environment;
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio characteristics of the sample voice by using the candidate coefficient to obtain corrected sample audio characteristics corresponding to the candidate coefficient;
respectively decoding the corrected sample audio features corresponding to each candidate coefficient by using the voice recognition model to obtain a voice recognition result of the sample voice corresponding to each candidate coefficient, and determining the accuracy corresponding to each candidate coefficient;
selecting the recognition result of the sample voice with the highest accuracy from the voice recognition results of a plurality of sample voices, and determining the candidate coefficient corresponding to the recognition result of the sample voice with the highest accuracy as a correction coefficient;
correcting the audio characteristics of the current voice by using the correction coefficient to obtain corrected audio characteristics;
and decoding the corrected audio features by using the voice recognition model to obtain a voice recognition result of the current voice.
2. The matching method according to claim 1, wherein said modifying the audio feature of the current speech with the correction coefficient to obtain a modified audio feature comprises:
and multiplying the correction coefficient by the audio feature of the current voice, and taking the resulting product as the corrected audio feature.
3. The matching method according to claim 1, wherein said producing sample speech conforming to a current acquisition environment comprises:
obtaining pre-recorded initial speech;
and adding noise information conforming to the current acquisition environment into the initial voice to obtain sample voice conforming to the current acquisition environment.
4. A device for matching a speech recognition model, comprising:
the making unit is used for producing a sample voice conforming to the current collection environment if the accuracy of the voice recognition model in recognizing the current voice is lower than the accuracy threshold; wherein the current voice refers to the voice collected in the current collection environment;
a determining unit, configured to obtain a speech recognition result of the sample speech based on the speech recognition model, and determine a correction coefficient according to an accuracy of the speech recognition result of the sample speech;
the correction unit is used for correcting the audio characteristics of the current voice by using the correction coefficient to obtain corrected audio characteristics;
the decoding unit is used for decoding the corrected audio characteristics by utilizing the voice recognition model to obtain a voice recognition result of the current voice;
the determining unit obtains a speech recognition result of the sample speech based on the speech recognition model, and determines a correction coefficient according to an accuracy of the speech recognition result of the sample speech, and is specifically configured to:
obtaining a plurality of candidate coefficients;
for each candidate coefficient, correcting the audio characteristics of the sample voice by using the candidate coefficient to obtain corrected sample audio characteristics corresponding to the candidate coefficient;
respectively decoding the corrected sample audio features corresponding to each candidate coefficient by using the voice recognition model to obtain a voice recognition result of the sample voice corresponding to each candidate coefficient, and determining the accuracy corresponding to each candidate coefficient;
and selecting the recognition result of the sample voice with the highest accuracy from the voice recognition results of a plurality of sample voices, and determining the candidate coefficient corresponding to the recognition result of the sample voice with the highest accuracy as a correction coefficient.
5. The matching device according to claim 4, wherein the correction unit corrects the audio feature of the current speech using the correction coefficient, and is specifically configured to:
and multiplying the correction coefficient by the audio feature of the current voice, and taking the resulting product as the corrected audio feature.
6. The matching device according to claim 4, wherein the making unit, when producing the sample speech conforming to the current collection environment, is configured to:
obtaining pre-recorded initial speech;
and adding noise information conforming to the current acquisition environment into the initial voice to obtain sample voice conforming to the current acquisition environment.
7. An electronic device comprising a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is configured to execute the computer program, in particular to implement a method for matching a speech recognition model according to any of claims 1 to 3.
8. A computer storage medium storing a computer program, which, when executed, is adapted to carry out a method of matching a speech recognition model according to any one of claims 1 to 3.
CN202110627036.2A 2021-06-04 2021-06-04 Speech recognition model matching method, device, equipment and storage medium Active CN113345428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110627036.2A CN113345428B (en) 2021-06-04 2021-06-04 Speech recognition model matching method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110627036.2A CN113345428B (en) 2021-06-04 2021-06-04 Speech recognition model matching method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113345428A CN113345428A (en) 2021-09-03
CN113345428B true CN113345428B (en) 2023-08-04

Family

ID=77475253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110627036.2A Active CN113345428B (en) 2021-06-04 2021-06-04 Speech recognition model matching method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113345428B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7024359B2 (en) * 2001-01-31 2006-04-04 Qualcomm Incorporated Distributed voice recognition system using acoustic feature vector modification
JP2002366187A (en) * 2001-06-08 2002-12-20 Sony Corp Device and method for recognizing voice, program and recording medium
JP5161183B2 (en) * 2009-09-29 2013-03-13 日本電信電話株式会社 Acoustic model adaptation apparatus, method, program, and recording medium
US9288594B1 (en) * 2012-12-17 2016-03-15 Amazon Technologies, Inc. Auditory environment recognition
CN104392718B (en) * 2014-11-26 2017-11-24 河海大学 A kind of robust speech recognition methods based on acoustic model array
JP6637333B2 (en) * 2016-02-23 2020-01-29 日本放送協会 Acoustic model generation device and its program
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model

Also Published As

Publication number Publication date
CN113345428A (en) 2021-09-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant