CN113327597B - Speech recognition method, medium, device and computing equipment - Google Patents


Info

Publication number
CN113327597B
Authority
CN
China
Prior art keywords
decoding
acoustic
voice recognition
level
target probability
Prior art date
Legal status
Active
Application number
CN202110698074.7A
Other languages
Chinese (zh)
Other versions
CN113327597A (en)
Inventor
杨震
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202110698074.7A priority Critical patent/CN113327597B/en
Publication of CN113327597A publication Critical patent/CN113327597A/en
Application granted granted Critical
Publication of CN113327597B publication Critical patent/CN113327597B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

Embodiments of the present disclosure provide a speech recognition method, medium, apparatus and computing device. The method includes: extracting features from the audio data to be recognized to obtain acoustic features corresponding to the audio data; inputting the acoustic features into a plurality of pre-trained speech recognition models to obtain a plurality of target probability distributions, one for each speech recognition model; and performing fusion decoding on the plurality of target probability distributions to obtain the recognition result of the audio data. Embodiments of the present disclosure can improve the accuracy of Chinese speech recognition, are applicable to a variety of scenarios, and improve the robustness of the models.

Description

Speech recognition method, medium, device and computing equipment
Technical Field
Embodiments of the present disclosure relate to the field of speech recognition technology, and more particularly to a multi-model-based speech recognition method, medium, apparatus, and computing device.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
At present, Chinese speech recognition models often produce inaccurate results in certain scenarios, for example when the speech to be recognized contains many homophones or rare Chinese characters, or when the training data is unevenly distributed. For instance, speech meaning "I am sweating all over and want to take a bath" is easily misrecognized as ending with "early" instead of "bath": the two characters are homophones in Chinese, and "early" appears far more frequently in the training data than "bath", so this imbalance leads to an inaccurate recognition result.
The related art offers two ways to address this problem: data augmentation and multi-model secondary scoring. Data augmentation applies operations such as volume perturbation, speaking-rate adjustment and spectral masking to speech data that is prone to recognition errors, thereby increasing the diversity of the data. Multi-model secondary scoring trains two models in a multi-task learning manner; at recognition time, one model generates several most likely candidate text sequences and the other model rescores them, so that the most likely text sequence is selected as the recognition result.
However, the first approach only slightly perturbs the original audio data; the corresponding text labels remain unchanged and no new label information is introduced, so the improvement in recognition rate for these scenarios is limited, and rare words that never appear in the speech data set still cannot be handled. The second approach places high demands on the accuracy of the model that generates the candidate text sequences: if none of the generated candidates is correct, secondary scoring cannot improve recognition accuracy. Moreover, when the two models use different modeling units, such as a Chinese-character modeling unit and a pinyin modeling unit, the mapping space between the different units explodes during secondary scoring, the computation becomes enormous, and the approach is difficult to implement in a real production environment.
Disclosure of Invention
The present disclosure aims to provide a speech recognition method and apparatus.
In a first aspect of embodiments of the present disclosure, there is provided a voice recognition method, including:
extracting features of audio data to be identified to obtain acoustic features corresponding to the audio data;
inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic tag systems, and the corresponding target probability distributions represent the matching degree between each acoustic tag and the acoustic feature under the level of acoustic tag systems;
And carrying out fusion decoding on the target probability distributions to obtain the identification result of the audio data.
In one embodiment of the present disclosure, the acoustic tag systems of multiple levels include at least two of: a text-level acoustic tag system, a syllable-level acoustic tag system, a phoneme-level acoustic tag system, and a phoneme-level acoustic tag system with contextual information.
In one embodiment of the present disclosure, the performing fusion decoding on the multiple target probability distributions to obtain the recognition result of the audio data includes:
constructing decoding paths of each voice recognition model according to the target probability distributions, wherein the decoding paths represent the recognition process of the corresponding voice recognition models on acoustic features, and the decoding paths obtained after the recognition is finished represent acoustic tags recognized for the acoustic features;
and calculating a decoding objective function based on the decoding paths of the voice recognition models, finding a decoding path which maximizes the decoding objective function, and taking the decoding path as a recognition result of the audio data.
In one embodiment of the present disclosure, the constructing a decoding path of each speech recognition model according to the plurality of target probability distributions includes:
And determining elements in the acoustic tag system corresponding to each voice recognition model, and constructing a decoding path of the voice recognition model by taking the elements as prefixes and based on target probability distribution corresponding to the voice recognition model.
In one embodiment of the present disclosure, the constructing a decoding path of the speech recognition model based on the target probability distribution corresponding to the speech recognition model with the element as a prefix includes:
selecting a candidate result of a next element according to target probability distribution of the next element by taking a first element in an acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path for decoding by the prefix and the candidate result;
and so on: in each decoding step, the previous decoding path is taken as the current prefix and combined with the next element to construct the current decoding path, until the complete decoding path is obtained.
In one embodiment of the present disclosure, the selecting the candidate result of the next element according to the target probability distribution of the next element includes:
sorting the recognition results of the next element from high to low according to the target probability distribution, and selecting a specified number of the highest-ranked target probabilities;
And taking the identification result corresponding to the selected target probability distribution as a candidate result of the next element.
In one embodiment of the present disclosure, the calculating a decoding objective function based on the decoding paths of the respective speech recognition models includes:
calculating prefix scores corresponding to the respective speech recognition models based on the decoding paths of the respective speech recognition models;
and multiplying the prefix scores of the voice recognition models with the set corresponding weights, and then summing all obtained products to obtain a decoding objective function.
In one embodiment of the present disclosure, the calculating a prefix score corresponding to the respective speech recognition model based on the decoding paths of the respective speech recognition models includes:
and calculating the selection rate of the decoding paths of each voice recognition model, summing the obtained selection rates of all decoding paths, and taking the logarithm to obtain the prefix score of the voice recognition model.
In one embodiment of the present disclosure, the finding a decoding path that maximizes the decoding objective function, taking the decoding path as the recognition result of the audio data, includes:
uniformly converting decoding paths of each voice recognition model in the decoding objective function into decoding paths of designated levels by adopting a preset sparse matrix, wherein the sparse matrix is a mapping relation among acoustic tag system elements of different levels and is used for converting acoustic tag system elements of one level into acoustic tag system elements of another level;
And finding a decoding path which maximizes the decoding objective function, and taking the decoding path as a recognition result of the audio data at the designated level.
In one embodiment of the present disclosure, the specified level is a text level, syllable level, phoneme level, or phoneme level with contextual background information.
In one embodiment of the present disclosure, the method further comprises at least one of:
if the statement end mark is detected during decoding, ending decoding;
if the mute mark is detected and exceeds the appointed time length during decoding, ending decoding;
and if the current state accords with the appointed ending state during decoding, ending decoding.
In one embodiment of the disclosure, the feature extraction of the audio data to be identified to obtain acoustic features corresponding to the audio data includes:
sampling audio data to be identified according to windows and intervals of specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating energy of a Mel space according to the result of the discrete Fourier transform;
and performing discrete cosine transform on the energy of the Mel space after filtering to obtain Mel frequency cepstrum coefficients, and taking the Mel frequency cepstrum coefficients as acoustic features corresponding to the audio data.
In one embodiment of the present disclosure, the method further comprises:
and training the plurality of voice recognition models by using voice training data and an initial label, wherein the initial label is the text of the coarsest level corresponding to the voice training data.
In one embodiment of the present disclosure, the training the plurality of speech recognition models using the speech training data and the initial label includes:
inputting voice training data into the voice recognition models to respectively obtain corresponding acoustic labels;
calculating a cost function for each obtained acoustic tag, setting corresponding weights, multiplying the cost function of each acoustic tag by the corresponding weights, and summing to obtain a total cost function;
and training the plurality of voice recognition models according to the initial label by taking the minimum total cost function as a target.
In a second aspect of embodiments of the present disclosure, there is provided a voice recognition apparatus, comprising:
the extraction module is used for extracting characteristics of the audio data to be identified so as to obtain acoustic characteristics corresponding to the audio data;
the recognition module is used for inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, the target probability distributions correspond to a plurality of levels of acoustic tag systems, and the corresponding target probability distributions represent the matching degree between each acoustic tag and the acoustic feature under the level of acoustic tag systems;
And the fusion module is used for carrying out fusion decoding on the multiple target probability distributions so as to obtain the identification result of the audio data.
In one embodiment of the present disclosure, the acoustic tag systems of multiple levels include at least two of: a text-level acoustic tag system, a syllable-level acoustic tag system, a phoneme-level acoustic tag system, and a phoneme-level acoustic tag system with contextual information.
In one embodiment of the present disclosure, the fusion module includes:
the construction submodule is used for constructing decoding paths of each voice recognition model according to the target probability distributions, the decoding paths represent the recognition process of the corresponding voice recognition models on the acoustic features, and the decoding paths obtained after the recognition is finished represent the acoustic labels recognized for the acoustic features;
and the calculating sub-module is used for calculating a decoding objective function based on the decoding paths of the voice recognition models, finding a decoding path which maximizes the decoding objective function, and taking the decoding path as a recognition result of the audio data.
In one embodiment of the present disclosure, the construction submodule includes:
The determining unit is used for determining elements in the acoustic tag system corresponding to each voice recognition model;
and the construction unit is used for constructing a decoding path of the voice recognition model by taking the element as a prefix and based on the target probability distribution corresponding to the voice recognition model.
In one embodiment of the present disclosure, the construction unit is for:
selecting a candidate result of a next element according to target probability distribution of the next element by taking a first element in an acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path for decoding by the prefix and the candidate result;
and so on: in each decoding step, the previous decoding path is taken as the current prefix and combined with the next element to construct the current decoding path, until the complete decoding path is obtained.
In one embodiment of the disclosure, the construction unit is specifically configured to select the candidate result of the next element as follows:
sorting the recognition results of the next element from high to low according to the target probability distribution, and selecting a specified number of the highest-ranked target probabilities;
and taking the identification result corresponding to the selected target probability distribution as a candidate result of the next element.
In one embodiment of the present disclosure, the computing submodule includes:
a first calculation unit configured to calculate prefix scores corresponding to the respective voice recognition models based on decoding paths of the respective voice recognition models;
and the second calculation unit is used for multiplying the prefix scores of the voice recognition models with the set corresponding weights, and then summing all the obtained products to obtain a decoding objective function.
In one embodiment of the present disclosure, the first computing unit is configured to:
and calculating the selection rate of the decoding paths of each voice recognition model, summing the obtained selection rates of all decoding paths, and taking the logarithm to obtain the prefix score of the voice recognition model.
In one embodiment of the present disclosure, the computing submodule is to:
uniformly converting decoding paths of each voice recognition model in the decoding objective function into decoding paths of designated levels by adopting a preset sparse matrix, wherein the sparse matrix is a mapping relation among acoustic tag system elements of different levels and is used for converting acoustic tag system elements of one level into acoustic tag system elements of another level;
And finding a decoding path which maximizes the decoding objective function, and taking the decoding path as a recognition result of the audio data at the designated level.
In one embodiment of the present disclosure, the specified level is a text level, syllable level, phoneme level, or phoneme level with contextual background information.
In one embodiment of the present disclosure, the apparatus further comprises at least one of:
the first ending module is used for ending decoding if the statement ending mark is detected during decoding;
the second ending module is used for ending decoding if the mute mark is detected and exceeds the appointed duration during decoding;
and the third ending module is used for ending decoding if the current state accords with the appointed ending state during decoding.
In one embodiment of the present disclosure, the extraction module is configured to:
sampling audio data to be identified according to windows and intervals of specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating energy of a Mel space according to the result of the discrete Fourier transform;
and performing discrete cosine transform on the energy of the Mel space after filtering to obtain Mel frequency cepstrum coefficients, and taking the Mel frequency cepstrum coefficients as acoustic features corresponding to the audio data.
In one embodiment of the present disclosure, the apparatus further comprises:
and the training module is used for training the plurality of voice recognition models by using voice training data and an initial label, wherein the initial label is the text of the coarsest level corresponding to the voice training data.
In one embodiment of the present disclosure, the training module is configured to:
inputting voice training data into the voice recognition models to respectively obtain corresponding acoustic labels;
calculating a cost function for each obtained acoustic tag, setting corresponding weights, multiplying the cost function of each acoustic tag by the corresponding weights, and summing to obtain a total cost function;
and training the plurality of voice recognition models according to the initial label by taking the minimum total cost function as a target.
In a third aspect of the disclosed embodiments, a computer-readable medium is provided, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the above-described speech recognition method.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which processor implements the steps of the speech recognition method when the program is executed.
According to the speech recognition method and apparatus of the present disclosure, the audio data to be recognized is subjected to feature extraction and then input into a plurality of pre-trained speech recognition models to obtain a plurality of corresponding target probability distributions, which are then fusion-decoded to obtain the recognition result. Because the recognition result is obtained by fusion decoding over the outputs of multiple speech recognition models, and because those models correspond to acoustic tag systems of multiple levels, the audio data can be recognized under acoustic tag systems of several levels. This covers more application scenarios, such as rare words, homophones and other cases of unevenly distributed data, and the fusion decoding further improves the accuracy of Chinese speech recognition and the robustness of the models.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIG. 1 schematically illustrates a first flowchart of a speech recognition method implementation according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a second flowchart of a speech recognition method implementation according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a sparse matrix schematic in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a test flow diagram according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a model training implementation flow diagram in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a training flow diagram according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of training and testing according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a media schematic for a speech recognition method according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a schematic structure of a speech recognition apparatus according to an embodiment of the present disclosure;
fig. 10 schematically illustrates a structural schematic diagram of a computing device according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present disclosure may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms, namely: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the disclosure, a voice recognition method, medium, device and computing equipment are provided.
Any number of elements in the figures are for illustration and not limitation, and any naming is used for distinction only, and not for any limiting sense.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments thereof.
Summary of The Invention
The present disclosure observes that, in existing Chinese speech recognition technology, the data augmentation approach is limited by speech data sets that are not comprehensive enough, so some content still cannot be recognized correctly; and when two models adopt different modeling units, the mapping space in the multi-model secondary-scoring approach may grow explosively, making it hard to implement in a real production environment.
In view of this, the present disclosure provides a speech recognition method and apparatus: features are extracted from the audio data to be recognized to obtain the acoustic features corresponding to the audio data; the acoustic features are input into a plurality of pre-trained speech recognition models to obtain a plurality of target probability distributions, one for each model; and the plurality of target probability distributions are fusion-decoded to obtain the recognition result of the audio data. Because the recognition result is obtained by fusion decoding over the outputs of multiple speech recognition models, and because those models correspond to acoustic tag systems of multiple levels, the audio data can be recognized under acoustic tag systems of several levels. This covers more application scenarios, such as rare words, homophones and other cases of unevenly distributed data, and the fusion decoding further improves the accuracy of Chinese speech recognition and the robustness of the models.
Having described the basic principles of the present disclosure, various non-limiting embodiments of the present disclosure are specifically described below.
Application scene overview
The voice recognition method and the voice recognition device can be applied to the scenes of Chinese voice recognition. The Chinese voice recognition has wide application scenes including various scenes such as real-time voice input, intelligent voice customer service, robot dialogue, real-time conference recording, lecture subtitle same screen, classroom audio recognition and the like. The recognition of special audio data such as the rarely used word or homophone can occur in various application scenes, and the technical scheme provided by the disclosure can be used for recognizing the audio data under a plurality of levels of acoustic tag systems through a plurality of voice recognition models, so that more application scenes can be covered, including recognition scenes of the special audio data such as the rarely used word or homophone, and the accuracy of Chinese voice recognition and the robustness of the models are further improved based on fusion decoding.
Exemplary method
A voice recognition method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 1. As shown in fig. 1, the voice recognition method of the embodiment of the present disclosure includes the steps of:
S11: extracting features of the audio data to be identified to obtain acoustic features corresponding to the audio data;
s12: inputting acoustic features into a plurality of pre-trained speech recognition models to obtain a plurality of target probability distributions corresponding to the respective speech recognition models, respectively;
wherein each target probability distribution characterizes the matching degree between each acoustic tag and the acoustic features under the acoustic tag system of the corresponding level;
s13: and carrying out fusion decoding on the multiple target probability distributions to obtain the recognition result of the audio data.
Through the above process, the embodiment of the disclosure can identify the audio data under the acoustic tag system of a plurality of levels through a plurality of voice recognition models, can cover more application scenes, including recognition scenes with uneven data distribution such as rarely used words or homophones, and further improves the accuracy of Chinese voice recognition and the robustness of the models based on fusion decoding.
The acoustic tag systems of multiple levels involved in embodiments of the present disclosure may include at least two of: a text (word) level acoustic tag system, a syllable level acoustic tag system, a phoneme (phone) level acoustic tag system, and a phoneme level acoustic tag system with contextual information (context-dependent phone).
A word-level acoustic label contains both acoustic information and linguistic information; the same pronunciation can correspond to different Chinese characters in different contexts. A syllable-level acoustic label contains purely acoustic information, so the syllable-level labels for identical pronunciations are exactly the same; for example, the words meaning "is" and "city" in the audio data both correspond to the syllable-level acoustic label "shi4". A phoneme-level acoustic label carries finer-grained acoustic information corresponding to shorter acoustic feature sequences, such as an initial or a final. A phoneme-level acoustic label with contextual information additionally carries the phoneme's context, has an even finer granularity, and can reflect the acoustic characteristics of the audio data in more detail.
For example, if the audio data is "the classmates have class", the acoustic label under the text-level acoustic tag system is the text "the classmates have class", the acoustic label under the syllable-level acoustic tag system is the syllable sequence "tong2 xue2 men5 shang4 ke4 le5", and the acoustic label under the phoneme-level acoustic tag system is the phoneme sequence "t ong2 x ue2 m en5 sh ang4 k e4 l e5", so the mapping relationships between the audio data and the acoustic labels of different levels can be learned by the multiple speech recognition models. When the pronunciation "tong2" (the first character, meaning "same") is detected in the audio data, the three speech recognition models output target probability distributions favouring the corresponding text, the syllable "tong2" and the phoneme "t ong2" respectively, so fusion decoding can be performed over the three target probability distributions to obtain the recognition result of the audio data.
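As an illustration only (not taken from the original disclosure), the relationship between one utterance and its labels under several acoustic tag systems can be sketched as a small Python mapping; the Chinese characters are an assumed rendering of the glossed sentence "the classmates have class".

```python
# Illustrative only: multi-level acoustic labels for one utterance.
# The Chinese text is an assumed rendering of the glossed sentence.
utterance_labels = {
    "text":     "同学们上课了",
    "syllable": "tong2 xue2 men5 shang4 ke4 le5".split(),
    "phoneme":  "t ong2 x ue2 m en5 sh ang4 k e4 l e5".split(),
}

# Each speech recognition model is trained against one label level, so at
# recognition time each model emits a probability distribution over the
# elements of its own tag system.
for level, labels in utterance_labels.items():
    print(level, labels)
```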
In one possible embodiment, S11 may include:
sampling audio data to be identified according to windows and intervals of specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating to obtain the energy of the Mel space according to the result of the discrete Fourier transform;
the energy of the Mel space is filtered and then subjected to a discrete cosine transform to obtain Mel Frequency Cepstrum Coefficients (MFCCs), which are used as the acoustic features corresponding to the audio data.
The process of feature extraction is described below with a specific example. For example, a window with a duration of 25ms is specified, an interval of 10ms, and a feature extraction process is performed on audio data to be identified as follows:
1) The audio data to be recognized is cut into 25 ms windows; taking audio with a 16000 Hz sampling rate as an example, each window contains 0.025 × 16000 = 400 sampling points. The interval between windows is 10 ms, so adjacent windows are allowed to overlap;
2) A Discrete Fourier Transform (DFT) is performed on the sampling points within each window:

S_i(k) = Σ_{n=1}^{N} s_i(n) · h(n) · e^(−j2πkn/N),  1 ≤ k ≤ K

where s_i(n) is the n-th sample in the i-th window, S_i(k) is the DFT result, h(n) is a Hamming window of length N, and K is the length of the DFT.
3) The energy of the Mel space is calculated from the DFT result, using the periodogram-based power estimate:

P_i(k) = |S_i(k)|² / N

where S_i(k) is the DFT result and P_i(k) is the energy of the Mel space;
4) The energy in mel space is filtered and then subjected to Discrete Cosine Transform (DCT) to obtain Mel Frequency Cepstrum Coefficients (MFCCs), which are acoustic features corresponding to the audio data.
The extraction mode of the MFCC acoustic features is based on Hamming window sampling and DCT calculation, so that acoustic feature extraction is effectively realized, extraction accuracy is improved, and powerful data support is provided for voice recognition.
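A minimal numpy/scipy sketch of steps 1) to 4) above. The number of Mel filters, the FFT length and the number of cepstral coefficients are illustrative assumptions, not values specified by the disclosure.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=16000, win_dur=0.025, hop_dur=0.010,
         n_fft=512, n_mels=26, n_ceps=13):
    """Sketch: frame -> Hamming window -> DFT power spectrum ->
    Mel filterbank energies -> log -> DCT (MFCC features)."""
    win_len, hop_len = int(win_dur * sample_rate), int(hop_dur * sample_rate)
    # 1) Cut the signal into overlapping 25 ms windows spaced 10 ms apart.
    n_frames = 1 + max(0, (len(signal) - win_len) // hop_len)
    frames = np.stack([signal[i * hop_len: i * hop_len + win_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(win_len)
    # 2) DFT of each window, then the periodogram power estimate P_i(k).
    spectrum = np.fft.rfft(frames, n=n_fft)
    power = (np.abs(spectrum) ** 2) / win_len
    # 3) Mel-space energies: triangular filters on the Mel-warped axis.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = np.maximum(power @ fbank.T, np.finfo(float).eps)
    # 4) DCT of the log filterbank energies gives the MFCC acoustic features.
    return dct(np.log(energies), type=2, axis=1, norm="ortho")[:, :n_ceps]

features = mfcc(np.random.randn(16000))  # 1 s of dummy audio -> (frames, 13)
```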
In one possible embodiment, the above method further comprises at least one of:
if the statement end mark is detected during decoding, ending decoding;
if the mute mark is detected and exceeds the appointed time length during decoding, ending decoding;
and if the current state accords with the appointed ending state during decoding, ending decoding.
These various ways of ending decoding enrich the means of controlling when decoding terminates; in practical applications the specific implementation can be chosen flexibly as needed, which is convenient and fast.
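A small sketch of how the three end-of-decoding conditions might be checked; the token names and the silence threshold are assumptions for illustration, not values from the disclosure.

```python
# Illustrative helper for the three end-of-decoding conditions described above.
END_TOKEN, SILENCE_TOKEN = "<eos>", "<sil>"
MAX_SILENCE_FRAMES = 50  # assumed "specified duration" of detected silence

def should_stop(decoded_tokens, silence_run, state, end_states):
    if decoded_tokens and decoded_tokens[-1] == END_TOKEN:  # sentence-end mark
        return True
    if silence_run > MAX_SILENCE_FRAMES:                    # silence too long
        return True
    return state in end_states                              # designated end state
```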
In one possible embodiment, the method may further include:
and training the plurality of voice recognition models by using voice training data and an initial label, wherein the initial label is the text of the coarsest level corresponding to the voice training data.
According to the method for training the multiple voice recognition models based on the voice training data and the initial labels, the voice recognition models can be trained on the basis of the voice training data corresponding to the coarsest level text, so that the voice recognition models can learn the mapping relation between the audio data and the acoustic labels more easily, convergence of model training is accelerated, and training speed is improved.
In embodiments of the present disclosure, among the four levels (text level, syllable level, phoneme level, and phoneme level with contextual information), the coarsest level is the text level and the finest level is the phoneme level with contextual information. For example, if the audio training data is "the classmates have class", the corresponding initial label is the coarsest-level text "the classmates have class".
Fig. 2 schematically illustrates a flow diagram of a speech recognition method implementation according to an embodiment of the present disclosure. As shown in fig. 2, the voice recognition method of the embodiment of the present disclosure includes the steps of:
s21: extracting features of the audio data to be identified to obtain acoustic features corresponding to the audio data;
s22: inputting acoustic features into a plurality of pre-trained speech recognition models to obtain a plurality of target probability distributions corresponding to the respective speech recognition models, respectively;
The target probability distribution represents the matching degree between each acoustic tag and the acoustic feature under the acoustic tag system of the level;
In embodiments of the present disclosure, the plurality of speech recognition models described above may be implemented with a variety of structures, for example an HMM-GMM (Hidden Markov Model / Gaussian Mixture Model) structure. Such a structure can dynamically obtain the most probable hidden state sequence through the HMM's forward and backward algorithm, and thereby the acoustic label corresponding to the audio data. For example, if the audio data is "Shanghai is an international metropolis", a speech recognition model of this structure can produce the syllable-level acoustic label "shang4 hai3 shi4 guo2 ji4 da4 du1 shi4" and the phoneme-level acoustic label "sh ang4 h ai3 sh i4 g uo2 j i4 d a4 d u1 sh i4".
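For illustration, the most probable hidden state sequence of such an HMM is commonly obtained with the Viterbi dynamic-programming recursion (closely related to the forward-backward computation mentioned above); a minimal sketch under assumed toy parameters, not the patent's actual HMM-GMM implementation:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """log_pi: (S,) initial state log-probs; log_A: (S, S) transition log-probs;
    log_B: (T, S) per-frame state log-likelihoods (e.g. from a GMM).
    Returns the most probable hidden state sequence."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (previous state, current state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```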
S23: determining the elements in the acoustic tag system corresponding to each speech recognition model; taking the first element in the acoustic tag system corresponding to the speech recognition model as a prefix, selecting candidate results for the next element according to the target probability distribution of the next element, and constructing the current decoding path from the prefix and the candidate results for decoding; and so on, in each decoding step taking the previous decoding path as the current prefix and combining it with the next element to construct the current decoding path, until a complete decoding path is obtained;
In one possible implementation manner, the selecting the candidate result of the next element according to the target probability distribution of the next element may include:
the recognition results for the next element are sorted from high to low according to the target probability distribution, a specified number of the highest-ranked target probabilities are selected, and the recognition results corresponding to the selected target probabilities are taken as the candidate results of the next element.
In the embodiment of the disclosure, the decoding path represents the recognition process of the acoustic feature by the corresponding speech recognition model, and the decoding path obtained after the recognition is completed represents the acoustic tag recognized by the acoustic feature.
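A minimal sketch of the prefix-wise construction of decoding paths described in S23: each current prefix is extended with the top-ranked candidates of the next element. The beam width and data layout are assumptions for illustration.

```python
import numpy as np

def extend_prefixes(prefixes, next_probs, beam_width=5):
    """prefixes: list of (path, log_score); next_probs: target probability
    distribution of the next element over the tag-system vocabulary.
    Returns the extended decoding paths kept for the next step."""
    # Candidate results: the specified number of highest-probability elements.
    candidates = np.argsort(next_probs)[::-1][:beam_width]
    extended = []
    for path, log_score in prefixes:
        for v in candidates:
            extended.append((path + [int(v)],
                             log_score + np.log(next_probs[v] + 1e-12)))
    # The best-scoring paths become the prefixes of the next decoding step.
    extended.sort(key=lambda p: p[1], reverse=True)
    return extended[:beam_width]
```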
S24: calculating prefix scores corresponding to the respective speech recognition models based on the decoding paths of the respective speech recognition models;
in one possible implementation, S24 may specifically include:
and calculating the selection rate of the decoding paths of each voice recognition model, summing the obtained selection rates of all decoding paths, and taking the logarithm to obtain the prefix score of the voice recognition model.
The prefix score of the speech recognition model can be expressed by the following formula:
α_m(g·v) = log Σ p_m(g·v | X)

where α_m(g·v) is the prefix score of the m-th speech recognition model, X is the audio data, g is the prefix, v is the candidate result of the next element, g·v is the current decoding path constructed from the prefix and the candidate result, the sum runs over the decoding paths, and p_m(g·v | X) is the selection rate of the current decoding path, which can be obtained directly from the output of the corresponding speech recognition model.
S25: multiplying the prefix scores of the voice recognition models with the set corresponding weights, and then summing all obtained products to obtain a decoding objective function;
the decoding objective function is used for obtaining an optimal decoding path from a plurality of decoding paths, and the decoding path which maximizes the decoding objective function is regarded as the optimal decoding path, so that the decoding path can be used as an identification result of the audio data. Specifically, the method can be expressed by the following formula:
F = λ_1·α_1 + λ_2·α_2 + λ_3·α_3 + …

where F is the decoding objective function, α_1, α_2, α_3, … are the prefix scores of the individual speech recognition models, and λ_1, λ_2, λ_3, … are the preset weights of the speech recognition models, satisfying λ_1 + λ_2 + λ_3 + … = 1.
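A sketch of the computations in S24 and S25 under assumed inputs: each model's prefix score is the logarithm of the summed selection rates of its decoding paths, and the scores are combined with preset weights λ that sum to 1. The numeric values below are illustrative only.

```python
import numpy as np

def prefix_score(path_selection_rates):
    """Prefix score of one model: log of the summed selection rates of its
    decoding paths (the rates come straight from the model's output)."""
    return float(np.log(np.sum(path_selection_rates) + 1e-12))

def decoding_objective(per_model_rates, weights):
    """Weighted sum of per-model prefix scores; `weights` are the preset
    lambda_1, lambda_2, ... and are assumed to sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return sum(w * prefix_score(r) for w, r in zip(weights, per_model_rates))

# Illustrative numbers: three models, each with a few candidate paths.
rates = [np.array([0.40, 0.10]), np.array([0.35, 0.20]), np.array([0.30, 0.25])]
score = decoding_objective(rates, weights=[0.5, 0.3, 0.2])
```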
S26: uniformly converting the decoding paths of each voice recognition model in the decoding objective function into decoding paths with a designated level by adopting a preset sparse matrix;
the sparse matrix is a mapping relation between acoustic tag system elements of different levels and is used for converting the acoustic tag system elements of one level into the acoustic tag system elements of another level. In specific application, a sparse matrix between any two levels of acoustic tag system elements can be set, so that the acoustic tag system elements at different levels can be conveniently converted. The decoding paths of different levels can be unified into the decoding paths of the appointed level through the conversion, so that the decoding path which enables the decoding objective function to be maximum can be found on the basis of the unified level, the identification result of the audio data is obtained, the operation efficiency is improved, and the identification result output of the appointed level is realized.
The specified level may be a text level, syllable level, phoneme level, or phoneme level with contextual background information. In general, the specified level is a level required for speech recognition output, and may be set according to actual needs, such as setting the specified level to a text level, or the like. If the plurality of speech recognition models have the speech recognition model corresponding to the designated level, the model is not required to be converted into the decoding path of the designated level, and the models of other levels are required to be converted into the decoding path of the designated level. For convenient implementation, an identity matrix can be introduced as a sparse matrix, the row and column elements of the sparse matrix are the same, only the diagonal value in the matrix is 1, and the rest value is 0, so that the element conversion of the acoustic tag system of the same level can be completed. Such as text level to text level conversion, syllable level to syllable level conversion, etc.
Through the above process, the embodiment of the disclosure can improve the accuracy of Chinese voice recognition, can be suitable for various scenes including scenes with uneven data distribution, such as rarely used words, multi-homophones and the like, and the output results of a plurality of levels of voice recognition models are explicitly considered in a fusion decoding mode, so that the problem of inaccurate recognition under the scenes with uneven data distribution can be solved to a great extent. By introducing a plurality of voice recognition models, the defect of inaccurate recognition of a single voice recognition model can be overcome, so that the effect of voice recognition is greatly improved, and the robustness of the model is greatly improved. For example, on the basis of a word-level speech recognition model, syllable-level and phoneme-level speech recognition models are added, so that recognition accuracy of rarely used words, homophones and other scenes can be greatly improved. In addition, the conversion of the decoding path is carried out based on the sparse matrix, so that the fusion of different levels of information can be rapidly realized during decoding, and the efficiency of voice recognition is further improved.
Fig. 3 schematically illustrates a sparse matrix according to an embodiment of the present disclosure. Referring to fig. 3, the sparse matrix is a mapping relationship between syllable-level acoustic tag system elements and text-level acoustic tag system elements. The rows of the sparse matrix represent syllables and the columns represent text; each value in the matrix is 0 or 1, where 0 means the row element and column element have no mapping relationship and 1 means they do. For example, the value in the first row and first column is 1, because the corresponding word "large" maps to the syllable "da4". A polyphonic character has two values of "1" in the sparse matrix, one for each of its two pronunciations, here "du1" and "dou1".
In the embodiment of the disclosure, the plurality of voice recognition models can be regarded as a network for voice recognition as a whole, the sparse matrix can also be regarded as the last layer of the network, and conversion among acoustic tag system elements of different levels can be completed by performing linear matrix operation through the sparse matrix, so that the operation efficiency in actual use is improved.
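A sketch of the sparse-matrix level conversion described above, using a made-up toy vocabulary (the syllables and characters here are assumptions for illustration): a 0/1 matrix maps syllable-level elements to text-level elements, so converting scores between levels is a single sparse matrix operation, acting like the last layer of the recognition network.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up toy vocabularies for illustration only.
syllables = ["da4", "du1", "dou1", "shi4"]
characters = ["大", "都", "市", "是"]

# 1 marks "this syllable can realize this character"; the polyphone 都
# gets a 1 in both of its pronunciation rows (du1 and dou1).
mapping = np.array([[1, 0, 0, 0],    # da4  -> 大
                    [0, 1, 0, 0],    # du1  -> 都
                    [0, 1, 0, 0],    # dou1 -> 都
                    [0, 0, 1, 1]])   # shi4 -> 市 / 是
M = csr_matrix(mapping)              # sparse (syllable x character) matrix

# Converting a syllable-level score vector to the text level is a single
# sparse matrix-vector product.
syllable_scores = np.array([0.1, 0.6, 0.2, 0.1])
text_scores = M.T.dot(syllable_scores)   # scores over the character vocabulary
```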
S27: and finding a decoding path which maximizes the decoding objective function, and taking the decoding path as a recognition result of the audio data at the designated level.
Fig. 4 schematically illustrates a test flow diagram according to an embodiment of the present disclosure. Referring to fig. 4, the audio data to be recognized is subjected to feature extraction to obtain acoustic features, and the acoustic features are respectively input into 3 different levels of speech recognition models 1, 2 and 3 to obtain respective corresponding target probability distribution 1, 2 and 3. The target probability distribution of the three voice recognition models respectively corresponds to acoustic tag systems with different levels, and the matching degree between the acoustic tags and the acoustic features under the corresponding acoustic tag systems is respectively represented. The different levels may be set to any three of a text level, syllable level, phoneme level, or phoneme level with contextual background information as desired. Finally, fusion decoding is carried out on the three target probability distributions with different levels to obtain the recognition result of the audio data, so that Chinese voice recognition based on a plurality of voice recognition models is realized, more application scenes including recognition scenes with uneven data distribution such as rarely used words or homophones can be covered, and the accuracy of Chinese voice recognition and the robustness of the models are further improved based on fusion decoding.
FIG. 5 schematically illustrates a model training implementation flow diagram according to an embodiment of the present disclosure. As shown in fig. 5, the voice recognition method according to the embodiment of the present disclosure further includes the steps of:
S51: inputting the speech training data into the plurality of speech recognition models to obtain the corresponding acoustic labels respectively;
s52: calculating a cost function for each obtained acoustic tag and setting a corresponding weight;
s53: summing the cost function of each acoustic tag with the corresponding weight product to obtain a total cost function;
the total cost function may be expressed as follows:
L = α_1·l_1 + α_2·l_2 + α_3·l_3 + …

where l_1, l_2, l_3, … are the cost functions of the individual acoustic labels, and α_1, α_2, α_3, … are the weights of the cost functions, which can be set as required subject to α_1 + α_2 + α_3 + … = 1.
S54: and training a plurality of voice recognition models according to the initial label by taking the minimum total cost function as a target.
The initial label is the text of the coarsest level corresponding to the voice training data.
Through the above process, embodiments of the present disclosure can improve the accuracy of Chinese speech recognition and can be applied to various scenarios, including scenarios with unevenly distributed data such as rare words and homophones. Because the plurality of speech recognition models are trained in a multi-task manner, they can learn the correspondences between acoustic features and acoustic labels at different levels, which greatly improves the robustness of the models across different scenarios and can, to a large extent, alleviate cases that word-level acoustic labels alone cannot distinguish or cover, such as rare words and homophones.
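A sketch of the multi-task training objective of S51-S54 with assumed per-label cost values: the cost functions of the individual acoustic labels are combined with weights α_i that sum to 1, and the total cost L is minimized during training.

```python
def total_cost(per_label_costs, alphas):
    """Total cost L = alpha_1*l_1 + alpha_2*l_2 + ... with sum(alphas) == 1.
    per_label_costs are the cost-function values of the individual acoustic
    labels (one per tag-system level); alphas are their preset weights."""
    assert abs(sum(alphas) - 1.0) < 1e-6
    return sum(a * l for a, l in zip(alphas, per_label_costs))

# Illustrative values only: three label levels with assumed per-head losses.
L = total_cost(per_label_costs=[1.8, 2.3, 2.9], alphas=[0.5, 0.3, 0.2])
```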
Fig. 6 schematically illustrates a training flow diagram according to an embodiment of the present disclosure. Referring to fig. 6, feature extraction is performed on audio data to be trained to obtain acoustic features, and the acoustic features are respectively input into a plurality of voice recognition models such as a voice recognition model 1 and a voice recognition model 2 of different levels to obtain a plurality of acoustic labels such as an acoustic label 1 and an acoustic label 2 which are respectively corresponding to each other. Each voice recognition model corresponds to one level of acoustic tag system, and the obtained acoustic tags are acoustic tags under the corresponding level of acoustic tag system, so that a plurality of levels of acoustic tags can be obtained. The plurality of speech recognition models are then trained using the initial labels and the plurality of levels of acoustic labels, such that correspondence between acoustic features and the plurality of levels of acoustic labels can be learned. The training mode enriches the learning content of the model, enhances the learning capacity of the model, and enables the trained model to be suitable for wider application scenes.
Fig. 7 schematically illustrates a flow diagram of training and testing according to an embodiment of the present disclosure. Referring to fig. 7, the voice recognition method provided in the embodiment of the present disclosure includes two stages: a training phase and a testing phase. In the training stage, voice training data are input into a plurality of voice recognition models to obtain respective corresponding acoustic labels, and the voice recognition models respectively correspond to acoustic label systems of different levels. Training the plurality of voice recognition models based on the plurality of obtained acoustic labels, so as to learn the corresponding relation between the acoustic characteristics and the acoustic labels with different levels. In the test stage, the audio data to be recognized are input into each trained voice recognition model to obtain respective corresponding target probability distribution. The obtained target probability distributions respectively correspond to acoustic tag systems with different levels, and the matching degree between the acoustic tags and the acoustic features under the corresponding acoustic tag systems is respectively represented. Finally, fusion decoding is carried out on the target probability distributions with different levels to obtain the recognition result of the audio data, so that Chinese voice recognition based on a plurality of voice recognition models is realized, more application scenes can be covered, and the accuracy of Chinese voice recognition and the robustness of the models are further improved based on fusion decoding.
The above method provided by the embodiments of the present disclosure may be compared with a speech recognition method in the prior art, and the results may be shown in table 1 below. It can be seen that the voice recognition accuracy in the prior art is low, and more recognition errors exist, and the recognition accuracy of the embodiment of the disclosure is very high. Therefore, the voice recognition accuracy under the scene of uneven data distribution such as homophones can be well improved.
TABLE 1
Exemplary Medium
Having described the method of an exemplary embodiment of the present disclosure, next, a medium of an exemplary embodiment of the present disclosure will be described with reference to fig. 8.
In some possible implementations, aspects of the present disclosure may also be implemented as a computer-readable medium having a program stored thereon, which when executed by a processor is configured to implement the steps in the speech recognition method according to the various exemplary embodiments of the present disclosure described in the "exemplary methods" section of the present specification.
Specifically, the processor is configured to implement the following steps when executing the program:
extracting features of audio data to be identified to obtain acoustic features corresponding to the audio data; inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic tag systems, and the corresponding target probability distributions represent the matching degree between each acoustic tag and the acoustic feature under the level of acoustic tag systems; and carrying out fusion decoding on the target probability distributions to obtain the identification result of the audio data.
It should be noted that: the medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 8, a medium 80 is depicted that may employ a portable compact disc read only memory (CD-ROM) and that includes a program and that may run on a device, in accordance with an embodiment of the present disclosure. However, the disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take many forms, including, but not limited to: electromagnetic signals, optical signals, or any suitable combination of the preceding. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN).
Exemplary apparatus
Having described the media of the exemplary embodiments of the present disclosure, next, an apparatus of the exemplary embodiments of the present disclosure will be described with reference to fig. 9.
As shown in fig. 9, a voice recognition apparatus of an embodiment of the present disclosure may include:
an extracting module 901, configured to perform feature extraction on the audio data to be recognized to obtain acoustic features corresponding to the audio data;
a recognition module 902, configured to input the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the respective speech recognition models, where the target probability distributions correspond to a plurality of levels of acoustic tag systems, and each target probability distribution characterizes the degree of matching between the acoustic features and each acoustic tag under the corresponding level of the acoustic tag system;
and a fusion module 903, configured to perform fusion decoding on the plurality of target probability distributions to obtain the recognition result of the audio data.
In one possible embodiment, the plurality of levels of acoustic tag systems include at least two of: a text-level acoustic tag system, a syllable-level acoustic tag system, a phoneme-level acoustic tag system, and a phoneme-level tag system with contextual background information.
In one possible embodiment, the fusion module includes:
a construction submodule, configured to construct a decoding path for each voice recognition model according to the plurality of target probability distributions, where a decoding path represents the recognition process of the corresponding voice recognition model on the acoustic features, and the decoding path obtained after recognition is finished represents the acoustic tags recognized for the acoustic features;
and a computing sub-module, configured to compute a decoding objective function based on the decoding paths of the voice recognition models, find a decoding path that maximizes the decoding objective function, and take that decoding path as the recognition result of the audio data.
In one possible embodiment, the above construction submodule includes:
the determining unit is used for determining elements in the acoustic tag system corresponding to each voice recognition model;
And the construction unit is used for constructing a decoding path of the voice recognition model by taking the element as a prefix and based on the target probability distribution corresponding to the voice recognition model.
In one possible embodiment, the above construction unit is for:
taking the first element in the acoustic tag system corresponding to the voice recognition model as the prefix, selecting candidate results for the next element according to the target probability distribution of the next element, and constructing the current decoding path from the prefix and the candidate results for decoding;
and so on: in each decoding step, taking the previous decoding path as the current prefix and combining it with the next element to construct the current decoding path, until the complete decoding path is obtained.
In a possible implementation manner, the above construction unit is specifically configured to select the candidate result of the next element as follows:
sorting the recognition results of the next element by their target probability distributions from high to low, and selecting a specified number of the highest-ranked target probability distributions;
and taking the recognition results corresponding to the selected target probability distributions as the candidate results of the next element.
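A minimal sketch of the prefix-based path construction described above, assuming each model outputs a (steps x tags) probability matrix; the beam width and the log-probability scoring are illustrative choices, not values fixed by the disclosure.

```python
import numpy as np

def build_decoding_paths(probs: np.ndarray, beam_width: int = 3):
    """probs[t, v]: target probability of acoustic tag v at decoding step t.
    Each previous decoding path serves as the prefix for the next step; the
    highest-ranked tags of the next element are kept as candidate results."""
    paths = [([], 0.0)]                                        # (tag sequence, log score)
    for t in range(probs.shape[0]):
        candidates = np.argsort(probs[t])[::-1][:beam_width]   # top-ranked candidate tags
        new_paths = [(prefix + [int(tag)], score + np.log(probs[t, tag] + 1e-12))
                     for prefix, score in paths for tag in candidates]
        # Keep only the best paths as prefixes for the next decoding step.
        paths = sorted(new_paths, key=lambda p: p[1], reverse=True)[:beam_width]
    return paths

# Example: 2 decoding steps over a 4-tag vocabulary.
example = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.2, 0.5, 0.2, 0.1]])
print(build_decoding_paths(example, beam_width=2))
```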
In one possible implementation manner, the calculating submodule includes:
a first calculation unit, configured to calculate the prefix score corresponding to each voice recognition model based on the decoding paths of that voice recognition model;
and a second calculation unit, configured to multiply the prefix score of each voice recognition model by its corresponding set weight, and then sum all the obtained products to obtain the decoding objective function.
In a possible implementation manner, the first computing unit is configured to:
calculating the selection rate of each decoding path of the voice recognition model, summing the obtained selection rates of all decoding paths, and taking the logarithm to obtain the prefix score of that voice recognition model.
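A short sketch of the prefix score and decoding objective function just described; the per-model weights and the small epsilon guarding the logarithm are assumptions made for the sketch.

```python
import numpy as np

def prefix_score(path_selection_rates):
    """Sum the selection rates of one model's decoding paths and take the logarithm."""
    return float(np.log(np.sum(path_selection_rates) + 1e-12))

def decoding_objective(prefix_scores, weights):
    """Multiply each model's prefix score by its set weight and sum the products."""
    return float(sum(w * s for s, w in zip(prefix_scores, weights)))

# Example: three models (e.g. text, syllable and phoneme level).
scores = [prefix_score([0.4, 0.3]), prefix_score([0.5, 0.2]), prefix_score([0.6, 0.1])]
print(decoding_objective(scores, weights=[0.5, 0.3, 0.2]))
```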
In a possible implementation manner, the calculation submodule is used for:
uniformly converting the decoding paths of each voice recognition model in the decoding objective function into decoding paths of a designated level by using a preset sparse matrix, wherein the sparse matrix encodes the mapping relationship between acoustic tag system elements of different levels and is used to convert acoustic tag system elements of one level into acoustic tag system elements of another level;
and finding a decoding path which maximizes the decoding objective function, and taking the decoding path as a recognition result of the audio data at the designated level.
In one possible implementation, the specified level is a text level, a syllable level, a phoneme level, or a phoneme level with contextual background information.
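The level conversion can be pictured with a toy mapping matrix; the 4x2 phoneme-to-syllable mapping below is a made-up example (in practice such a matrix is large and mostly zero, hence stored as a sparse matrix).

```python
import numpy as np

# Hypothetical mapping between tag levels: rows index phoneme-level tags,
# columns index syllable-level tags; a 1 marks which syllable a phoneme maps to.
mapping = np.array([
    [1, 0],   # phoneme tag 0 -> syllable tag 0
    [1, 0],   # phoneme tag 1 -> syllable tag 0
    [0, 1],   # phoneme tag 2 -> syllable tag 1
    [0, 1],   # phoneme tag 3 -> syllable tag 1
])

def convert_path(path, mapping):
    """Convert each element of a decoding path to the designated level."""
    return [int(np.argmax(mapping[tag])) for tag in path]

print(convert_path([0, 2, 3], mapping))   # -> [0, 1, 1]
```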
In one possible embodiment, the apparatus further comprises at least one of:
the first ending module, configured to end decoding if a sentence end marker is detected during decoding;
the second ending module, configured to end decoding if a silence marker is detected and persists beyond a specified duration during decoding;
and the third ending module, configured to end decoding if the current state matches a specified ending state during decoding.
In one possible implementation manner, the extraction module is used for:
sampling the audio data to be recognized using windows and intervals of a specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating the Mel-space energy from the result of the discrete Fourier transform;
and filtering the Mel-space energy and then applying a discrete cosine transform to obtain Mel-frequency cepstral coefficients, which are used as the acoustic features corresponding to the audio data.
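A rough numpy sketch of the extraction steps above. The 25 ms window, 10 ms hop, filter count and coefficient count are common defaults assumed for the sketch, not values specified by the disclosure.

```python
import numpy as np

def mfcc(signal, sample_rate=16000, win_len=0.025, hop=0.010, n_filters=26, n_ceps=13):
    win, step = int(win_len * sample_rate), int(hop * sample_rate)
    # Sample the audio with windows of the specified duration and interval.
    frames = np.array([signal[i:i + win] for i in range(0, len(signal) - win + 1, step)])
    # Discrete Fourier transform of the samples in each window (power spectrum).
    spectra = np.abs(np.fft.rfft(frames * np.hamming(win), axis=1)) ** 2

    # Mel-space energies via a triangular filterbank between 0 Hz and Nyquist.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((win + 1) * inv_mel(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, spectra.shape[1]))
    for i in range(1, n_filters + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0.0, 1.0, bins[i] - bins[i - 1])
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1.0, 0.0, bins[i + 1] - bins[i])
    energies = np.log(spectra @ fbank.T + 1e-12)

    # Discrete cosine transform of the filtered Mel energies -> MFCC features.
    n = energies.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[:, None] * np.arange(n_ceps)[None, :])
    return energies @ basis

print(mfcc(np.random.randn(16000)).shape)   # (number of frames, 13)
```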
In one possible embodiment, the apparatus further includes:
and the training module is used for training the plurality of voice recognition models by using voice training data and an initial label, wherein the initial label is the text of the coarsest level corresponding to the voice training data.
In one possible implementation manner, the training module is used for:
inputting the voice training data into the plurality of voice recognition models to respectively obtain corresponding acoustic tags;
calculating a cost function for each obtained acoustic tag, setting corresponding weights, multiplying the cost function of each acoustic tag by the corresponding weights, and summing to obtain a total cost function;
and training a plurality of voice recognition models according to the initial label by taking the minimum total cost function as a target.
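A compact sketch of the weighted multi-level training objective described above. The use of cross-entropy as each level's cost function and the PyTorch formulation are assumptions for the sketch; the disclosure only requires that each acoustic tag's cost is multiplied by its set weight and summed into a total cost to be minimized.

```python
import torch
import torch.nn as nn

def total_cost(level_logits: dict, level_targets: dict, level_weights: dict) -> torch.Tensor:
    """Weighted sum of per-level costs over the acoustic tag levels
    (e.g. 'text', 'syllable', 'phoneme'); training minimizes this total."""
    criterion = nn.CrossEntropyLoss()
    total = torch.zeros(())
    for level, logits in level_logits.items():
        total = total + level_weights[level] * criterion(logits, level_targets[level])
    return total

# Example with random logits for two levels and a batch of 4 frames.
logits = {"syllable": torch.randn(4, 100), "phoneme": torch.randn(4, 60)}
targets = {"syllable": torch.randint(0, 100, (4,)), "phoneme": torch.randint(0, 60, (4,))}
print(total_cost(logits, targets, {"syllable": 0.6, "phoneme": 0.4}))
```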
According to the apparatus provided by the embodiments of the present disclosure, features are extracted from the audio data to be recognized and then input into a plurality of pre-trained voice recognition models to obtain a plurality of corresponding target probability distributions, on which fusion decoding is performed to obtain the recognition result. Because the recognition result is obtained by fusion decoding over the outputs of a plurality of voice recognition models, and these models correspond to multiple levels of acoustic tag systems, the audio data can be recognized under multiple levels of acoustic tag systems. This covers more application scenarios, such as scenarios with unevenly distributed data involving rarely used words and homophones, and further improves the accuracy of Chinese speech recognition and the robustness of the models on the basis of fusion decoding.
Exemplary computing device
Having described the methods, media, and apparatus of exemplary embodiments of the present disclosure, a computing device of exemplary embodiments of the present disclosure is next described with reference to fig. 10.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
In some possible implementations, a computing device according to embodiments of the present disclosure may include at least one processing unit and at least one storage unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps in the speech recognition method according to the various exemplary embodiments of the present disclosure described in the section "exemplary method" above in the present specification.
A computing device 100 according to such an implementation of the present disclosure is described below with reference to fig. 10. The computing device 100 shown in fig. 10 is only one example and should not be taken as limiting the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 10, the computing device 100 is in the form of a general-purpose computing device. Components of computing device 100 may include, but are not limited to: the at least one processing unit 1001, the at least one storage unit 1002, and a bus 1003 that connects different system components (including the processing unit 1001 and the storage unit 1002).
Bus 1003 includes a data bus, a control bus, and an address bus.
The storage unit 1002 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 10021 and/or cache memory 10022, and may further include readable media in the form of non-volatile memory, such as Read Only Memory (ROM) 10023.
The storage unit 1002 may also include a program/utility 10025 having a set (at least one) of program modules 10024, such program modules 10024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Computing device 100 may also communicate with one or more external devices 1004 (e.g., keyboard, pointing device, etc.). Such communication may occur through an input/output (I/O) interface 1005. Moreover, computing device 100 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet via network adapter 1006. As shown in fig. 10, the network adapter 1006 communicates with other modules of the computing device 100 over the bus 1003. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in connection with computing device 100, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/sub-modules of the speech recognition device are mentioned in the above detailed description, such a division is only exemplary and not mandatory. Indeed, in accordance with embodiments of the present disclosure, the features and functionality of two or more units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Furthermore, although the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features in these aspects cannot be combined; this division is made for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (26)

1. A method of speech recognition, comprising:
extracting features of audio data to be identified to obtain acoustic features corresponding to the audio data;
inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic tag systems, and the corresponding target probability distributions represent the matching degree between each acoustic tag and the acoustic feature under the level of acoustic tag systems;
constructing decoding paths of each voice recognition model according to the target probability distributions, wherein the decoding paths represent the recognition process of the corresponding voice recognition models on acoustic features, and the decoding paths obtained after the recognition is finished represent acoustic tags recognized for the acoustic features;
calculating a decoding objective function based on the decoding paths of the respective speech recognition models;
uniformly converting decoding paths of each voice recognition model in the decoding objective function into decoding paths of designated levels by adopting a preset sparse matrix, wherein the sparse matrix is a mapping relation among acoustic tag system elements of different levels and is used for converting acoustic tag system elements of one level into acoustic tag system elements of another level;
And finding a decoding path which maximizes the decoding objective function, and taking the decoding path as a recognition result of the audio data at the designated level.
2. The method of claim 1, wherein the plurality of levels of acoustic tag systems comprise at least two of: a text-level acoustic tag system, a syllable-level acoustic tag system, a phoneme-level acoustic tag system, and a phoneme-level tag system with contextual background information.
3. The method of claim 1, wherein constructing a decoding path for each speech recognition model from the plurality of target probability distributions comprises:
and determining elements in the acoustic tag system corresponding to each voice recognition model, and constructing a decoding path of the voice recognition model by taking the elements as prefixes and based on target probability distribution corresponding to the voice recognition model.
4. A method according to claim 3, wherein constructing a decoding path of the speech recognition model based on the target probability distribution corresponding to the speech recognition model with elements as prefixes comprises:
selecting a candidate result of a next element according to target probability distribution of the next element by taking a first element in an acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path for decoding by the prefix and the candidate result;
and so on: in each decoding step, taking the previous decoding path as the current prefix and combining it with the next element to construct the current decoding path, until the complete decoding path is obtained.
5. The method of claim 4, wherein selecting the candidate result for the next element based on the target probability distribution for the next element comprises:
sorting the recognition results of the next element by their target probability distributions from high to low, and selecting a specified number of the highest-ranked target probability distributions;
and taking the recognition results corresponding to the selected target probability distributions as the candidate results of the next element.
6. The method of claim 1, wherein said calculating a decoding objective function based on the decoding paths of the respective speech recognition models comprises:
calculating prefix scores corresponding to the respective speech recognition models based on the decoding paths of the respective speech recognition models;
and multiplying the prefix scores of the voice recognition models with the set corresponding weights, and then summing all obtained products to obtain a decoding objective function.
7. The method of claim 6, wherein the calculating prefix scores for the respective speech recognition models based on the decoding paths of the respective speech recognition models comprises:
And calculating the selection rate of the decoding paths of each voice recognition model, summing the obtained selection rates of all decoding paths, and taking the logarithm to obtain the prefix score of the voice recognition model.
8. The method of claim 1, wherein the specified level is a text level, a syllable level, a phoneme level, or a phoneme level with contextual background information.
9. The method of claim 1, further comprising at least one of:
if the statement end mark is detected during decoding, ending decoding;
if the mute mark is detected and exceeds the appointed time length during decoding, ending decoding;
and if the current state accords with the appointed ending state during decoding, ending decoding.
10. The method of claim 1, wherein the feature extraction of the audio data to be identified to obtain acoustic features corresponding to the audio data comprises:
sampling audio data to be identified according to windows and intervals of specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating energy of a Mel space according to the result of the discrete Fourier transform;
And performing discrete cosine transform on the energy of the Mel space after filtering to obtain Mel frequency cepstrum coefficients, and taking the Mel frequency cepstrum coefficients as acoustic features corresponding to the audio data.
11. The method according to claim 1, wherein the method further comprises:
and training the plurality of voice recognition models by using voice training data and an initial label, wherein the initial label is the text of the coarsest level corresponding to the voice training data.
12. The method of claim 11, wherein training the plurality of speech recognition models using speech training data and an initial tag comprises:
inputting voice training data into the voice recognition models to respectively obtain corresponding acoustic labels;
calculating a cost function for each obtained acoustic tag, setting corresponding weights, multiplying the cost function of each acoustic tag by the corresponding weights, and summing to obtain a total cost function;
and training the plurality of voice recognition models according to the initial label by taking the minimum total cost function as a target.
13. A speech recognition apparatus, comprising:
The extraction module is used for extracting characteristics of the audio data to be identified so as to obtain acoustic characteristics corresponding to the audio data;
the recognition module is used for inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, the target probability distributions correspond to a plurality of levels of acoustic tag systems, and the corresponding target probability distributions represent the matching degree between each acoustic tag and the acoustic feature under the level of acoustic tag systems;
a fusion module, comprising:
the construction submodule is used for constructing decoding paths of each voice recognition model according to the target probability distributions, the decoding paths represent the recognition process of the corresponding voice recognition models on the acoustic features, and the decoding paths obtained after the recognition is finished represent the acoustic labels recognized for the acoustic features;
the computing sub-module is used for computing a decoding objective function based on the decoding paths of the voice recognition models, uniformly converting the decoding paths of the voice recognition models in the decoding objective function into decoding paths of designated levels by adopting a preset sparse matrix, wherein the sparse matrix is a mapping relation among acoustic tag system elements of different levels and is used for converting acoustic tag system elements of one level into acoustic tag system elements of another level; and finding a decoding path which maximizes the decoding objective function, and taking the decoding path as a recognition result of the audio data at the designated level.
14. The apparatus of claim 13, wherein the plurality of levels of acoustic tag systems comprise at least two of: a text-level acoustic tag system, a syllable-level acoustic tag system, a phoneme-level acoustic tag system, and a phoneme-level tag system with contextual background information.
15. The apparatus of claim 13, wherein the construction submodule comprises:
the determining unit is used for determining elements in the acoustic tag system corresponding to each voice recognition model;
and the construction unit is used for constructing a decoding path of the voice recognition model by taking the element as a prefix and based on the target probability distribution corresponding to the voice recognition model.
16. The apparatus of claim 15, wherein the construction unit is configured to:
selecting a candidate result of a next element according to target probability distribution of the next element by taking a first element in an acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path for decoding by the prefix and the candidate result;
and so on: in each decoding step, taking the previous decoding path as the current prefix and combining it with the next element to construct the current decoding path, until the complete decoding path is obtained.
17. The apparatus according to claim 16, wherein the construction unit is specifically configured to select the candidate result of the next element in the following manner:
sorting the recognition results of the next element by their target probability distributions from high to low, and selecting a specified number of the highest-ranked target probability distributions;
and taking the recognition results corresponding to the selected target probability distributions as the candidate results of the next element.
18. The apparatus of claim 13, wherein the computing submodule comprises:
a first calculation unit configured to calculate prefix scores corresponding to the respective voice recognition models based on decoding paths of the respective voice recognition models;
and the second calculation unit is used for multiplying the prefix scores of the voice recognition models with the set corresponding weights, and then summing all the obtained products to obtain a decoding objective function.
19. The apparatus of claim 18, wherein the first computing unit is configured to:
and calculating the selection rate of the decoding paths of each voice recognition model, summing the obtained selection rates of all decoding paths, and taking the logarithm to obtain the prefix score of the voice recognition model.
20. The apparatus of claim 13, wherein the specified level is a text level, a syllable level, a phoneme level, or a phoneme level with contextual background information.
21. The apparatus of claim 13, wherein the apparatus further comprises at least one of:
the first ending module is used for ending decoding if the statement ending mark is detected during decoding;
the second ending module is used for ending decoding if the mute mark is detected and exceeds the appointed duration during decoding;
and the third ending module is used for ending decoding if the current state accords with the appointed ending state during decoding.
22. The apparatus of claim 13, wherein the extraction module is configured to:
sampling audio data to be identified according to windows and intervals of specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating energy of a Mel space according to the result of the discrete Fourier transform;
and performing discrete cosine transform on the energy of the Mel space after filtering to obtain Mel frequency cepstrum coefficients, and taking the Mel frequency cepstrum coefficients as acoustic features corresponding to the audio data.
23. The apparatus of claim 13, wherein the apparatus further comprises:
and the training module is used for training the plurality of voice recognition models by using voice training data and an initial label, wherein the initial label is the text of the coarsest level corresponding to the voice training data.
24. The apparatus of claim 23, wherein the training module is configured to:
inputting voice training data into the voice recognition models to respectively obtain corresponding acoustic labels;
calculating a cost function for each obtained acoustic tag, setting corresponding weights, multiplying the cost function of each acoustic tag by the corresponding weights, and summing to obtain a total cost function;
and training the plurality of voice recognition models according to the initial label by taking the minimum total cost function as a target.
25. A medium storing a computer program, which when executed by a processor performs the method of any one of claims 1-12.
26. A computing device, comprising:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-12.
CN202110698074.7A 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment Active CN113327597B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110698074.7A CN113327597B (en) 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment

Publications (2)

Publication Number Publication Date
CN113327597A CN113327597A (en) 2021-08-31
CN113327597B true CN113327597B (en) 2023-08-22

Family

ID=77424406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110698074.7A Active CN113327597B (en) 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment

Country Status (1)

Country Link
CN (1) CN113327597B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539273B (en) * 2021-09-16 2021-12-10 腾讯科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium
CN113948085B (en) * 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013002674A1 (en) * 2011-06-30 2013-01-03 Kocharov Daniil Aleksandrovich Speech recognition system and method
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
CN103700369A (en) * 2013-11-26 2014-04-02 安徽科大讯飞信息科技股份有限公司 Voice navigation method and system
CN105161092A (en) * 2015-09-17 2015-12-16 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN108510990A (en) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Audio recognition method, device, user equipment and storage medium
CN110534095A (en) * 2019-08-22 2019-12-03 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN111508497A (en) * 2019-01-30 2020-08-07 北京猎户星空科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN112652306A (en) * 2020-12-29 2021-04-13 珠海市杰理科技股份有限公司 Voice wake-up method and device, computer equipment and storage medium
CN112802461A (en) * 2020-12-30 2021-05-14 深圳追一科技有限公司 Speech recognition method and device, server, computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11244696B2 (en) * 2019-11-06 2022-02-08 Microsoft Technology Licensing, Llc Audio-visual speech enhancement

Also Published As

Publication number Publication date
CN113327597A (en) 2021-08-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant