CN113327597A - Speech recognition method, medium, device and computing equipment - Google Patents

Speech recognition method, medium, device and computing equipment

Info

Publication number
CN113327597A
CN113327597A (application CN202110698074.7A)
Authority
CN
China
Prior art keywords
acoustic
decoding
level
voice recognition
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110698074.7A
Other languages
Chinese (zh)
Other versions
CN113327597B (en)
Inventor
杨震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202110698074.7A
Publication of CN113327597A
Application granted
Publication of CN113327597B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/01: Assessment or evaluation of speech recognition systems
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G10L 15/26: Speech to text systems
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Abstract

The embodiments of the present disclosure provide a speech recognition method, medium, apparatus and computing device. The method includes: performing feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data; inputting the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models; and performing fusion decoding on the target probability distributions to obtain a recognition result of the audio data. The embodiments of the present disclosure can improve the accuracy of Chinese speech recognition, are applicable to a variety of scenes, and improve the robustness of the model.

Description

Speech recognition method, medium, device and computing equipment
Technical Field
Embodiments of the present disclosure relate to the field of speech recognition technologies, and in particular, to a method, medium, apparatus, and computing device for speech recognition based on multiple models.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be relevant prior art by inclusion in this section.
At present, Chinese speech recognition models often produce inaccurate results in certain specific scenes, for example when the speech segment to be recognized contains many homophones or rare Chinese characters, or when the training data are unevenly distributed. For example, in voice data the word for "bath" (zao3) is easily misrecognized as the homophonous word "early" (also zao3): the two characters are homophones, "early" appears far more frequently in the training data than "bath", and this unbalanced distribution makes the recognition result insufficiently accurate.
The related art offers two ways to address this problem: data augmentation and multi-model second-pass scoring (rescoring). Data augmentation applies operations such as volume perturbation, speaking-rate adjustment and spectrum masking to voice data that is prone to recognition errors, so as to increase the diversity of the data. Multi-model second-pass scoring trains two models simultaneously in a multi-task learning manner; in the recognition stage, one model generates several most probable candidate text sequences, and the other model then rescores these candidates so that the most probable text sequence is selected as the recognition result.
However, the first approach only slightly perturbs the original audio data; the corresponding text labels are unchanged and no new label information is introduced, so the improvement of the recognition rate in such scenes is limited, and the approach cannot solve the problem of rare words that simply do not appear in the voice data set. The second approach places high demands on the accuracy of the model that generates the candidate text sequences: if none of the generated candidates is correct, rescoring cannot improve recognition accuracy. Moreover, when the two models use different modeling units, such as a Chinese-character modeling unit and a pinyin modeling unit, the mapping space between the different modeling units grows explosively during rescoring, the amount of computation is huge, and the approach is difficult to implement in an actual production environment.
Disclosure of Invention
The present disclosure is intended to provide a speech recognition method and apparatus.
In a first aspect of embodiments of the present disclosure, there is provided a speech recognition method, including:
performing feature extraction on audio data to be identified to obtain acoustic features corresponding to the audio data;
inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic features under the acoustic label system of the level;
and performing fusion decoding on the target probability distributions to obtain an identification result of the audio data.
In one embodiment of the present disclosure, the multiple levels of acoustic tagging architecture include at least two of: a text-level acoustic tagging scheme, a syllable-level acoustic tagging scheme, a phone-level acoustic tagging scheme, and a phone-level tagging scheme with contextual background information.
In an embodiment of the present disclosure, the performing fusion decoding on the plurality of target probability distributions to obtain an identification result of the audio data includes:
constructing a decoding path of each voice recognition model according to the target probability distributions, wherein the decoding path represents the recognition process of the corresponding voice recognition model on the acoustic characteristics, and the decoding path obtained after the recognition represents the acoustic label recognized aiming at the acoustic characteristics;
and calculating a decoding objective function based on the decoding paths of the voice recognition models, finding a decoding path which enables the decoding objective function to be maximum, and taking the decoding path as the recognition result of the audio data.
In one embodiment of the present disclosure, the constructing a decoding path of each speech recognition model according to the plurality of target probability distributions includes:
determining elements in an acoustic label system corresponding to each voice recognition model, and constructing a decoding path of the voice recognition model by taking the elements as prefixes and based on target probability distribution corresponding to the voice recognition model.
In an embodiment of the present disclosure, constructing a decoding path of the speech recognition model by using the element as a prefix and based on a target probability distribution corresponding to the speech recognition model includes:
selecting a candidate result of a next element according to the target probability distribution of the next element by taking a first element in an acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path for decoding by the prefix and the candidate result;
and by analogy, in each decoding, the last decoding path is used as the current prefix, and the current decoding path is constructed by combining the next element until the complete decoding path is obtained.
In an embodiment of the disclosure, the selecting the candidate result of the next element according to the target probability distribution of the next element includes:
sorting the recognition results of the next element from high to low according to the target probability distribution, and selecting the target probability distribution with the designated number sorted in the front;
and taking the recognition result corresponding to the selected target probability distribution as the candidate result of the next element.
In an embodiment of the present disclosure, the calculating a decoding objective function based on the decoding paths of the respective speech recognition models includes:
calculating a prefix score corresponding to the respective speech recognition model based on the decoding path of the respective speech recognition model;
and multiplying the prefix scores of the voice recognition models by the set corresponding weights, and then summing all the obtained products to obtain a decoding objective function.
In one embodiment of the present disclosure, the calculating a prefix score corresponding to each of the speech recognition models based on the decoding path of each of the speech recognition models includes:
and for each voice recognition model, calculating the selection rate of the decoding path of the voice recognition model, summing the obtained selection rates of all the decoding paths, and then taking the logarithm to obtain the prefix score of the voice recognition model.
In one embodiment of the present disclosure, the finding a decoding path that maximizes the decoding objective function, as a result of the identifying the audio data, includes:
uniformly converting decoding paths of all voice recognition models in the decoding objective function into decoding paths at a specified level by adopting a preset sparse matrix, wherein the sparse matrix is a mapping relation between acoustic tag system elements at different levels and is used for converting the acoustic tag system elements at one level into the acoustic tag system elements at another level;
and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
In one embodiment of the present disclosure, the designated level is a text level, a syllable level, a phone level, or a phone level with contextual background information.
In one embodiment of the present disclosure, the method further comprises at least one of:
if a sentence ending mark is detected during decoding, ending the decoding;
if the specified time length is exceeded after the mute sign is detected during decoding, the decoding is finished;
and if the current state accords with the specified ending state during decoding, ending the decoding.
In an embodiment of the present disclosure, the performing feature extraction on the audio data to be recognized to obtain an acoustic feature corresponding to the audio data includes:
sampling audio data to be identified according to a window and an interval with specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating the energy of the Mel space according to the result of the discrete Fourier transform;
and performing discrete cosine transform after filtering the energy of the Mel space to obtain Mel frequency cepstrum coefficients, and taking the Mel frequency cepstrum coefficients as acoustic features corresponding to the audio data.
In one embodiment of the present disclosure, the method further comprises:
and training the plurality of voice recognition models by using voice training data and initial labels, wherein the initial labels are texts at the coarsest level corresponding to the voice training data.
In one embodiment of the present disclosure, the training the plurality of speech recognition models using speech training data and initial labels includes:
inputting voice training data into the voice recognition models to respectively obtain corresponding acoustic labels;
calculating a cost function for each obtained acoustic label, setting corresponding weight, and summing the products of the cost function of each acoustic label and the corresponding weight to obtain a total cost function;
and training the plurality of voice recognition models by taking the minimum total cost function as a target according to the initial label.
In a second aspect of embodiments of the present disclosure, there is provided a speech recognition apparatus comprising:
the extraction module is used for extracting the characteristics of the audio data to be identified so as to obtain the acoustic characteristics corresponding to the audio data;
the recognition module is used for inputting the acoustic features into a plurality of pre-trained voice recognition models so as to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic features under the acoustic label system of the level;
and the fusion module is used for performing fusion decoding on the target probability distributions to obtain the identification result of the audio data.
In one embodiment of the present disclosure, the multiple levels of acoustic tagging architecture include at least two of: a text-level acoustic tagging scheme, a syllable-level acoustic tagging scheme, a phone-level acoustic tagging scheme, and a phone-level tagging scheme with contextual background information.
In one embodiment of the present disclosure, the fusion module includes:
the construction submodule is used for constructing a decoding path of each voice recognition model according to the target probability distributions, the decoding path represents the recognition process of the corresponding voice recognition model on the acoustic features, and the decoding path obtained after the recognition is finished represents the acoustic label recognized aiming at the acoustic features;
and the calculation submodule is used for calculating a decoding objective function based on the decoding path of each voice recognition model, finding out the decoding path which enables the decoding objective function to be maximum, and taking the decoding path as the recognition result of the audio data.
In one embodiment of the disclosure, the construction submodule includes:
the determining unit is used for determining elements in the acoustic label system corresponding to each voice recognition model;
and the construction unit is used for constructing a decoding path of the voice recognition model by taking the element as a prefix and based on the target probability distribution corresponding to the voice recognition model.
In one embodiment of the present disclosure, the construction unit is configured to:
selecting a candidate result of a next element according to the target probability distribution of the next element by taking a first element in an acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path for decoding by the prefix and the candidate result;
and by analogy, in each decoding, the last decoding path is used as the current prefix, and the current decoding path is constructed by combining the next element until the complete decoding path is obtained.
In an embodiment of the disclosure, the constructing unit is specifically configured to select the candidate result of the next element as follows:
sorting the recognition results of the next element from high to low according to the target probability distribution, and selecting the target probability distribution with the designated number sorted in the front;
and taking the recognition result corresponding to the selected target probability distribution as the candidate result of the next element.
In one embodiment of the present disclosure, the calculation submodule includes:
a first calculation unit configured to calculate a prefix score corresponding to each of the speech recognition models based on the decoding path of the each of the speech recognition models;
and the second calculation unit is used for multiplying the prefix scores of the voice recognition models by the set corresponding weights and then summing all the obtained products to obtain a decoding objective function.
In one embodiment of the present disclosure, the first computing unit is configured to:
and for each voice recognition model, calculating the selection rate of the decoding path of the voice recognition model, summing the obtained selection rates of all the decoding paths, and then taking the logarithm to obtain the prefix score of the voice recognition model.
In one embodiment of the disclosure, the computation submodule is configured to:
uniformly converting decoding paths of all voice recognition models in the decoding objective function into decoding paths at a specified level by adopting a preset sparse matrix, wherein the sparse matrix is a mapping relation between acoustic tag system elements at different levels and is used for converting the acoustic tag system elements at one level into the acoustic tag system elements at another level;
and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
In one embodiment of the present disclosure, the designated level is a text level, a syllable level, a phone level, or a phone level with contextual background information.
In one embodiment of the present disclosure, the apparatus further comprises at least one of:
the first end module is used for ending the decoding if a sentence end mark is detected during the decoding;
the second ending module is used for ending the decoding if the specified time length is exceeded after the mute sign is detected during the decoding;
and the third ending module is used for ending the decoding if the current state accords with the specified ending state during the decoding.
In one embodiment of the present disclosure, the extraction module is configured to:
sampling audio data to be identified according to a window and an interval with specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating the energy of the Mel space according to the result of the discrete Fourier transform;
and performing discrete cosine transform after filtering the energy of the Mel space to obtain Mel frequency cepstrum coefficients, and taking the Mel frequency cepstrum coefficients as acoustic features corresponding to the audio data.
In one embodiment of the present disclosure, the apparatus further comprises:
and the training module is used for training the plurality of voice recognition models by using voice training data and initial labels, wherein the initial labels are texts at the coarsest level corresponding to the voice training data.
In one embodiment of the disclosure, the training module is to:
inputting voice training data into the voice recognition models to respectively obtain corresponding acoustic labels;
calculating a cost function for each obtained acoustic label, setting corresponding weight, and summing the products of the cost function of each acoustic label and the corresponding weight to obtain a total cost function;
and training the plurality of voice recognition models by taking the minimum total cost function as a target according to the initial label.
In a third aspect of embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the above-mentioned speech recognition method.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the speech recognition method when executing the program.
According to the voice recognition method and the voice recognition device, the voice data to be recognized are subjected to feature extraction and then input into a plurality of pre-trained voice recognition models, corresponding target probability distributions are obtained, fusion decoding is carried out, and therefore a recognition result is obtained. The recognition result is obtained by fusion decoding on the basis of recognition of a plurality of speech recognition models, and the speech recognition models correspond to a plurality of levels of acoustic label systems and can recognize audio data under the acoustic label systems of the plurality of levels, so that more application scenes can be covered, such as scenes with uneven data distribution of rare words, multiple homophones and the like, and the accuracy of Chinese speech recognition and the robustness of the models are further improved on the basis of the fusion decoding.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically shows a first implementation flow diagram of a speech recognition method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart for implementing a speech recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a sparse matrix schematic according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a test flow diagram according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a model training implementation flow diagram according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a training flow diagram according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of training and testing according to an embodiment of the present disclosure;
FIG. 8 schematically shows a medium diagram for a speech recognition method according to an embodiment of the present disclosure;
FIG. 9 schematically shows a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
fig. 10 schematically illustrates a structural diagram of a computing device according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the disclosure, a speech recognition method, a medium, an apparatus and a computing device are provided.
In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of The Invention
The inventors have found that, in existing Chinese speech recognition technology, data augmentation is limited by the incompleteness of the speech data set and may still fail to recognize certain utterances, while multi-model second-pass scoring may cause an explosive growth of the mapping space when the two models use different modeling units, and is therefore hard to implement in an actual production environment.
In view of this, the present disclosure provides a speech recognition method and apparatus, which perform feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data, input the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the respective speech recognition models, and perform fusion decoding on the plurality of target probability distributions to obtain a recognition result of the audio data. The recognition result is obtained by fusion decoding on the basis of recognition of a plurality of speech recognition models, and the speech recognition models correspond to a plurality of levels of acoustic label systems and can recognize audio data under the acoustic label systems of the plurality of levels, so that more application scenes can be covered, such as scenes with uneven data distribution of rare words, multiple homophones and the like, and the accuracy of Chinese speech recognition and the robustness of the models are further improved on the basis of the fusion decoding.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Application scene overview
The speech recognition method and apparatus of the present disclosure can be applied to Chinese speech recognition. The application scenes of Chinese speech recognition are broad, including real-time speech input, intelligent voice customer service, robot conversation, real-time meeting transcription, simultaneous interpretation with on-screen captions, classroom audio recognition, and the like. Special audio data such as rare words or homophones can occur in any of these scenes. The technical solution provided by the present disclosure recognizes audio data under a multi-level acoustic label system through a plurality of speech recognition models, so that more application scenes can be covered, including scenes with special audio data such as rare words or homophones, and the accuracy of Chinese speech recognition and the robustness of the models are further improved by fusion decoding.
Exemplary method
A speech recognition method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 1. As shown in fig. 1, a speech recognition method according to an embodiment of the present disclosure includes the following steps:
s11: performing feature extraction on audio data to be identified to obtain acoustic features corresponding to the audio data;
s12: inputting acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models;
the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic feature under the acoustic label system of the level;
s13: and performing fusion decoding on the plurality of target probability distributions to obtain an identification result of the audio data.
Through the process, the audio data can be recognized through the voice recognition models in the acoustic tag systems of multiple levels, more application scenes can be covered, including recognition scenes with uneven data distribution such as uncommon words or homophones and the like, and the accuracy of Chinese voice recognition and the robustness of the models are further improved based on fusion decoding.
The multi-level acoustic label systems involved in the embodiments of the present disclosure may include at least two of: a text (Chinese-character)-level acoustic label system, a syllable-level acoustic label system, a phoneme-level acoustic label system, and a phoneme-level label system with contextual background information.
A text-level acoustic label contains both acoustic and linguistic information: the same pronunciation can correspond to different Chinese characters in different contexts. A syllable-level acoustic label contains purely acoustic information: the same pronunciation always maps to the same syllable-level label, e.g. both "是" (yes/is) and "市" (city) in the audio data correspond to the syllable-level label "shi4". A phoneme-level acoustic label carries finer-grained acoustic information corresponding to a shorter span of acoustic features, such as an initial or a final. A phoneme-level acoustic label with contextual background information additionally encodes the phonetic context of the phoneme; it has the finest granularity and reflects the acoustic characteristics of the audio data in the most detail.
For example, if the audio data is the sentence "the students have started class" (同学们上课了), its acoustic label under the text-level label system is the character sequence itself, its label under the syllable-level system is the syllable sequence "tong2 xue2 men5 shang4 ke4 le5", and its label under the phoneme-level system is the phoneme sequence "t ong2 x ue2 m en5 sh ang4 k e4 l e5". The plurality of speech recognition models can thus learn the mappings from the audio data to acoustic labels at different levels. When the pronunciation "tong2" ("同") is detected in the audio data, the three speech recognition models output the target probability distributions for "同", "tong2" and "t ong2" respectively, and fusion decoding can then be performed on these three target probability distributions to obtain the recognition result of the audio data. A small illustrative sketch of the three label systems follows below.
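The following Python snippet makes the correspondence concrete. The label sequences are reconstructed from the pinyin given in the example above; the dictionary layout itself is an assumed representation for illustration, not a data format defined by the disclosure.

```python
# Illustrative only: one utterance annotated at three levels of granularity.
labels = {
    "text":     ["同", "学", "们", "上", "课", "了"],                 # Chinese characters
    "syllable": ["tong2", "xue2", "men5", "shang4", "ke4", "le5"],     # pinyin + tone
    "phoneme":  ["t", "ong2", "x", "ue2", "m", "en5",
                 "sh", "ang4", "k", "e4", "l", "e5"],                  # initials / finals
}

# Each speech recognition model is trained against one of these label systems,
# so the same audio maps to targets of different granularity.
for level, seq in labels.items():
    print(level, len(seq), seq)
```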
In one possible implementation, S11 may include:
sampling audio data to be identified according to a window and an interval with specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating the energy of the Mel space according to the result of the discrete Fourier transform;
the energy of the Mel space is filtered and then discrete cosine transformed to obtain Mel frequency cepstrum coefficient, and the Mel Frequency Cepstrum Coefficient (MFCC) is used as acoustic characteristics corresponding to the audio data.
The following describes the feature extraction process with a specific example. For example, a window with a duration of 25ms is specified, the interval is 10ms, and the feature extraction process of the audio data to be recognized is as follows:
1) The audio data to be recognized is cut into 25 ms window segments over time; for audio with a sampling rate of 16000 Hz, each window contains 0.025 × 16000 = 400 sampling points. The interval between windows is 10 ms, so adjacent windows are allowed to overlap;
2) performing Discrete Fourier Transform (DFT) on the sampling points in each window, wherein the specific formula is as follows:
S_i(k) = Σ_{n=1}^{N} s_i(n)·h(n)·e^(-j2πkn/N),  1 ≤ k ≤ K
where S_i(k) is the DFT result of the i-th window, s_i(n) are the samples in that window, h(n) is a Hamming window of length N, and K is the length of the DFT.
3) The energy in the mel space is calculated from the result of the discrete Fourier transform, using the formula:
P_i(k) = |S_i(k)|² / N
where S_i(k) is the DFT result and P_i(k) is the corresponding energy used for the mel-space computation;
4) The energy in the mel space is filtered and a Discrete Cosine Transform (DCT) is then applied to obtain the Mel Frequency Cepstral Coefficients (MFCCs), which are taken as the acoustic features corresponding to the audio data.
The extraction mode of the MFCC acoustic features is based on Hamming window sampling and DCT calculation, so that the acoustic feature extraction is effectively realized, the extraction accuracy is improved, and powerful data support is provided for voice recognition.
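As a rough illustration of these four steps, the following numpy sketch frames the waveform with 25 ms Hamming windows at a 10 ms hop, takes the DFT of each frame, accumulates mel-filterbank energies, and applies a log plus DCT. The filterbank size (26) and the number of cepstral coefficients (13) are assumptions, since the disclosure does not specify them; this is a sketch of the standard MFCC pipeline, not the patent's implementation.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(sr, n_fft, n_mels):
    # triangular filters spaced evenly on the mel scale
    mel = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fb[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fb

def mfcc(signal, sr=16000, win=0.025, hop=0.010, n_fft=512, n_mels=26, n_ceps=13):
    # 1) cut the waveform into 25 ms windows with a 10 ms hop and apply a Hamming window
    frame_len, frame_hop = int(win * sr), int(hop * sr)         # 400 and 160 samples
    n_frames = 1 + (len(signal) - frame_len) // frame_hop       # assumes len(signal) >= frame_len
    idx = np.arange(frame_len)[None, :] + frame_hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2) discrete Fourier transform of every window, 3) energies on the mel scale
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft     # periodogram estimate
    mel_energy = np.maximum(power @ mel_filterbank(sr, n_fft, n_mels).T, 1e-10)

    # 4) log-compress the mel energies and apply the DCT to obtain the cepstrum
    return dct(np.log(mel_energy), type=2, axis=1, norm="ortho")[:, :n_ceps]

features = mfcc(np.random.randn(16000))   # one second of dummy audio -> (98, 13) array
```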
In a possible embodiment, the method further comprises at least one of:
if a sentence ending mark is detected during decoding, ending the decoding;
if the specified time length is exceeded after the mute sign is detected during decoding, the decoding is finished;
and if the current state accords with the specified ending state during decoding, ending the decoding.
The various decoding ending modes enrich the means for controlling the decoding ending, can flexibly set specific implementation modes according to the requirements in practical application, and are convenient and quick.
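A minimal sketch of how the three stopping conditions listed above could be checked in a decoding loop; the end-of-sentence symbol, the silence threshold and the end-state set are placeholder assumptions rather than values given by the disclosure.

```python
def should_stop(last_token, silence_ms, state,
                eos="<eos>", max_silence_ms=800, end_states=frozenset({"FINAL"})):
    return (last_token == eos                # a sentence-end mark was decoded
            or silence_ms > max_silence_ms   # silence lasted longer than the allowed duration
            or state in end_states)          # the decoder reached a specified end state
```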
In a possible implementation, the method may further include:
the plurality of speech recognition models are trained using speech training data and an initial label, the initial label being a coarsest level of text corresponding to the speech training data.
The method for training the plurality of voice recognition models based on the voice training data and the initial labels can be used for training on the basis of the voice training data corresponding to the coarsest level text, so that the voice recognition models can learn the mapping relation between the audio data and the acoustic labels more easily, convergence of model training is accelerated, and training speed is increased.
In the embodiment of the present disclosure, the text level, the syllable level, the phone level, and the phone level with the context information are provided, and the coarsest level among the four levels is the text level, and the finest level is the phone level with the context information. For example, the audio training data is "classmates are class", and the corresponding initial label is the coarsest level text "classmates are class".
Fig. 2 schematically shows a flow chart of a speech recognition method implementation according to an embodiment of the present disclosure. As shown in fig. 2, the speech recognition method of the embodiment of the present disclosure includes the following steps:
s21: performing feature extraction on audio data to be identified to obtain acoustic features corresponding to the audio data;
s22: inputting acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models;
the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic feature under the acoustic label system of the level;
in the embodiment of the present disclosure, the plurality of speech recognition models may be implemented using various structures, such as a HMM-GMM (Hidden Markov Model-Gaussian mixture Model) structure. By using the structure, the most possible hidden state sequence can be obtained dynamically through a forward and backward algorithm of the HMM, so that the acoustic label corresponding to the audio data is obtained. For example, the audio data is "Shanghai is an International metropolitan city", and the acoustic labels "shang 4 hai3 shi4 guo2 ji4 da4 du1 shi 4" at the syllable level and "sh ang4 h ai3 sh i4 g uo2 j i4 d a4 d u1 shi 4" at the phoneme level can be obtained by the above-structured speech recognition model.
S23: determining elements in an acoustic label system corresponding to each voice recognition model, selecting a candidate result of a next element according to target probability distribution of the next element by taking a first element in the acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path by the prefix and the candidate result for decoding; in the same way, in each decoding, the last decoding path is used as the current prefix, and the current decoding path is constructed by combining the next element until the complete decoding path is obtained;
in a possible implementation, the selecting the candidate result of the next element according to the target probability distribution of the next element may include:
sorting the recognition results for the next element from high to low according to their target probabilities, selecting a specified number of the top-ranked probabilities, and taking the recognition results corresponding to the selected probabilities as the candidate results for the next element.
In the embodiment of the present disclosure, the decoding path represents a recognition process of the voice recognition model on the acoustic feature, and the decoding path obtained after the recognition is finished represents the acoustic tag recognized for the acoustic feature.
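The prefix-extension step of S23 can be sketched as follows; the beam size k and the tuple-based path representation are assumptions made here for illustration, not details fixed by the disclosure.

```python
import numpy as np

def extend_prefixes(prefixes, next_probs, k=5):
    """prefixes: list of (label_sequence, log_score) pairs kept from the previous step;
    next_probs: target probability of every element of this model's label system for
    the next position, taken from the model output."""
    top_candidates = np.argsort(next_probs)[::-1][:k]       # highest-probability elements
    new_paths = []
    for seq, log_score in prefixes:
        for v in top_candidates:
            # the previous decoding path becomes the prefix; appending candidate v
            # yields the current decoding path
            new_paths.append((seq + [int(v)], log_score + np.log(next_probs[v] + 1e-12)))
    # prune so the number of active decoding paths stays bounded
    return sorted(new_paths, key=lambda p: p[1], reverse=True)[:k]

# start from the first element of the label system and repeat until the path is complete
paths = [([0], 0.0)]
paths = extend_prefixes(paths, np.array([0.1, 0.6, 0.2, 0.1]))
```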
S24: calculating a prefix score corresponding to each speech recognition model based on the decoding path of each speech recognition model;
in a possible implementation, S24 may specifically include:
and for each voice recognition model, calculating the selection rate of the decoding path of the voice recognition model, summing the obtained selection rates of all the decoding paths, and then taking the logarithm to obtain the prefix score of the voice recognition model.
The prefix score of the speech recognition model can be expressed by the following formula:
score(g·v) = log( Σ p(g·v | X) )
where score(g·v) is the prefix score of the speech recognition model, X is the audio data, g is the current prefix, v is the candidate result for the next element, g·v is the current decoding path constructed from the prefix and the candidate result, and p(g·v | X) is the selection rate (probability) of the current decoding path, which can be read directly from the output of the corresponding speech recognition model; the sum runs over all decoding paths retained for the model.
S25: multiplying the prefix score of each voice recognition model by the set corresponding weight, and then summing all the obtained products to obtain a decoding objective function;
the decoding objective function is used for solving an optimal decoding path from a plurality of decoding paths, and the decoding objective function is realized in a manner that the decoding path with the maximum decoding objective function is found to be the optimal decoding path, and then the optimal decoding path can be used as the recognition result of the audio data. Specifically, the following formula can be used to represent:
S_fusion = λ_1·s_1 + λ_2·s_2 + λ_3·s_3 + …
where S_fusion is the decoding objective function, s_1, s_2, s_3, … are the prefix scores of the individual speech recognition models, and λ_1, λ_2, λ_3, … are the preset weights corresponding to the speech recognition models, with λ_1 + λ_2 + λ_3 + … = 1.
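A hedged sketch of this fusion objective: each model contributes the log of the summed probabilities of its decoding paths (S24), and the per-model prefix scores are combined with weights that sum to 1 (S25). The number of models and the weight values below are illustrative assumptions.

```python
import numpy as np

def prefix_score(path_probs):
    # sum the selection rates (probabilities) of the decoding paths, then take the log
    return float(np.log(np.sum(path_probs) + 1e-12))

def fused_objective(per_model_path_probs, weights=(0.5, 0.3, 0.2)):
    assert abs(sum(weights) - 1.0) < 1e-6, "the lambda weights must sum to 1"
    return sum(w * prefix_score(p) for w, p in zip(weights, per_model_path_probs))

# e.g. three models (text, syllable and phoneme level) scoring one candidate prefix
score = fused_objective([np.array([0.40, 0.10]),    # text-level decoding paths
                         np.array([0.30, 0.20]),    # syllable-level decoding paths
                         np.array([0.50, 0.05])])   # phoneme-level decoding paths
```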
S26: uniformly converting decoding paths of all voice recognition models in a decoding target function into decoding paths at a specified level by adopting a preset sparse matrix;
the sparse matrix is a mapping relation between acoustic tag system elements of different levels, and is used for converting an acoustic tag system element of one level to an acoustic tag system element of another level. In specific application, a sparse matrix between any two levels of acoustic tag system elements can be set, so that the acoustic tag system elements of different levels can be conveniently converted. The decoding paths of different levels can be unified into the decoding path of the appointed level through the conversion, so that the decoding path which enables the decoding objective function to be maximum can be found on the basis of level unification, the recognition result of the audio data is obtained, the operation efficiency is improved, and the recognition result of the appointed level is output.
The specified level may be a text level, a syllable level, a phone level, or a phone level with contextual background information. In general, the designated level is a level required by the speech recognition output, and can be set according to actual needs, such as setting the designated level to a text level. If the speech recognition model corresponding to the appointed level exists in the plurality of speech recognition models, the model does not need to be converted into the decoding path of the appointed level, and the models of other levels are converted into the decoding path of the appointed level. For convenience of implementation, an identity matrix can be introduced as a sparse matrix, the row and column elements of the sparse matrix are the same, only the diagonal value in the matrix is 1, and the remaining values are 0, so that the element conversion of the acoustic tag system at the same level can be completed. Such as text level to text level conversion, syllable level to syllable level conversion, etc.
Through the process, the accuracy of Chinese speech recognition can be improved, the method and the device can be suitable for various scenes including scenes with uneven data distribution, such as scenes with uncommon words, multiple homophones and the like, the output results of speech recognition models of multiple levels are explicitly considered in a fusion decoding mode, and the problem that recognition is not accurate enough in the scenes with uneven data distribution can be solved to a great extent. The introduction of a plurality of voice recognition models can make up the defect of inaccurate recognition of a single voice recognition model, thereby greatly improving the voice recognition effect and greatly improving the robustness of the model. For example, the recognition accuracy in scenes such as uncommon words and multiple homophones can be greatly improved by adding syllable-level and phoneme-level voice recognition models on the basis of the character-level voice recognition models. In addition, the conversion of the decoding path is carried out based on the sparse matrix, so that the fusion of information of different levels can be rapidly realized during decoding, and the efficiency of voice recognition is further improved.
Fig. 3 schematically illustrates a sparse matrix according to an embodiment of the present disclosure. Referring to fig. 3, the sparse matrix is a mapping relationship between syllable-level acoustic label system elements and text-level acoustic label system elements. The rows of the sparse matrix represent syllables and the columns represent characters; each value in the matrix is 0 or 1, where 0 means the row element and the column element have no mapping relationship and 1 means they do. For example, the value in the first row and first column is 1, indicating that the character "大" (big) maps to the syllable "da4". The polyphonic character "都" has two 1s in the sparse matrix, corresponding to its two pronunciations "du1" and "dou1".
In the embodiment of the present disclosure, the plurality of speech recognition models may be regarded as a whole as a network for speech recognition, the sparse matrix may also be regarded as the last layer of the network, and linear matrix operation is performed through the sparse matrix, so that conversion between elements of the acoustic tag system at different levels can be completed, and the operation efficiency in actual use is improved.
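The level conversion can be sketched with a toy sparse matrix; the vocabulary below is an illustrative assumption based on the examples in Fig. 3 and the "Shanghai" example above. Multiplying a syllable-level score vector by the matrix projects it onto the text level.

```python
import numpy as np
from scipy.sparse import csr_matrix

syllables = ["da4", "du1", "dou1", "shi4"]          # syllable-level elements (rows)
chars = ["大", "都", "市", "是"]                    # text-level elements (columns)

pairs = [(0, 0),   # da4  -> 大
         (1, 1),   # du1  -> 都   (the polyphone 都 is reachable from two syllables)
         (2, 1),   # dou1 -> 都
         (3, 2),   # shi4 -> 市   (homophones 市 and 是 share one syllable)
         (3, 3)]   # shi4 -> 是
rows, cols = zip(*pairs)
mapping = csr_matrix((np.ones(len(pairs)), (rows, cols)),
                     shape=(len(syllables), len(chars)))

# one decoding step scored at the syllable level, projected onto the text level
syllable_scores = np.array([0.10, 0.20, 0.05, 0.65])
text_scores = mapping.T.dot(syllable_scores)         # -> scores over 大, 都, 市, 是
```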
S27: and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
Fig. 4 schematically shows a test flow diagram according to an embodiment of the present disclosure. Referring to fig. 4, the audio data to be recognized is subjected to feature extraction to obtain acoustic features, and the acoustic features are respectively input into the 3 speech recognition models 1, 2 and 3 at different levels to obtain the target probability distribution 1, the target probability distribution 2 and the target probability distribution 3 corresponding to each other. The target probability distributions of the three voice recognition models respectively correspond to acoustic label systems of different levels, and respectively represent the matching degree between the acoustic labels and the acoustic features under the corresponding acoustic label systems. The different levels may be set to any three of a text level, a syllable level, a phone level, or a phone level with contextual background information, as desired. And finally, performing fusion decoding on the target probability distributions of the three different levels to obtain the recognition result of the audio data, thereby realizing the Chinese speech recognition based on a plurality of speech recognition models, covering more application scenes including recognition scenes with uneven data distribution such as rare words or homophones and the like, and further improving the accuracy of the Chinese speech recognition and the robustness of the models based on the fusion decoding.
FIG. 5 schematically shows a flow diagram of a model training implementation according to an embodiment of the present disclosure. As shown in fig. 5, the speech recognition method of the embodiment of the present disclosure further includes the following steps:
s51: inputting voice training data into a plurality of voice recognition models to respectively obtain corresponding acoustic labels;
s52: calculating a cost function for each obtained acoustic label and setting corresponding weight;
s53: summing the products of the cost function of each acoustic label and the corresponding weight to obtain a total cost function;
the total cost function can be expressed by the following formula:
L = α_1·l_1 + α_2·l_2 + α_3·l_3 + …
where l_1, l_2, l_3, … are the cost functions of the respective acoustic labels, and α_1, α_2, α_3, … are the weights corresponding to the respective cost functions; their specific values can be set as required, subject to α_1 + α_2 + α_3 + … = 1.
S54: and training a plurality of voice recognition models by taking the minimum total cost function as a target according to the initial label.
Wherein, the initial label is the text of the coarsest level corresponding to the voice training data.
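A minimal sketch of the joint training objective of S52-S54. The α weights and the loss values are placeholders; in practice the per-level costs would come from losses such as CTC or cross-entropy computed against each level's acoustic labels, which the disclosure does not specify.

```python
def total_cost(per_level_costs, alphas=(0.6, 0.2, 0.2)):
    assert abs(sum(alphas) - 1.0) < 1e-6, "the alpha weights must sum to 1"
    return sum(a * l for a, l in zip(alphas, per_level_costs))

# one hypothetical training step: the models produce text-, syllable- and phoneme-level
# costs for the same batch, and the weighted sum is the quantity that is minimised
joint_loss = total_cost([2.31, 1.87, 1.45])
```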
Through the process, the accuracy of Chinese speech recognition can be improved, the method and the device can be suitable for various scenes, including scenes with uneven data distribution, such as scenes with rarely-used characters, multiple homophones and the like, and multitask mode training is carried out based on multiple speech recognition models, the corresponding relations of different levels between acoustic features and acoustic labels can be learned, the robustness of the models to different scenes is greatly improved, and the situations that the rarely-used characters, the multiple homophones and the like cannot be distinguished and covered on the acoustic labels at the character level can be relieved to a great extent.
Fig. 6 schematically shows a training flow diagram according to an embodiment of the present disclosure. Referring to fig. 6, the audio data to be trained is subjected to feature extraction to obtain acoustic features, and the acoustic features are respectively input to a plurality of speech recognition models of different levels, such as the speech recognition model 1 and the speech recognition model 2, so as to obtain a plurality of acoustic tags, such as the acoustic tag 1 and the acoustic tag 2, which correspond to each other. Each voice recognition model corresponds to one level of acoustic label system, and the obtained acoustic label is an acoustic label under the acoustic label system of the corresponding level, so that acoustic labels of multiple levels can be obtained. The plurality of speech recognition models are then trained using the initial labels and the plurality of levels of acoustic labels, so that correspondences between acoustic features and the plurality of levels of acoustic labels can be learned. The training mode enriches the learning content of the model, enhances the learning ability of the model, and enables the trained model to be suitable for wider application scenes.
Fig. 7 schematically illustrates a flow diagram of training and testing according to an embodiment of the present disclosure. Referring to fig. 7, the speech recognition method provided by the embodiment of the present disclosure includes two stages: a training phase and a testing phase. In the training stage, voice training data is input into a plurality of voice recognition models to obtain acoustic labels corresponding to the voice training data, and the voice recognition models correspond to acoustic label systems of different levels respectively. And training the plurality of voice recognition models based on the obtained plurality of acoustic labels, so as to learn the corresponding relation between the acoustic features and the acoustic labels of different levels. In the testing stage, the audio data to be recognized are input into each trained speech recognition model, and corresponding target probability distribution is obtained. The obtained target probability distributions respectively correspond to acoustic label systems of different levels, and respectively represent the matching degree between the acoustic labels and the acoustic features under the corresponding acoustic label systems. And finally, performing fusion decoding on the target probability distributions of different levels to obtain the recognition result of the audio data, thereby realizing Chinese speech recognition based on a plurality of speech recognition models, covering more application scenes, and further improving the accuracy of the Chinese speech recognition and the robustness of the models based on the fusion decoding.
The results of the above method provided by the embodiment of the present disclosure, compared with a prior-art speech recognition method, are shown in Table 1 below. It can be seen that the recognition accuracy of the prior art is low, with many recognition errors, while the recognition accuracy of the embodiment of the present disclosure is very high. The method can therefore markedly improve speech recognition accuracy in scenes with unevenly distributed data, such as homophone-heavy speech.
TABLE 1 (reproduced as an image in the original publication)
Exemplary Medium
Having described the method of the exemplary embodiment of the present disclosure, the medium of the exemplary embodiment of the present disclosure is explained next with reference to fig. 8.
In some possible embodiments, various aspects of the disclosure may also be implemented as a computer-readable medium on which a program is stored, which, when executed by a processor, is for implementing steps in a speech recognition method according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification.
Specifically, the processor is configured to implement the following steps when executing the program:
performing feature extraction on audio data to be identified to obtain acoustic features corresponding to the audio data; inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic features under the acoustic label system of the level; and performing fusion decoding on the target probability distributions to obtain an identification result of the audio data.
It should be noted that: the above-mentioned medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 8, a medium 80 according to an embodiment of the present disclosure is described, which may employ a portable compact disc read-only memory (CD-ROM), include a program, and be run on a device. However, the disclosure is not so limited; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take a variety of forms, including, but not limited to: an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN).
Exemplary devices
Having described the media of the exemplary embodiments of the present disclosure, the apparatus of the exemplary embodiments of the present disclosure is described next with reference to fig. 9.
As shown in fig. 9, a speech recognition apparatus of an embodiment of the present disclosure may include:
an extraction module 901, configured to perform feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data;
a recognition module 902, configured to input the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models, where the plurality of target probability distributions correspond to acoustic label systems of a plurality of levels, and each target probability distribution represents the matching degree between each acoustic label and the acoustic features under the acoustic label system of the corresponding level;
and a fusion module 903, configured to perform fusion decoding on the plurality of target probability distributions to obtain a recognition result of the audio data.
In one possible embodiment, the acoustic label systems of the plurality of levels include at least two of: a text-level acoustic label system, a syllable-level acoustic label system, a phone-level acoustic label system, and a phone-level acoustic label system with contextual information.
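For illustration only, the following snippet shows what label sequences at these levels might look like for the Mandarin phrase "你好" (ni3 hao3); the symbol inventories and the context-dependent phone notation are illustrative assumptions, not definitions taken from this disclosure.

```python
# Illustrative (hypothetical) labels for one utterance under the four levels.
label_systems = {
    "text":      ["你", "好"],                      # one label per character
    "syllable":  ["ni3", "hao3"],                   # tonal syllables (pinyin)
    "phone":     ["n", "i3", "h", "ao3"],           # initials / tonal finals
    # context-dependent phones: each phone annotated with its left/right context
    "phone_ctx": ["sil-n+i3", "n-i3+h", "i3-h+ao3", "h-ao3+sil"],
}
for level, labels in label_systems.items():
    print(f"{level:>9}: {labels}")
```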
In a possible embodiment, the fusion module includes:
a construction submodule, configured to construct a decoding path of each speech recognition model according to the plurality of target probability distributions, where the decoding path represents the recognition process of the corresponding speech recognition model on the acoustic features, and the decoding path obtained after recognition is completed represents the acoustic labels recognized for the acoustic features;
and a calculation submodule, configured to calculate a decoding objective function based on the decoding paths of the speech recognition models, find the decoding path that maximizes the decoding objective function, and take that decoding path as the recognition result of the audio data.
In a possible embodiment, the above-mentioned construction submodule comprises:
a determining unit, configured to determine the elements in the acoustic label system corresponding to each speech recognition model;
and a construction unit, configured to construct the decoding path of the speech recognition model by taking the elements as prefixes and based on the target probability distribution corresponding to the speech recognition model.
In a possible embodiment, the above-mentioned construction unit is configured to:
taking the first element in the acoustic label system corresponding to the speech recognition model as the prefix, selecting candidate results of the next element according to the target probability distribution of the next element, and constructing the current decoding path for decoding from the prefix and the candidate results;
and so on: in each decoding step, the previous decoding path is taken as the current prefix and combined with the next element to construct the current decoding path, until the complete decoding path is obtained.
In a possible embodiment, the above construction unit is specifically configured to select the candidate result of the next element as follows:
sorting the recognition results of the next element from high to low according to the target probability distribution, and selecting a specified number of the highest-ranked target probabilities;
and taking the recognition results corresponding to the selected target probabilities as the candidate results of the next element.
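For illustration only, the following sketch constructs decoding paths by prefix extension with a specified number (beam_width) of highest-ranked candidates per step, under a single label system; it assumes, as a simplification, that the per-step target probability distribution does not depend on the prefix.

```python
import numpy as np

def build_decoding_paths(probs: np.ndarray, beam_width: int = 3):
    """probs[t, v] = probability of label v at decoding step t (one label system)."""
    beams = [((), 0.0)]                                   # (prefix, log-score)
    for step_probs in probs:
        # keep only the `beam_width` highest-ranked labels as candidate results
        top = np.argsort(step_probs)[::-1][:beam_width]
        new_beams = []
        for prefix, score in beams:
            for label in top:
                new_beams.append((prefix + (int(label),),
                                  score + float(np.log(step_probs[label] + 1e-12))))
        # prune to the best partial decoding paths before the next step
        beams = sorted(new_beams, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

# Toy usage: 4 decoding steps over a 6-label system with random probabilities.
rng = np.random.default_rng(0)
p = rng.random((4, 6))
p /= p.sum(axis=1, keepdims=True)
print(build_decoding_paths(p, beam_width=2))
```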
In a possible implementation, the calculation submodule includes:
a first calculation unit configured to calculate a prefix score corresponding to each of the speech recognition models based on a decoding path of each of the speech recognition models;
and a second calculation unit, configured to multiply the prefix score of each speech recognition model by the corresponding set weight, and then sum all the obtained products to obtain the decoding objective function.
In a possible implementation, the first calculation unit is configured to:
and, for each speech recognition model, calculating the selection rate of each decoding path of the speech recognition model, summing the obtained selection rates of all the decoding paths, and then taking the logarithm to obtain the prefix score of the speech recognition model.
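For illustration only, the following sketch computes a prefix score as the logarithm of the summed selection rates of a model's decoding paths and combines the prefix scores of several models into a decoding objective function as a weighted sum; the path selection rates and weights shown are placeholder values, not values from this disclosure.

```python
import math

def prefix_score(path_selection_rates):
    # sum the selection rates of the model's decoding paths, then take the log
    return math.log(sum(path_selection_rates))

def decoding_objective(per_model_path_rates, weights):
    # weighted sum of the per-model prefix scores
    scores = [prefix_score(rates) for rates in per_model_path_rates]
    return sum(w * s for w, s in zip(weights, scores))

# Toy example: two models (e.g. syllable-level and phone-level), weights 0.6/0.4.
paths_model_a = [0.20, 0.05]         # selection rates of paths kept for model A
paths_model_b = [0.10, 0.08, 0.02]   # selection rates of paths kept for model B
print(decoding_objective([paths_model_a, paths_model_b], weights=[0.6, 0.4]))
```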
In a possible embodiment, the calculation submodule is configured to:
uniformly convert the decoding paths of all the speech recognition models in the decoding objective function into decoding paths at a specified level by using a preset sparse matrix, where the sparse matrix describes the mapping relationship between acoustic label system elements of different levels and is used for converting the acoustic label system elements of one level into the acoustic label system elements of another level;
and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
In one possible embodiment, the specified level is the text level, the syllable level, the phone level, or the phone level with contextual information.
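For illustration only, the following sketch shows the level-conversion idea with a tiny 0/1 mapping matrix between a hypothetical phone inventory and a hypothetical syllable inventory; a phone-level decoding path is projected to the syllable level by looking up, for each phone, the syllable that contains it.

```python
import numpy as np

syllables = ["ni3", "hao3"]              # elements of the specified (target) level
phones = ["n", "i3", "h", "ao3"]         # elements of the other (source) level
# mapping[i, j] = 1 if phone j belongs to syllable i (a sparse 0/1 matrix)
mapping = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 1]], dtype=np.int8)

# Convert a phone-level decoding path into the corresponding syllable-level path.
phone_path = ["n", "i3", "h", "ao3"]
syllable_path = []
for ph in phone_path:
    i = int(np.argmax(mapping[:, phones.index(ph)]))   # syllable owning this phone
    if not syllable_path or syllable_path[-1] != syllables[i]:
        syllable_path.append(syllables[i])
print(syllable_path)   # ['ni3', 'hao3']
```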
In a possible embodiment, the apparatus further comprises at least one of:
a first ending module, configured to end the decoding if a sentence-end mark is detected during decoding;
a second ending module, configured to end the decoding if a specified duration has elapsed after a silence mark is detected during decoding;
and a third ending module, configured to end the decoding if the current state conforms to a specified ending state during decoding.
In a possible implementation, the extraction module is configured to:
sampling the audio data to be recognized according to windows of a specified duration and a specified interval;
performing a discrete Fourier transform on the sampling points in each window;
calculating the energy in the Mel space according to the result of the discrete Fourier transform;
and filtering the energy in the Mel space and performing a discrete cosine transform to obtain Mel-frequency cepstral coefficients, which are taken as the acoustic features corresponding to the audio data.
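For illustration only, the following self-contained NumPy sketch follows these four steps (windowing, discrete Fourier transform, Mel-space energy, discrete cosine transform); the window length, hop, filter count and coefficient count are common but assumed values, not parameters prescribed by this disclosure, and the logarithm applied before the DCT is one possible form of the filtering step.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, win=0.025, hop=0.010, n_filters=26, n_ceps=13):
    frame_len, frame_hop, n_fft = int(win * sr), int(hop * sr), 512
    # 1. sample the audio with fixed-length windows at a fixed interval
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_hop)
    frames = np.stack([signal[i * frame_hop: i * frame_hop + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # 2. discrete Fourier transform of the sampling points in each window
    spectrum = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. energy in the Mel space: triangular Mel filterbank applied to the spectrum
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = np.log(spectrum @ fbank.T + 1e-10)   # log as the filtering step
    # 4. discrete cosine transform of the Mel energies -> MFCC features
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return mel_energy @ dct.T                          # shape: (frames, n_ceps)

# Toy usage on one second of random audio.
print(mfcc(np.random.randn(16000)).shape)
```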
In a possible embodiment, the above apparatus further comprises:
and a training module, configured to train the plurality of speech recognition models by using speech training data and initial labels, where the initial labels are texts at the coarsest level corresponding to the speech training data.
In a possible embodiment, the training module is configured to:
inputting the speech training data into the plurality of speech recognition models to respectively obtain the corresponding acoustic labels;
calculating a cost function for each obtained acoustic label and setting a corresponding weight, and summing the products of the cost function of each acoustic label and its corresponding weight to obtain a total cost function;
and training the plurality of speech recognition models according to the initial labels, with the goal of minimizing the total cost function.
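For illustration only, the following sketch forms the total cost function as the weighted sum of per-level cost functions; the cost values and weights are placeholders, and in practice each per-level cost would be a loss (for example CTC or cross-entropy) computed against the labels derived from the initial labels.

```python
def total_cost(per_level_costs, weights):
    # weighted sum of the cost functions of the individual acoustic label systems
    assert len(per_level_costs) == len(weights)
    return sum(w * c for w, c in zip(weights, per_level_costs))

# Toy example: costs from text-, syllable- and phone-level models.
costs = {"text": 2.31, "syllable": 1.87, "phone": 1.42}
weights = {"text": 0.5, "syllable": 0.3, "phone": 0.2}
loss = total_cost(list(costs.values()), list(weights.values()))
print(f"total cost to minimize: {loss:.3f}")
```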
According to the apparatus provided by the embodiment of the present disclosure, feature extraction is performed on the audio data to be recognized, the extracted acoustic features are input into the plurality of pre-trained speech recognition models to obtain the corresponding target probability distributions, and fusion decoding is performed on these distributions to obtain the recognition result. Because the recognition result is obtained by fusion decoding on the basis of the recognition of a plurality of speech recognition models, and the speech recognition models correspond to acoustic label systems of a plurality of levels and can recognize the audio data under these acoustic label systems, more application scenarios can be covered, such as scenarios with uneven data distribution including rare words and many homophones, and the accuracy of Chinese speech recognition and the robustness of the models are further improved on the basis of the fusion decoding.
Exemplary computing device
Having described the methods, media, and apparatus of the exemplary embodiments of the present disclosure, a computing device of the exemplary embodiments of the present disclosure is described next with reference to fig. 10.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
In some possible implementations, a computing device according to embodiments of the present disclosure may include at least one processing unit and at least one memory unit. Wherein the storage unit stores program code that, when executed by the processing unit, causes the processing unit to perform the steps in the speech recognition methods according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
A computing device 100 according to such an embodiment of the present disclosure is described below with reference to fig. 10. The computing device 100 shown in fig. 10 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the disclosure.
As shown in fig. 10, computing device 100 is embodied in the form of a general purpose computing device. Components of computing device 100 may include, but are not limited to: the at least one processing unit 1001, the at least one storage unit 1002, and a bus 1003 connecting the different system components (including the processing unit 1001 and the storage unit 1002).
The bus 1003 includes a data bus, a control bus, and an address bus.
The storage unit 1002 can include readable media in the form of volatile memory, such as Random Access Memory (RAM) 10021 and/or cache memory 10022, and can further include readable media in the form of non-volatile memory, such as Read Only Memory (ROM) 10023.
The storage unit 1002 may also include a program/utility 10025 having a set (at least one) of program modules 10024, such program modules 10024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 100 may also communicate with one or more external devices 1004 (e.g., keyboard, pointing device, etc.). Such communication may occur via input/output (I/O) interface 1005. Moreover, computing device 100 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through network adapter 1006. As shown in FIG. 10, network adapter 1006 communicates with the other modules of computing device 100 via bus 1003. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although in the above detailed description several units/modules or sub-units/sub-modules of the speech recognition apparatus are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects, which is made for convenience of description only, imply that features in these aspects cannot be combined to advantage. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A speech recognition method, comprising:
performing feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data;
inputting the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models, wherein the plurality of target probability distributions correspond to acoustic label systems of a plurality of levels, and each target probability distribution represents the matching degree between each acoustic label and the acoustic features under the acoustic label system of the corresponding level;
and performing fusion decoding on the plurality of target probability distributions to obtain a recognition result of the audio data.
2. The method of claim 1, wherein the acoustic label systems of the plurality of levels include at least two of: a text-level acoustic label system, a syllable-level acoustic label system, a phone-level acoustic label system, and a phone-level acoustic label system with contextual information.
3. The method of claim 1, wherein performing the fusion decoding on the plurality of target probability distributions to obtain the recognition result of the audio data comprises:
constructing a decoding path of each speech recognition model according to the plurality of target probability distributions, wherein the decoding path represents the recognition process of the corresponding speech recognition model on the acoustic features, and the decoding path obtained after recognition is completed represents the acoustic labels recognized for the acoustic features;
and calculating a decoding objective function based on the decoding paths of the speech recognition models, finding the decoding path that maximizes the decoding objective function, and taking that decoding path as the recognition result of the audio data.
4. The method of claim 3, wherein constructing a decoding path for each speech recognition model based on the plurality of target probability distributions comprises:
determining elements in an acoustic label system corresponding to each voice recognition model, and constructing a decoding path of the voice recognition model by taking the elements as prefixes and based on target probability distribution corresponding to the voice recognition model.
5. The method of claim 4, wherein constructing the decoding path of the speech recognition model by using the element as the prefix and based on the target probability distribution corresponding to the speech recognition model comprises:
selecting candidate results of the next element according to the target probability distribution of the next element by taking the first element in the acoustic label system corresponding to the speech recognition model as the prefix, and constructing the current decoding path for decoding from the prefix and the candidate results;
and so on: in each decoding step, the previous decoding path is taken as the current prefix and combined with the next element to construct the current decoding path, until the complete decoding path is obtained.
6. The method of claim 3, wherein computing a decoding objective function based on the decoding paths of the respective speech recognition models comprises:
calculating a prefix score corresponding to the respective speech recognition model based on the decoding path of the respective speech recognition model;
and multiplying the prefix score of each speech recognition model by the corresponding set weight, and then summing all the obtained products to obtain the decoding objective function.
7. The method according to claim 3, wherein finding the decoding path that maximizes the decoding objective function and taking the decoding path as the recognition result of the audio data comprises:
uniformly converting the decoding paths of all the speech recognition models in the decoding objective function into decoding paths at a specified level by using a preset sparse matrix, wherein the sparse matrix describes the mapping relationship between acoustic label system elements of different levels and is used for converting the acoustic label system elements of one level into the acoustic label system elements of another level;
and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
8. A speech recognition apparatus, comprising:
an extraction module, configured to perform feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data;
a recognition module, configured to input the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models, wherein the plurality of target probability distributions correspond to acoustic label systems of a plurality of levels, and each target probability distribution represents the matching degree between each acoustic label and the acoustic features under the acoustic label system of the corresponding level;
and a fusion module, configured to perform fusion decoding on the plurality of target probability distributions to obtain a recognition result of the audio data.
9. A medium storing a computer program, characterized in that the program, when executed by a processor, carries out the method according to any one of claims 1-7.
10. A computing device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-7.
CN202110698074.7A 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment Active CN113327597B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110698074.7A CN113327597B (en) 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment

Publications (2)

Publication Number Publication Date
CN113327597A true CN113327597A (en) 2021-08-31
CN113327597B CN113327597B (en) 2023-08-22

Family

ID=77424406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110698074.7A Active CN113327597B (en) 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment

Country Status (1)

Country Link
CN (1) CN113327597B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013002674A1 (en) * 2011-06-30 2013-01-03 Kocharov Daniil Aleksandrovich Speech recognition system and method
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
CN103700369A (en) * 2013-11-26 2014-04-02 安徽科大讯飞信息科技股份有限公司 Voice navigation method and system
CN105161092A (en) * 2015-09-17 2015-12-16 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN108510990A (en) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Audio recognition method, device, user equipment and storage medium
CN111508497A (en) * 2019-01-30 2020-08-07 北京猎户星空科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110534095A (en) * 2019-08-22 2019-12-03 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
US20210134312A1 (en) * 2019-11-06 2021-05-06 Microsoft Technology Licensing, Llc Audio-visual speech enhancement
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN112652306A (en) * 2020-12-29 2021-04-13 珠海市杰理科技股份有限公司 Voice wake-up method and device, computer equipment and storage medium
CN112802461A (en) * 2020-12-30 2021-05-14 深圳追一科技有限公司 Speech recognition method and device, server, computer readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539273A (en) * 2021-09-16 2021-10-22 腾讯科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium
CN113948085A (en) * 2021-12-22 2022-01-18 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium
CN113948085B (en) * 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium
US11501759B1 (en) 2021-12-22 2022-11-15 Institute Of Automation, Chinese Academy Of Sciences Method, system for speech recognition, electronic device and storage medium

Also Published As

Publication number Publication date
CN113327597B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
US8868431B2 (en) Recognition dictionary creation device and voice recognition device
JP7200405B2 (en) Context Bias for Speech Recognition
US8065149B2 (en) Unsupervised lexicon acquisition from speech and text
JP2559998B2 (en) Speech recognition apparatus and label generation method
CN109686383B (en) Voice analysis method, device and storage medium
JP2002258890A (en) Speech recognizer, computer system, speech recognition method, program and recording medium
US20080027725A1 (en) Automatic Accent Detection With Limited Manually Labeled Data
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
CN112331229B (en) Voice detection method, device, medium and computing equipment
JP2021018413A (en) Method, apparatus, device, and computer readable storage medium for recognizing and decoding voice based on streaming attention model
CN111611349A (en) Voice query method and device, computer equipment and storage medium
Yue et al. End-to-end code-switching asr for low-resourced language pairs
CN113327597B (en) Speech recognition method, medium, device and computing equipment
US8805871B2 (en) Cross-lingual audio search
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
KR20230086737A (en) Cascade Encoders for Simplified Streaming and Non-Streaming Speech Recognition
JP5688761B2 (en) Acoustic model learning apparatus and acoustic model learning method
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
Tanaka et al. Neural speech-to-text language models for rescoring hypotheses of dnn-hmm hybrid automatic speech recognition systems
KR101424496B1 (en) Apparatus for learning Acoustic Model and computer recordable medium storing the method thereof
US7272560B2 (en) Methodology for performing a refinement procedure to implement a speech recognition dictionary
Zhou et al. UnitNet: A sequence-to-sequence acoustic model for concatenative speech synthesis
US20210174789A1 (en) Automatic speech recognition device and method
KR102299269B1 (en) Method and apparatus for building voice database by aligning voice and script

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant