CN113327597A - Speech recognition method, medium, device and computing equipment - Google Patents

Speech recognition method, medium, device and computing equipment

Info

Publication number
CN113327597A
CN113327597A (application CN202110698074.7A)
Authority
CN
China
Prior art keywords
acoustic
decoding
level
voice recognition
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110698074.7A
Other languages
Chinese (zh)
Other versions
CN113327597B (en)
Inventor
杨震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202110698074.7A
Publication of CN113327597A
Application granted
Publication of CN113327597B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/01: Assessment or evaluation of speech recognition systems
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G10L 15/26: Speech to text systems
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Abstract

The embodiments of the present disclosure provide a speech recognition method, medium, apparatus and computing device. The method includes: performing feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data; inputting the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models; and performing fusion decoding on the target probability distributions to obtain a recognition result of the audio data. The embodiments of the present disclosure can improve the accuracy of Chinese speech recognition, are applicable to a variety of scenes, and improve the robustness of the model.

Description

Speech recognition method, medium, device and computing equipment
Technical Field
Embodiments of the present disclosure relate to the field of speech recognition technologies, and in particular, to a method, medium, apparatus, and computing device for speech recognition based on multiple models.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be relevant prior art by inclusion in this section.
At present, Chinese speech recognition models often produce inaccurate results in certain specific scenes, for example when the speech segment to be recognized contains many homophones or rare Chinese characters, or when the training data are unevenly distributed. For example, in voice data the word for "bath" (zao3) is easily misrecognized as the homophonous word "early" (also zao3): the two characters are homophones, "early" appears far more frequently in the training data than "bath", and this unbalanced distribution makes the recognition result insufficiently accurate.
The related art offers two ways to address this problem: data augmentation and multi-model second-pass scoring (rescoring). Data augmentation applies operations such as volume perturbation, speaking-rate adjustment and spectrum masking to voice data that is prone to recognition errors, so as to increase the diversity of the data. Multi-model second-pass scoring trains two models simultaneously in a multi-task learning manner; in the recognition stage, one model generates several most probable candidate text sequences, and the other model then rescores these candidates so that the most probable text sequence is selected as the recognition result.
However, the first approach only slightly perturbs the original audio data; the corresponding text labels are unchanged and no new label information is introduced, so the improvement of the recognition rate in such scenes is limited, and the approach cannot solve the problem of rare words that simply do not appear in the voice data set. The second approach places high demands on the accuracy of the model that generates the candidate text sequences: if none of the generated candidates is correct, rescoring cannot improve recognition accuracy. Moreover, when the two models use different modeling units, such as a Chinese-character modeling unit and a pinyin modeling unit, the mapping space between the different modeling units grows explosively during rescoring, the amount of computation is huge, and the approach is difficult to implement in an actual production environment.
Disclosure of Invention
The present disclosure is intended to provide a speech recognition method and apparatus.
In a first aspect of embodiments of the present disclosure, there is provided a speech recognition method, including:
performing feature extraction on audio data to be identified to obtain acoustic features corresponding to the audio data;
inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic features under the acoustic label system of the level;
and performing fusion decoding on the target probability distributions to obtain an identification result of the audio data.
In one embodiment of the present disclosure, the multiple levels of acoustic tagging architecture include at least two of: a text-level acoustic tagging scheme, a syllable-level acoustic tagging scheme, a phone-level acoustic tagging scheme, and a phone-level tagging scheme with contextual background information.
In an embodiment of the present disclosure, the performing fusion decoding on the plurality of target probability distributions to obtain an identification result of the audio data includes:
constructing a decoding path of each voice recognition model according to the target probability distributions, wherein the decoding path represents the recognition process of the corresponding voice recognition model on the acoustic characteristics, and the decoding path obtained after the recognition represents the acoustic label recognized aiming at the acoustic characteristics;
and calculating a decoding objective function based on the decoding paths of the voice recognition models, finding a decoding path which enables the decoding objective function to be maximum, and taking the decoding path as the recognition result of the audio data.
In one embodiment of the present disclosure, the constructing a decoding path of each speech recognition model according to the plurality of target probability distributions includes:
determining elements in an acoustic label system corresponding to each voice recognition model, and constructing a decoding path of the voice recognition model by taking the elements as prefixes and based on target probability distribution corresponding to the voice recognition model.
In an embodiment of the present disclosure, constructing a decoding path of the speech recognition model by using the element as a prefix and based on a target probability distribution corresponding to the speech recognition model includes:
selecting a candidate result of a next element according to the target probability distribution of the next element by taking a first element in an acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path for decoding by the prefix and the candidate result;
and by analogy, in each decoding, the last decoding path is used as the current prefix, and the current decoding path is constructed by combining the next element until the complete decoding path is obtained.
In an embodiment of the disclosure, the selecting the candidate result of the next element according to the target probability distribution of the next element includes:
sorting the recognition results of the next element from high to low according to the target probability distribution, and selecting the target probability distribution with the designated number sorted in the front;
and taking the recognition result corresponding to the selected target probability distribution as the candidate result of the next element.
In an embodiment of the present disclosure, the calculating a decoding objective function based on the decoding paths of the respective speech recognition models includes:
calculating a prefix score corresponding to the respective speech recognition model based on the decoding path of the respective speech recognition model;
and multiplying the prefix scores of the voice recognition models by the set corresponding weights, and then summing all the obtained products to obtain a decoding objective function.
In one embodiment of the present disclosure, the calculating a prefix score corresponding to each of the speech recognition models based on the decoding path of each of the speech recognition models includes:
and for each voice recognition model, calculating the selection rate of the decoding path of the voice recognition model, summing the obtained selection rates of all the decoding paths, and then taking the logarithm to obtain the prefix score of the voice recognition model.
In one embodiment of the present disclosure, the finding a decoding path that maximizes the decoding objective function, as a result of the identifying the audio data, includes:
uniformly converting decoding paths of all voice recognition models in the decoding objective function into decoding paths at a specified level by adopting a preset sparse matrix, wherein the sparse matrix is a mapping relation between acoustic tag system elements at different levels and is used for converting the acoustic tag system elements at one level into the acoustic tag system elements at another level;
and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
In one embodiment of the present disclosure, the designated level is a text level, a syllable level, a phone level, or a phone level with contextual background information.
In one embodiment of the present disclosure, the method further comprises at least one of:
if a sentence ending mark is detected during decoding, ending the decoding;
if the specified time length is exceeded after the mute sign is detected during decoding, the decoding is finished;
and if the current state accords with the specified ending state during decoding, ending the decoding.
In an embodiment of the present disclosure, the performing feature extraction on the audio data to be recognized to obtain an acoustic feature corresponding to the audio data includes:
sampling audio data to be identified according to a window and an interval with specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating the energy of the Mel space according to the result of the discrete Fourier transform;
and performing discrete cosine transform after filtering the energy of the Mel space to obtain Mel frequency cepstrum coefficients, and taking the Mel frequency cepstrum coefficients as acoustic features corresponding to the audio data.
In one embodiment of the present disclosure, the method further comprises:
and training the plurality of voice recognition models by using voice training data and initial labels, wherein the initial labels are texts at the coarsest level corresponding to the voice training data.
In one embodiment of the present disclosure, the training the plurality of speech recognition models using speech training data and initial labels includes:
inputting voice training data into the voice recognition models to respectively obtain corresponding acoustic labels;
calculating a cost function for each obtained acoustic label, setting corresponding weight, and summing the products of the cost function of each acoustic label and the corresponding weight to obtain a total cost function;
and training the plurality of voice recognition models by taking the minimum total cost function as a target according to the initial label.
In a second aspect of embodiments of the present disclosure, there is provided a speech recognition apparatus comprising:
the extraction module is used for extracting the characteristics of the audio data to be identified so as to obtain the acoustic characteristics corresponding to the audio data;
the recognition module is used for inputting the acoustic features into a plurality of pre-trained voice recognition models so as to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic features under the acoustic label system of the level;
and the fusion module is used for performing fusion decoding on the target probability distributions to obtain the identification result of the audio data.
In one embodiment of the present disclosure, the multiple levels of acoustic tagging architecture include at least two of: a text-level acoustic tagging scheme, a syllable-level acoustic tagging scheme, a phone-level acoustic tagging scheme, and a phone-level tagging scheme with contextual background information.
In one embodiment of the present disclosure, the fusion module includes:
the construction submodule is used for constructing a decoding path of each voice recognition model according to the target probability distributions, the decoding path represents the recognition process of the corresponding voice recognition model on the acoustic features, and the decoding path obtained after the recognition is finished represents the acoustic label recognized aiming at the acoustic features;
and the calculation submodule is used for calculating a decoding objective function based on the decoding path of each voice recognition model, finding out the decoding path which enables the decoding objective function to be maximum, and taking the decoding path as the recognition result of the audio data.
In one embodiment of the disclosure, the construction submodule includes:
the determining unit is used for determining elements in the acoustic label system corresponding to each voice recognition model;
and the construction unit is used for constructing a decoding path of the voice recognition model by taking the element as a prefix and based on the target probability distribution corresponding to the voice recognition model.
In one embodiment of the present disclosure, the construction unit is configured to:
selecting a candidate result of a next element according to the target probability distribution of the next element by taking a first element in an acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path for decoding by the prefix and the candidate result;
and by analogy, in each decoding, the last decoding path is used as the current prefix, and the current decoding path is constructed by combining the next element until the complete decoding path is obtained.
In an embodiment of the disclosure, the constructing unit is specifically configured to select the candidate result of the next element as follows:
sorting the recognition results of the next element from high to low according to the target probability distribution, and selecting the target probability distribution with the designated number sorted in the front;
and taking the recognition result corresponding to the selected target probability distribution as the candidate result of the next element.
In one embodiment of the present disclosure, the calculation submodule includes:
a first calculation unit configured to calculate a prefix score corresponding to each of the speech recognition models based on the decoding path of the each of the speech recognition models;
and the second calculation unit is used for multiplying the prefix scores of the voice recognition models by the set corresponding weights and then summing all the obtained products to obtain a decoding objective function.
In one embodiment of the present disclosure, the first computing unit is configured to:
and for each voice recognition model, calculating the selection rate of the decoding path of the voice recognition model, summing the obtained selection rates of all the decoding paths, and then taking the logarithm to obtain the prefix score of the voice recognition model.
In one embodiment of the disclosure, the computation submodule is configured to:
uniformly converting decoding paths of all voice recognition models in the decoding objective function into decoding paths at a specified level by adopting a preset sparse matrix, wherein the sparse matrix is a mapping relation between acoustic tag system elements at different levels and is used for converting the acoustic tag system elements at one level into the acoustic tag system elements at another level;
and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
In one embodiment of the present disclosure, the designated level is a text level, a syllable level, a phone level, or a phone level with contextual background information.
In one embodiment of the present disclosure, the apparatus further comprises at least one of:
the first end module is used for ending the decoding if a sentence end mark is detected during the decoding;
the second ending module is used for ending the decoding if the specified time length is exceeded after the mute sign is detected during the decoding;
and the third ending module is used for ending the decoding if the current state accords with the specified ending state during the decoding.
In one embodiment of the present disclosure, the extraction module is configured to:
sampling audio data to be identified according to a window and an interval with specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating the energy of the Mel space according to the result of the discrete Fourier transform;
and performing discrete cosine transform after filtering the energy of the Mel space to obtain Mel frequency cepstrum coefficients, and taking the Mel frequency cepstrum coefficients as acoustic features corresponding to the audio data.
In one embodiment of the present disclosure, the apparatus further comprises:
and the training module is used for training the plurality of voice recognition models by using voice training data and initial labels, wherein the initial labels are texts at the coarsest level corresponding to the voice training data.
In one embodiment of the disclosure, the training module is to:
inputting voice training data into the voice recognition models to respectively obtain corresponding acoustic labels;
calculating a cost function for each obtained acoustic label, setting corresponding weight, and summing the products of the cost function of each acoustic label and the corresponding weight to obtain a total cost function;
and training the plurality of voice recognition models by taking the minimum total cost function as a target according to the initial label.
In a third aspect of embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the above-mentioned speech recognition method.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the speech recognition method when executing the program.
According to the voice recognition method and the voice recognition device, the voice data to be recognized are subjected to feature extraction and then input into a plurality of pre-trained voice recognition models, corresponding target probability distributions are obtained, fusion decoding is carried out, and therefore a recognition result is obtained. The recognition result is obtained by fusion decoding on the basis of recognition of a plurality of speech recognition models, and the speech recognition models correspond to a plurality of levels of acoustic label systems and can recognize audio data under the acoustic label systems of the plurality of levels, so that more application scenes can be covered, such as scenes with uneven data distribution of rare words, multiple homophones and the like, and the accuracy of Chinese speech recognition and the robustness of the models are further improved on the basis of the fusion decoding.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically shows a first implementation flow diagram of a speech recognition method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart for implementing a speech recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a sparse matrix schematic according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a test flow diagram according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a model training implementation flow diagram according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a training flow diagram according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of training and testing according to an embodiment of the present disclosure;
FIG. 8 schematically shows a medium diagram for a speech recognition method according to an embodiment of the present disclosure;
FIG. 9 schematically shows a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
fig. 10 schematically illustrates a structural diagram of a computing device according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the disclosure, a speech recognition method, a medium, an apparatus and a computing device are provided.
In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of The Invention
The inventors have found that, in existing Chinese speech recognition technology, data augmentation is limited by the incompleteness of the speech data set and may still fail to recognize certain utterances, while multi-model second-pass scoring may cause an explosive growth of the mapping space when the two models use different modeling units, and is therefore hard to implement in an actual production environment.
In view of this, the present disclosure provides a speech recognition method and apparatus, which perform feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data, input the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the respective speech recognition models, and perform fusion decoding on the plurality of target probability distributions to obtain a recognition result of the audio data. The recognition result is obtained by fusion decoding on the basis of recognition of a plurality of speech recognition models, and the speech recognition models correspond to a plurality of levels of acoustic label systems and can recognize audio data under the acoustic label systems of the plurality of levels, so that more application scenes can be covered, such as scenes with uneven data distribution of rare words, multiple homophones and the like, and the accuracy of Chinese speech recognition and the robustness of the models are further improved on the basis of the fusion decoding.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Application scene overview
The speech recognition method and apparatus of the present disclosure can be applied to Chinese speech recognition. The application scenes of Chinese speech recognition are broad, including real-time speech input, intelligent voice customer service, robot conversation, real-time meeting transcription, simultaneous interpretation with on-screen captions, classroom audio recognition, and the like. Special audio data such as rare words or homophones can occur in any of these scenes. The technical solution provided by the present disclosure recognizes audio data under a multi-level acoustic label system through a plurality of speech recognition models, so that more application scenes can be covered, including scenes with special audio data such as rare words or homophones, and the accuracy of Chinese speech recognition and the robustness of the models are further improved by fusion decoding.
Exemplary method
A speech recognition method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 1. As shown in fig. 1, a speech recognition method according to an embodiment of the present disclosure includes the following steps:
s11: performing feature extraction on audio data to be identified to obtain acoustic features corresponding to the audio data;
s12: inputting acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models;
the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic feature under the acoustic label system of the level;
s13: and performing fusion decoding on the plurality of target probability distributions to obtain an identification result of the audio data.
Through the process, the audio data can be recognized through the voice recognition models in the acoustic tag systems of multiple levels, more application scenes can be covered, including recognition scenes with uneven data distribution such as uncommon words or homophones and the like, and the accuracy of Chinese voice recognition and the robustness of the models are further improved based on fusion decoding.
The multi-level acoustic label systems involved in the embodiments of the present disclosure may include at least two of: a text (Chinese-character)-level acoustic label system, a syllable-level acoustic label system, a phoneme-level acoustic label system, and a phoneme-level label system with contextual background information.
A text-level acoustic label contains both acoustic and linguistic information: the same pronunciation can correspond to different Chinese characters in different contexts. A syllable-level acoustic label contains purely acoustic information: the same pronunciation always maps to the same syllable-level label, e.g. both "是" (yes/is) and "市" (city) in the audio data correspond to the syllable-level label "shi4". A phoneme-level acoustic label carries finer-grained acoustic information corresponding to a shorter span of acoustic features, such as an initial or a final. A phoneme-level acoustic label with contextual background information additionally encodes the phonetic context of the phoneme; it has the finest granularity and reflects the acoustic characteristics of the audio data in the most detail.
For example, if the audio data is the sentence "the students have started class" (同学们上课了), its acoustic label under the text-level label system is the character sequence itself, its label under the syllable-level system is the syllable sequence "tong2 xue2 men5 shang4 ke4 le5", and its label under the phoneme-level system is the phoneme sequence "t ong2 x ue2 m en5 sh ang4 k e4 l e5". The plurality of speech recognition models can thus learn the mappings from the audio data to acoustic labels at different levels. When the pronunciation "tong2" ("同") is detected in the audio data, the three speech recognition models output the target probability distributions for "同", "tong2" and "t ong2" respectively, and fusion decoding can then be performed on these three target probability distributions to obtain the recognition result of the audio data. A small illustrative sketch of the three label systems follows below.
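The following Python snippet makes the correspondence concrete. The label sequences are reconstructed from the pinyin given in the example above; the dictionary layout itself is an assumed representation for illustration, not a data format defined by the disclosure.

```python
# Illustrative only: one utterance annotated at three levels of granularity.
labels = {
    "text":     ["同", "学", "们", "上", "课", "了"],                 # Chinese characters
    "syllable": ["tong2", "xue2", "men5", "shang4", "ke4", "le5"],     # pinyin + tone
    "phoneme":  ["t", "ong2", "x", "ue2", "m", "en5",
                 "sh", "ang4", "k", "e4", "l", "e5"],                  # initials / finals
}

# Each speech recognition model is trained against one of these label systems,
# so the same audio maps to targets of different granularity.
for level, seq in labels.items():
    print(level, len(seq), seq)
```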
In one possible implementation, S11 may include:
sampling audio data to be identified according to a window and an interval with specified duration;
performing discrete Fourier transform on the sampling points in each window;
calculating the energy of the Mel space according to the result of the discrete Fourier transform;
the energy of the Mel space is filtered and then discrete cosine transformed to obtain Mel frequency cepstrum coefficient, and the Mel Frequency Cepstrum Coefficient (MFCC) is used as acoustic characteristics corresponding to the audio data.
The following describes the feature extraction process with a specific example. For example, a window with a duration of 25ms is specified, the interval is 10ms, and the feature extraction process of the audio data to be recognized is as follows:
1) The audio data to be recognized is cut into 25 ms window segments over time; for audio with a sampling rate of 16000 Hz, each window contains 0.025 × 16000 = 400 sampling points. The interval between windows is 10 ms, so adjacent windows are allowed to overlap;
2) performing Discrete Fourier Transform (DFT) on the sampling points in each window, wherein the specific formula is as follows:
S_i(k) = Σ_{n=1}^{N} s_i(n)·h(n)·e^(-j2πkn/N),  1 ≤ k ≤ K
where S_i(k) is the DFT result of the i-th window, s_i(n) are the samples in that window, h(n) is a Hamming window of length N, and K is the length of the DFT.
3) The energy in the mel space is calculated from the result of the discrete Fourier transform, using the formula:
P_i(k) = |S_i(k)|² / N
where S_i(k) is the DFT result and P_i(k) is the corresponding energy used for the mel-space computation;
4) The energy in the mel space is filtered and a Discrete Cosine Transform (DCT) is then applied to obtain the Mel Frequency Cepstral Coefficients (MFCCs), which are taken as the acoustic features corresponding to the audio data.
The extraction mode of the MFCC acoustic features is based on Hamming window sampling and DCT calculation, so that the acoustic feature extraction is effectively realized, the extraction accuracy is improved, and powerful data support is provided for voice recognition.
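As a rough illustration of these four steps, the following numpy sketch frames the waveform with 25 ms Hamming windows at a 10 ms hop, takes the DFT of each frame, accumulates mel-filterbank energies, and applies a log plus DCT. The filterbank size (26) and the number of cepstral coefficients (13) are assumptions, since the disclosure does not specify them; this is a sketch of the standard MFCC pipeline, not the patent's implementation.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(sr, n_fft, n_mels):
    # triangular filters spaced evenly on the mel scale
    mel = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fb[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fb

def mfcc(signal, sr=16000, win=0.025, hop=0.010, n_fft=512, n_mels=26, n_ceps=13):
    # 1) cut the waveform into 25 ms windows with a 10 ms hop and apply a Hamming window
    frame_len, frame_hop = int(win * sr), int(hop * sr)         # 400 and 160 samples
    n_frames = 1 + (len(signal) - frame_len) // frame_hop       # assumes len(signal) >= frame_len
    idx = np.arange(frame_len)[None, :] + frame_hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2) discrete Fourier transform of every window, 3) energies on the mel scale
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft     # periodogram estimate
    mel_energy = np.maximum(power @ mel_filterbank(sr, n_fft, n_mels).T, 1e-10)

    # 4) log-compress the mel energies and apply the DCT to obtain the cepstrum
    return dct(np.log(mel_energy), type=2, axis=1, norm="ortho")[:, :n_ceps]

features = mfcc(np.random.randn(16000))   # one second of dummy audio -> (98, 13) array
```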
In a possible embodiment, the method further comprises at least one of:
if a sentence ending mark is detected during decoding, ending the decoding;
if the specified time length is exceeded after the mute sign is detected during decoding, the decoding is finished;
and if the current state accords with the specified ending state during decoding, ending the decoding.
The various decoding ending modes enrich the means for controlling the decoding ending, can flexibly set specific implementation modes according to the requirements in practical application, and are convenient and quick.
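A minimal sketch of how the three stopping conditions listed above could be checked in a decoding loop; the end-of-sentence symbol, the silence threshold and the end-state set are placeholder assumptions rather than values given by the disclosure.

```python
def should_stop(last_token, silence_ms, state,
                eos="<eos>", max_silence_ms=800, end_states=frozenset({"FINAL"})):
    return (last_token == eos                # a sentence-end mark was decoded
            or silence_ms > max_silence_ms   # silence lasted longer than the allowed duration
            or state in end_states)          # the decoder reached a specified end state
```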
In a possible implementation, the method may further include:
the plurality of speech recognition models are trained using speech training data and an initial label, the initial label being a coarsest level of text corresponding to the speech training data.
The method for training the plurality of voice recognition models based on the voice training data and the initial labels can be used for training on the basis of the voice training data corresponding to the coarsest level text, so that the voice recognition models can learn the mapping relation between the audio data and the acoustic labels more easily, convergence of model training is accelerated, and training speed is increased.
In the embodiment of the present disclosure, the text level, the syllable level, the phone level, and the phone level with the context information are provided, and the coarsest level among the four levels is the text level, and the finest level is the phone level with the context information. For example, the audio training data is "classmates are class", and the corresponding initial label is the coarsest level text "classmates are class".
Fig. 2 schematically shows a flow chart of a speech recognition method implementation according to an embodiment of the present disclosure. As shown in fig. 2, the speech recognition method of the embodiment of the present disclosure includes the following steps:
s21: performing feature extraction on audio data to be identified to obtain acoustic features corresponding to the audio data;
s22: inputting acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models;
the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic feature under the acoustic label system of the level;
in the embodiment of the present disclosure, the plurality of speech recognition models may be implemented using various structures, such as a HMM-GMM (Hidden Markov Model-Gaussian mixture Model) structure. By using the structure, the most possible hidden state sequence can be obtained dynamically through a forward and backward algorithm of the HMM, so that the acoustic label corresponding to the audio data is obtained. For example, the audio data is "Shanghai is an International metropolitan city", and the acoustic labels "shang 4 hai3 shi4 guo2 ji4 da4 du1 shi 4" at the syllable level and "sh ang4 h ai3 sh i4 g uo2 j i4 d a4 d u1 shi 4" at the phoneme level can be obtained by the above-structured speech recognition model.
S23: determining elements in an acoustic label system corresponding to each voice recognition model, selecting a candidate result of a next element according to target probability distribution of the next element by taking a first element in the acoustic label system corresponding to the voice recognition model as a prefix, and constructing a current decoding path by the prefix and the candidate result for decoding; in the same way, in each decoding, the last decoding path is used as the current prefix, and the current decoding path is constructed by combining the next element until the complete decoding path is obtained;
in a possible implementation, the selecting the candidate result of the next element according to the target probability distribution of the next element may include:
sorting the recognition results for the next element from high to low according to their target probabilities, selecting a specified number of the top-ranked probabilities, and taking the recognition results corresponding to the selected probabilities as the candidate results for the next element.
In the embodiment of the present disclosure, the decoding path represents a recognition process of the voice recognition model on the acoustic feature, and the decoding path obtained after the recognition is finished represents the acoustic tag recognized for the acoustic feature.
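The prefix-extension step of S23 can be sketched as follows; the beam size k and the tuple-based path representation are assumptions made here for illustration, not details fixed by the disclosure.

```python
import numpy as np

def extend_prefixes(prefixes, next_probs, k=5):
    """prefixes: list of (label_sequence, log_score) pairs kept from the previous step;
    next_probs: target probability of every element of this model's label system for
    the next position, taken from the model output."""
    top_candidates = np.argsort(next_probs)[::-1][:k]       # highest-probability elements
    new_paths = []
    for seq, log_score in prefixes:
        for v in top_candidates:
            # the previous decoding path becomes the prefix; appending candidate v
            # yields the current decoding path
            new_paths.append((seq + [int(v)], log_score + np.log(next_probs[v] + 1e-12)))
    # prune so the number of active decoding paths stays bounded
    return sorted(new_paths, key=lambda p: p[1], reverse=True)[:k]

# start from the first element of the label system and repeat until the path is complete
paths = [([0], 0.0)]
paths = extend_prefixes(paths, np.array([0.1, 0.6, 0.2, 0.1]))
```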
S24: calculating a prefix score corresponding to each speech recognition model based on the decoding path of each speech recognition model;
in a possible implementation, S24 may specifically include:
and for each voice recognition model, calculating the selection rate of the decoding path of the voice recognition model, summing the obtained selection rates of all the decoding paths, and then taking the logarithm to obtain the prefix score of the voice recognition model.
The prefix score of the speech recognition model can be expressed by the following formula:
score(g·v) = log( Σ p(g·v | X) )
where score(g·v) is the prefix score of the speech recognition model, X is the audio data, g is the current prefix, v is the candidate result for the next element, g·v is the current decoding path constructed from the prefix and the candidate result, and p(g·v | X) is the selection rate (probability) of the current decoding path, which can be read directly from the output of the corresponding speech recognition model; the sum runs over all decoding paths retained for the model.
S25: multiplying the prefix score of each voice recognition model by the set corresponding weight, and then summing all the obtained products to obtain a decoding objective function;
the decoding objective function is used for solving an optimal decoding path from a plurality of decoding paths, and the decoding objective function is realized in a manner that the decoding path with the maximum decoding objective function is found to be the optimal decoding path, and then the optimal decoding path can be used as the recognition result of the audio data. Specifically, the following formula can be used to represent:
S_fusion = λ_1·s_1 + λ_2·s_2 + λ_3·s_3 + …
where S_fusion is the decoding objective function, s_1, s_2, s_3, … are the prefix scores of the individual speech recognition models, and λ_1, λ_2, λ_3, … are the preset weights corresponding to the speech recognition models, with λ_1 + λ_2 + λ_3 + … = 1.
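A hedged sketch of this fusion objective: each model contributes the log of the summed probabilities of its decoding paths (S24), and the per-model prefix scores are combined with weights that sum to 1 (S25). The number of models and the weight values below are illustrative assumptions.

```python
import numpy as np

def prefix_score(path_probs):
    # sum the selection rates (probabilities) of the decoding paths, then take the log
    return float(np.log(np.sum(path_probs) + 1e-12))

def fused_objective(per_model_path_probs, weights=(0.5, 0.3, 0.2)):
    assert abs(sum(weights) - 1.0) < 1e-6, "the lambda weights must sum to 1"
    return sum(w * prefix_score(p) for w, p in zip(weights, per_model_path_probs))

# e.g. three models (text, syllable and phoneme level) scoring one candidate prefix
score = fused_objective([np.array([0.40, 0.10]),    # text-level decoding paths
                         np.array([0.30, 0.20]),    # syllable-level decoding paths
                         np.array([0.50, 0.05])])   # phoneme-level decoding paths
```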
S26: uniformly converting decoding paths of all voice recognition models in a decoding target function into decoding paths at a specified level by adopting a preset sparse matrix;
the sparse matrix is a mapping relation between acoustic tag system elements of different levels, and is used for converting an acoustic tag system element of one level to an acoustic tag system element of another level. In specific application, a sparse matrix between any two levels of acoustic tag system elements can be set, so that the acoustic tag system elements of different levels can be conveniently converted. The decoding paths of different levels can be unified into the decoding path of the appointed level through the conversion, so that the decoding path which enables the decoding objective function to be maximum can be found on the basis of level unification, the recognition result of the audio data is obtained, the operation efficiency is improved, and the recognition result of the appointed level is output.
The specified level may be a text level, a syllable level, a phone level, or a phone level with contextual background information. In general, the designated level is a level required by the speech recognition output, and can be set according to actual needs, such as setting the designated level to a text level. If the speech recognition model corresponding to the appointed level exists in the plurality of speech recognition models, the model does not need to be converted into the decoding path of the appointed level, and the models of other levels are converted into the decoding path of the appointed level. For convenience of implementation, an identity matrix can be introduced as a sparse matrix, the row and column elements of the sparse matrix are the same, only the diagonal value in the matrix is 1, and the remaining values are 0, so that the element conversion of the acoustic tag system at the same level can be completed. Such as text level to text level conversion, syllable level to syllable level conversion, etc.
Through the process, the accuracy of Chinese speech recognition can be improved, the method and the device can be suitable for various scenes including scenes with uneven data distribution, such as scenes with uncommon words, multiple homophones and the like, the output results of speech recognition models of multiple levels are explicitly considered in a fusion decoding mode, and the problem that recognition is not accurate enough in the scenes with uneven data distribution can be solved to a great extent. The introduction of a plurality of voice recognition models can make up the defect of inaccurate recognition of a single voice recognition model, thereby greatly improving the voice recognition effect and greatly improving the robustness of the model. For example, the recognition accuracy in scenes such as uncommon words and multiple homophones can be greatly improved by adding syllable-level and phoneme-level voice recognition models on the basis of the character-level voice recognition models. In addition, the conversion of the decoding path is carried out based on the sparse matrix, so that the fusion of information of different levels can be rapidly realized during decoding, and the efficiency of voice recognition is further improved.
Fig. 3 schematically illustrates a sparse matrix according to an embodiment of the present disclosure. Referring to fig. 3, the sparse matrix is a mapping relationship between syllable-level acoustic label system elements and text-level acoustic label system elements. The rows of the sparse matrix represent syllables and the columns represent characters; each value in the matrix is 0 or 1, where 0 means the row element and the column element have no mapping relationship and 1 means they do. For example, the value in the first row and first column is 1, indicating that the character "大" (big) maps to the syllable "da4". The polyphonic character "都" has two 1s in the sparse matrix, corresponding to its two pronunciations "du1" and "dou1".
In the embodiment of the present disclosure, the plurality of speech recognition models may be regarded as a whole as a network for speech recognition, the sparse matrix may also be regarded as the last layer of the network, and linear matrix operation is performed through the sparse matrix, so that conversion between elements of the acoustic tag system at different levels can be completed, and the operation efficiency in actual use is improved.
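The level conversion can be sketched with a toy sparse matrix; the vocabulary below is an illustrative assumption based on the examples in Fig. 3 and the "Shanghai" example above. Multiplying a syllable-level score vector by the matrix projects it onto the text level.

```python
import numpy as np
from scipy.sparse import csr_matrix

syllables = ["da4", "du1", "dou1", "shi4"]          # syllable-level elements (rows)
chars = ["大", "都", "市", "是"]                    # text-level elements (columns)

pairs = [(0, 0),   # da4  -> 大
         (1, 1),   # du1  -> 都   (the polyphone 都 is reachable from two syllables)
         (2, 1),   # dou1 -> 都
         (3, 2),   # shi4 -> 市   (homophones 市 and 是 share one syllable)
         (3, 3)]   # shi4 -> 是
rows, cols = zip(*pairs)
mapping = csr_matrix((np.ones(len(pairs)), (rows, cols)),
                     shape=(len(syllables), len(chars)))

# one decoding step scored at the syllable level, projected onto the text level
syllable_scores = np.array([0.10, 0.20, 0.05, 0.65])
text_scores = mapping.T.dot(syllable_scores)         # -> scores over 大, 都, 市, 是
```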
S27: and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
Fig. 4 schematically shows a test flow diagram according to an embodiment of the present disclosure. Referring to fig. 4, the audio data to be recognized is subjected to feature extraction to obtain acoustic features, and the acoustic features are respectively input into the 3 speech recognition models 1, 2 and 3 at different levels to obtain the target probability distribution 1, the target probability distribution 2 and the target probability distribution 3 corresponding to each other. The target probability distributions of the three voice recognition models respectively correspond to acoustic label systems of different levels, and respectively represent the matching degree between the acoustic labels and the acoustic features under the corresponding acoustic label systems. The different levels may be set to any three of a text level, a syllable level, a phone level, or a phone level with contextual background information, as desired. And finally, performing fusion decoding on the target probability distributions of the three different levels to obtain the recognition result of the audio data, thereby realizing the Chinese speech recognition based on a plurality of speech recognition models, covering more application scenes including recognition scenes with uneven data distribution such as rare words or homophones and the like, and further improving the accuracy of the Chinese speech recognition and the robustness of the models based on the fusion decoding.
FIG. 5 schematically shows a flow diagram of a model training implementation according to an embodiment of the present disclosure. As shown in fig. 5, the speech recognition method of the embodiment of the present disclosure further includes the following steps:
s51: inputting voice training data into a plurality of voice recognition models to respectively obtain corresponding acoustic labels;
s52: calculating a cost function for each obtained acoustic label and setting corresponding weight;
s53: summing the products of the cost function of each acoustic label and the corresponding weight to obtain a total cost function;
the total cost function can be expressed by the following formula:
L = α_1·l_1 + α_2·l_2 + α_3·l_3 + …
where l_1, l_2, l_3, … are the cost functions of the respective acoustic labels, and α_1, α_2, α_3, … are the weights corresponding to the respective cost functions; their specific values can be set as required, subject to α_1 + α_2 + α_3 + … = 1.
S54: and training a plurality of voice recognition models by taking the minimum total cost function as a target according to the initial label.
Wherein, the initial label is the text of the coarsest level corresponding to the voice training data.
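A minimal sketch of the joint training objective of S52-S54. The α weights and the loss values are placeholders; in practice the per-level costs would come from losses such as CTC or cross-entropy computed against each level's acoustic labels, which the disclosure does not specify.

```python
def total_cost(per_level_costs, alphas=(0.6, 0.2, 0.2)):
    assert abs(sum(alphas) - 1.0) < 1e-6, "the alpha weights must sum to 1"
    return sum(a * l for a, l in zip(alphas, per_level_costs))

# one hypothetical training step: the models produce text-, syllable- and phoneme-level
# costs for the same batch, and the weighted sum is the quantity that is minimised
joint_loss = total_cost([2.31, 1.87, 1.45])
```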
Through the process, the accuracy of Chinese speech recognition can be improved, the method and the device can be suitable for various scenes, including scenes with uneven data distribution, such as scenes with rarely-used characters, multiple homophones and the like, and multitask mode training is carried out based on multiple speech recognition models, the corresponding relations of different levels between acoustic features and acoustic labels can be learned, the robustness of the models to different scenes is greatly improved, and the situations that the rarely-used characters, the multiple homophones and the like cannot be distinguished and covered on the acoustic labels at the character level can be relieved to a great extent.
Fig. 6 schematically shows a training flow diagram according to an embodiment of the present disclosure. Referring to fig. 6, the audio data to be trained is subjected to feature extraction to obtain acoustic features, and the acoustic features are respectively input to a plurality of speech recognition models of different levels, such as the speech recognition model 1 and the speech recognition model 2, so as to obtain a plurality of acoustic tags, such as the acoustic tag 1 and the acoustic tag 2, which correspond to each other. Each voice recognition model corresponds to one level of acoustic label system, and the obtained acoustic label is an acoustic label under the acoustic label system of the corresponding level, so that acoustic labels of multiple levels can be obtained. The plurality of speech recognition models are then trained using the initial labels and the plurality of levels of acoustic labels, so that correspondences between acoustic features and the plurality of levels of acoustic labels can be learned. The training mode enriches the learning content of the model, enhances the learning ability of the model, and enables the trained model to be suitable for wider application scenes.
Fig. 7 schematically illustrates a flow diagram of training and testing according to an embodiment of the present disclosure. Referring to fig. 7, the speech recognition method provided by the embodiment of the present disclosure includes two stages: a training phase and a testing phase. In the training stage, voice training data is input into a plurality of voice recognition models to obtain acoustic labels corresponding to the voice training data, and the voice recognition models correspond to acoustic label systems of different levels respectively. And training the plurality of voice recognition models based on the obtained plurality of acoustic labels, so as to learn the corresponding relation between the acoustic features and the acoustic labels of different levels. In the testing stage, the audio data to be recognized are input into each trained speech recognition model, and corresponding target probability distribution is obtained. The obtained target probability distributions respectively correspond to acoustic label systems of different levels, and respectively represent the matching degree between the acoustic labels and the acoustic features under the corresponding acoustic label systems. And finally, performing fusion decoding on the target probability distributions of different levels to obtain the recognition result of the audio data, thereby realizing Chinese speech recognition based on a plurality of speech recognition models, covering more application scenes, and further improving the accuracy of the Chinese speech recognition and the robustness of the models based on the fusion decoding.
The results of the above method provided by the embodiment of the present disclosure, compared with a prior-art speech recognition method, are shown in Table 1 below. It can be seen that the recognition accuracy of the prior art is low, with many recognition errors, while the recognition accuracy of the embodiment of the present disclosure is very high. The method can therefore markedly improve speech recognition accuracy in scenes with unevenly distributed data, such as homophone-heavy speech.
TABLE 1 (reproduced as an image in the original publication)
Exemplary Medium
Having described the method of the exemplary embodiment of the present disclosure, the medium of the exemplary embodiment of the present disclosure is explained next with reference to fig. 8.
In some possible embodiments, various aspects of the disclosure may also be implemented as a computer-readable medium on which a program is stored, which, when executed by a processor, is for implementing steps in a speech recognition method according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification.
Specifically, the processor is configured to implement the following steps when executing the program:
performing feature extraction on audio data to be identified to obtain acoustic features corresponding to the audio data; inputting the acoustic features into a plurality of pre-trained voice recognition models to respectively obtain a plurality of target probability distributions corresponding to the voice recognition models, wherein the target probability distributions correspond to a plurality of levels of acoustic label systems, and the corresponding target probability distributions represent the matching degree between each acoustic label and the acoustic features under the acoustic label system of the level; and performing fusion decoding on the target probability distributions to obtain an identification result of the audio data.
It should be noted that: the above-mentioned medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 8, a medium 80 according to an embodiment of the present disclosure is described, which may employ a portable compact disc read-only memory (CD-ROM), include a program, and be run on a device. However, the disclosure is not so limited; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take a variety of forms, including, but not limited to: an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN).
Exemplary devices
Having described the media of the exemplary embodiments of the present disclosure, the apparatus of the exemplary embodiments of the present disclosure is described next with reference to fig. 9.
As shown in fig. 9, a speech recognition apparatus of an embodiment of the present disclosure may include:
an extraction module 901, configured to perform feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data;
a recognition module 902, configured to input the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models, where the plurality of target probability distributions correspond to acoustic label systems of a plurality of levels, and each target probability distribution represents the matching degree between each acoustic label and the acoustic features under the acoustic label system of the corresponding level;
and a fusion module 903, configured to perform fusion decoding on the plurality of target probability distributions to obtain a recognition result of the audio data.
In one possible embodiment, the acoustic label systems of the plurality of levels include at least two of: a text-level acoustic label system, a syllable-level acoustic label system, a phone-level acoustic label system, and a phone-level acoustic label system with contextual information.
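For illustration only, the following snippet shows what label sequences at these levels might look like for the Mandarin phrase "你好" (ni3 hao3); the symbol inventories and the context-dependent phone notation are illustrative assumptions, not definitions taken from this disclosure.

```python
# Illustrative (hypothetical) labels for one utterance under the four levels.
label_systems = {
    "text":      ["你", "好"],                      # one label per character
    "syllable":  ["ni3", "hao3"],                   # tonal syllables (pinyin)
    "phone":     ["n", "i3", "h", "ao3"],           # initials / tonal finals
    # context-dependent phones: each phone annotated with its left/right context
    "phone_ctx": ["sil-n+i3", "n-i3+h", "i3-h+ao3", "h-ao3+sil"],
}
for level, labels in label_systems.items():
    print(f"{level:>9}: {labels}")
```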
In a possible embodiment, the fusion module includes:
a construction submodule, configured to construct a decoding path of each speech recognition model according to the plurality of target probability distributions, where the decoding path represents the recognition process of the corresponding speech recognition model on the acoustic features, and the decoding path obtained after recognition is completed represents the acoustic labels recognized for the acoustic features;
and a calculation submodule, configured to calculate a decoding objective function based on the decoding paths of the speech recognition models, find the decoding path that maximizes the decoding objective function, and take that decoding path as the recognition result of the audio data.
In a possible embodiment, the above-mentioned construction submodule comprises:
a determining unit, configured to determine the elements in the acoustic label system corresponding to each speech recognition model;
and a construction unit, configured to construct the decoding path of the speech recognition model by taking the elements as prefixes and based on the target probability distribution corresponding to the speech recognition model.
In a possible embodiment, the above-mentioned construction unit is configured to:
taking the first element in the acoustic label system corresponding to the speech recognition model as the prefix, selecting candidate results of the next element according to the target probability distribution of the next element, and constructing the current decoding path for decoding from the prefix and the candidate results;
and so on: in each decoding step, the previous decoding path is taken as the current prefix and combined with the next element to construct the current decoding path, until the complete decoding path is obtained.
In a possible embodiment, the above construction unit is specifically configured to select the candidate result of the next element as follows:
sorting the recognition results of the next element from high to low according to the target probability distribution, and selecting a specified number of the highest-ranked target probabilities;
and taking the recognition results corresponding to the selected target probabilities as the candidate results of the next element.
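For illustration only, the following sketch constructs decoding paths by prefix extension with a specified number (beam_width) of highest-ranked candidates per step, under a single label system; it assumes, as a simplification, that the per-step target probability distribution does not depend on the prefix.

```python
import numpy as np

def build_decoding_paths(probs: np.ndarray, beam_width: int = 3):
    """probs[t, v] = probability of label v at decoding step t (one label system)."""
    beams = [((), 0.0)]                                   # (prefix, log-score)
    for step_probs in probs:
        # keep only the `beam_width` highest-ranked labels as candidate results
        top = np.argsort(step_probs)[::-1][:beam_width]
        new_beams = []
        for prefix, score in beams:
            for label in top:
                new_beams.append((prefix + (int(label),),
                                  score + float(np.log(step_probs[label] + 1e-12))))
        # prune to the best partial decoding paths before the next step
        beams = sorted(new_beams, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

# Toy usage: 4 decoding steps over a 6-label system with random probabilities.
rng = np.random.default_rng(0)
p = rng.random((4, 6))
p /= p.sum(axis=1, keepdims=True)
print(build_decoding_paths(p, beam_width=2))
```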
In a possible implementation, the calculation submodule includes:
a first calculation unit configured to calculate a prefix score corresponding to each of the speech recognition models based on a decoding path of each of the speech recognition models;
and a second calculation unit, configured to multiply the prefix score of each speech recognition model by the corresponding set weight, and then sum all the obtained products to obtain the decoding objective function.
In a possible implementation, the first calculation unit is configured to:
and, for each speech recognition model, calculating the selection rate of each decoding path of the speech recognition model, summing the obtained selection rates of all the decoding paths, and then taking the logarithm to obtain the prefix score of the speech recognition model.
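For illustration only, the following sketch computes a prefix score as the logarithm of the summed selection rates of a model's decoding paths and combines the prefix scores of several models into a decoding objective function as a weighted sum; the path selection rates and weights shown are placeholder values, not values from this disclosure.

```python
import math

def prefix_score(path_selection_rates):
    # sum the selection rates of the model's decoding paths, then take the log
    return math.log(sum(path_selection_rates))

def decoding_objective(per_model_path_rates, weights):
    # weighted sum of the per-model prefix scores
    scores = [prefix_score(rates) for rates in per_model_path_rates]
    return sum(w * s for w, s in zip(weights, scores))

# Toy example: two models (e.g. syllable-level and phone-level), weights 0.6/0.4.
paths_model_a = [0.20, 0.05]         # selection rates of paths kept for model A
paths_model_b = [0.10, 0.08, 0.02]   # selection rates of paths kept for model B
print(decoding_objective([paths_model_a, paths_model_b], weights=[0.6, 0.4]))
```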
In a possible embodiment, the calculation submodule is configured to:
uniformly convert the decoding paths of all the speech recognition models in the decoding objective function into decoding paths at a specified level by using a preset sparse matrix, where the sparse matrix describes the mapping relationship between acoustic label system elements of different levels and is used for converting the acoustic label system elements of one level into the acoustic label system elements of another level;
and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
In one possible embodiment, the specified level is the text level, the syllable level, the phone level, or the phone level with contextual information.
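For illustration only, the following sketch shows the level-conversion idea with a tiny 0/1 mapping matrix between a hypothetical phone inventory and a hypothetical syllable inventory; a phone-level decoding path is projected to the syllable level by looking up, for each phone, the syllable that contains it.

```python
import numpy as np

syllables = ["ni3", "hao3"]              # elements of the specified (target) level
phones = ["n", "i3", "h", "ao3"]         # elements of the other (source) level
# mapping[i, j] = 1 if phone j belongs to syllable i (a sparse 0/1 matrix)
mapping = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 1]], dtype=np.int8)

# Convert a phone-level decoding path into the corresponding syllable-level path.
phone_path = ["n", "i3", "h", "ao3"]
syllable_path = []
for ph in phone_path:
    i = int(np.argmax(mapping[:, phones.index(ph)]))   # syllable owning this phone
    if not syllable_path or syllable_path[-1] != syllables[i]:
        syllable_path.append(syllables[i])
print(syllable_path)   # ['ni3', 'hao3']
```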
In a possible embodiment, the apparatus further comprises at least one of:
a first ending module, configured to end the decoding if a sentence-end mark is detected during decoding;
a second ending module, configured to end the decoding if a specified duration has elapsed after a silence mark is detected during decoding;
and a third ending module, configured to end the decoding if the current state conforms to a specified ending state during decoding.
In a possible implementation, the extraction module is configured to:
sampling the audio data to be recognized according to windows of a specified duration and a specified interval;
performing a discrete Fourier transform on the sampling points in each window;
calculating the energy in the Mel space according to the result of the discrete Fourier transform;
and filtering the energy in the Mel space and performing a discrete cosine transform to obtain Mel-frequency cepstral coefficients, which are taken as the acoustic features corresponding to the audio data.
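For illustration only, the following self-contained NumPy sketch follows these four steps (windowing, discrete Fourier transform, Mel-space energy, discrete cosine transform); the window length, hop, filter count and coefficient count are common but assumed values, not parameters prescribed by this disclosure, and the logarithm applied before the DCT is one possible form of the filtering step.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, win=0.025, hop=0.010, n_filters=26, n_ceps=13):
    frame_len, frame_hop, n_fft = int(win * sr), int(hop * sr), 512
    # 1. sample the audio with fixed-length windows at a fixed interval
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_hop)
    frames = np.stack([signal[i * frame_hop: i * frame_hop + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # 2. discrete Fourier transform of the sampling points in each window
    spectrum = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. energy in the Mel space: triangular Mel filterbank applied to the spectrum
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = np.log(spectrum @ fbank.T + 1e-10)   # log as the filtering step
    # 4. discrete cosine transform of the Mel energies -> MFCC features
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return mel_energy @ dct.T                          # shape: (frames, n_ceps)

# Toy usage on one second of random audio.
print(mfcc(np.random.randn(16000)).shape)
```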
In a possible embodiment, the above apparatus further comprises:
and a training module, configured to train the plurality of speech recognition models by using speech training data and initial labels, where the initial labels are texts at the coarsest level corresponding to the speech training data.
In a possible embodiment, the training module is configured to:
inputting the speech training data into the plurality of speech recognition models to respectively obtain the corresponding acoustic labels;
calculating a cost function for each obtained acoustic label and setting a corresponding weight, and summing the products of the cost function of each acoustic label and its corresponding weight to obtain a total cost function;
and training the plurality of speech recognition models according to the initial labels, with the goal of minimizing the total cost function.
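For illustration only, the following sketch forms the total cost function as the weighted sum of per-level cost functions; the cost values and weights are placeholders, and in practice each per-level cost would be a loss (for example CTC or cross-entropy) computed against the labels derived from the initial labels.

```python
def total_cost(per_level_costs, weights):
    # weighted sum of the cost functions of the individual acoustic label systems
    assert len(per_level_costs) == len(weights)
    return sum(w * c for w, c in zip(weights, per_level_costs))

# Toy example: costs from text-, syllable- and phone-level models.
costs = {"text": 2.31, "syllable": 1.87, "phone": 1.42}
weights = {"text": 0.5, "syllable": 0.3, "phone": 0.2}
loss = total_cost(list(costs.values()), list(weights.values()))
print(f"total cost to minimize: {loss:.3f}")
```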
According to the apparatus provided by the embodiment of the present disclosure, feature extraction is performed on the audio data to be recognized, the extracted acoustic features are input into the plurality of pre-trained speech recognition models to obtain the corresponding target probability distributions, and fusion decoding is performed on these distributions to obtain the recognition result. Because the recognition result is obtained by fusion decoding on the basis of the recognition of a plurality of speech recognition models, and the speech recognition models correspond to acoustic label systems of a plurality of levels and can recognize the audio data under these acoustic label systems, more application scenarios can be covered, such as scenarios with uneven data distribution including rare words and many homophones, and the accuracy of Chinese speech recognition and the robustness of the models are further improved on the basis of the fusion decoding.
Exemplary computing device
Having described the methods, media, and apparatus of the exemplary embodiments of the present disclosure, a computing device of the exemplary embodiments of the present disclosure is described next with reference to fig. 10.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
In some possible implementations, a computing device according to embodiments of the present disclosure may include at least one processing unit and at least one memory unit. Wherein the storage unit stores program code that, when executed by the processing unit, causes the processing unit to perform the steps in the speech recognition methods according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
A computing device 100 according to such an embodiment of the present disclosure is described below with reference to fig. 10. The computing device 100 shown in fig. 10 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the disclosure.
As shown in fig. 10, computing device 100 is embodied in the form of a general purpose computing device. Components of computing device 100 may include, but are not limited to: the at least one processing unit 1001, the at least one storage unit 1002, and a bus 1003 connecting the different system components (including the processing unit 1001 and the storage unit 1002).
The bus 1003 includes a data bus, a control bus, and an address bus.
The storage unit 1002 can include readable media in the form of volatile memory, such as Random Access Memory (RAM) 10021 and/or cache memory 10022, and can further include readable media in the form of non-volatile memory, such as Read Only Memory (ROM) 10023.
The storage unit 1002 may also include a program/utility 10025 having a set (at least one) of program modules 10024, such program modules 10024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 100 may also communicate with one or more external devices 1004 (e.g., keyboard, pointing device, etc.). Such communication may occur via input/output (I/O) interface 1005. Moreover, computing device 100 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through network adapter 1006. As shown in FIG. 10, network adapter 1006 communicates with the other modules of computing device 100 via bus 1003. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although in the above detailed description several units/modules or sub-units/sub-modules of the speech recognition apparatus are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects, which is made for convenience of description only, imply that features in these aspects cannot be combined to advantage. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A speech recognition method, comprising:
performing feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data;
inputting the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models, wherein the plurality of target probability distributions correspond to acoustic label systems of a plurality of levels, and each target probability distribution represents the matching degree between each acoustic label and the acoustic features under the acoustic label system of the corresponding level;
and performing fusion decoding on the plurality of target probability distributions to obtain a recognition result of the audio data.
2. The method of claim 1, wherein the acoustic label systems of the plurality of levels include at least two of: a text-level acoustic label system, a syllable-level acoustic label system, a phone-level acoustic label system, and a phone-level acoustic label system with contextual information.
3. The method of claim 1, wherein performing the fusion decoding on the plurality of target probability distributions to obtain the recognition result of the audio data comprises:
constructing a decoding path of each speech recognition model according to the plurality of target probability distributions, wherein the decoding path represents the recognition process of the corresponding speech recognition model on the acoustic features, and the decoding path obtained after recognition is completed represents the acoustic labels recognized for the acoustic features;
and calculating a decoding objective function based on the decoding paths of the speech recognition models, finding the decoding path that maximizes the decoding objective function, and taking that decoding path as the recognition result of the audio data.
4. The method of claim 3, wherein constructing a decoding path for each speech recognition model based on the plurality of target probability distributions comprises:
determining elements in an acoustic label system corresponding to each voice recognition model, and constructing a decoding path of the voice recognition model by taking the elements as prefixes and based on target probability distribution corresponding to the voice recognition model.
5. The method of claim 4, wherein constructing the decoding path of the speech recognition model by using the element as the prefix and based on the target probability distribution corresponding to the speech recognition model comprises:
selecting candidate results of the next element according to the target probability distribution of the next element by taking the first element in the acoustic label system corresponding to the speech recognition model as the prefix, and constructing the current decoding path for decoding from the prefix and the candidate results;
and so on: in each decoding step, the previous decoding path is taken as the current prefix and combined with the next element to construct the current decoding path, until the complete decoding path is obtained.
6. The method of claim 3, wherein computing a decoding objective function based on the decoding paths of the respective speech recognition models comprises:
calculating a prefix score corresponding to the respective speech recognition model based on the decoding path of the respective speech recognition model;
and multiplying the prefix score of each speech recognition model by the corresponding set weight, and then summing all the obtained products to obtain the decoding objective function.
7. The method according to claim 3, wherein finding the decoding path that maximizes the decoding objective function and taking the decoding path as the recognition result of the audio data comprises:
uniformly converting the decoding paths of all the speech recognition models in the decoding objective function into decoding paths at a specified level by using a preset sparse matrix, wherein the sparse matrix describes the mapping relationship between acoustic label system elements of different levels and is used for converting the acoustic label system elements of one level into the acoustic label system elements of another level;
and finding a decoding path which maximizes the decoding objective function, and using the decoding path as the recognition result of the audio data at the specified level.
8. A speech recognition apparatus, comprising:
an extraction module, configured to perform feature extraction on audio data to be recognized to obtain acoustic features corresponding to the audio data;
a recognition module, configured to input the acoustic features into a plurality of pre-trained speech recognition models to respectively obtain a plurality of target probability distributions corresponding to the speech recognition models, wherein the plurality of target probability distributions correspond to acoustic label systems of a plurality of levels, and each target probability distribution represents the matching degree between each acoustic label and the acoustic features under the acoustic label system of the corresponding level;
and a fusion module, configured to perform fusion decoding on the plurality of target probability distributions to obtain a recognition result of the audio data.
9. A medium storing a computer program, characterized in that the program, when executed by a processor, carries out the method according to any one of claims 1-7.
10. A computing device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-7.
CN202110698074.7A 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment Active CN113327597B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110698074.7A CN113327597B (en) 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment

Publications (2)

Publication Number Publication Date
CN113327597A true CN113327597A (en) 2021-08-31
CN113327597B CN113327597B (en) 2023-08-22

Family

ID=77424406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110698074.7A Active CN113327597B (en) 2021-06-23 2021-06-23 Speech recognition method, medium, device and computing equipment

Country Status (1)

Country Link
CN (1) CN113327597B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013002674A1 (en) * 2011-06-30 2013-01-03 Kocharov Daniil Aleksandrovich Speech recognition system and method
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-irrelevant key word recognition method and system
CN103700369A (en) * 2013-11-26 2014-04-02 安徽科大讯飞信息科技股份有限公司 Voice navigation method and system
CN105161092A (en) * 2015-09-17 2015-12-16 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN108510990A (en) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Audio recognition method, device, user equipment and storage medium
CN111508497A (en) * 2019-01-30 2020-08-07 北京猎户星空科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110534095A (en) * 2019-08-22 2019-12-03 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
US20210134312A1 (en) * 2019-11-06 2021-05-06 Microsoft Technology Licensing, Llc Audio-visual speech enhancement
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN112652306A (en) * 2020-12-29 2021-04-13 珠海市杰理科技股份有限公司 Voice wake-up method and device, computer equipment and storage medium
CN112802461A (en) * 2020-12-30 2021-05-14 深圳追一科技有限公司 Speech recognition method and device, server, computer readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539273A (en) * 2021-09-16 2021-10-22 腾讯科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium
CN113948085A (en) * 2021-12-22 2022-01-18 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium
CN113948085B (en) * 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium
US11501759B1 (en) 2021-12-22 2022-11-15 Institute Of Automation, Chinese Academy Of Sciences Method, system for speech recognition, electronic device and storage medium

Also Published As

Publication number Publication date
CN113327597B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
US8868431B2 (en) Recognition dictionary creation device and voice recognition device
JP7200405B2 (en) Context Bias for Speech Recognition
US8065149B2 (en) Unsupervised lexicon acquisition from speech and text
JP2559998B2 (en) Speech recognition apparatus and label generation method
CN109686383B (en) Voice analysis method, device and storage medium
JP2002258890A (en) Speech recognizer, computer system, speech recognition method, program and recording medium
US20080027725A1 (en) Automatic Accent Detection With Limited Manually Labeled Data
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
CN112331229B (en) Voice detection method, device, medium and computing equipment
JP2021018413A (en) Method, apparatus, device, and computer readable storage medium for recognizing and decoding voice based on streaming attention model
CN111611349A (en) Voice query method and device, computer equipment and storage medium
Yue et al. End-to-end code-switching asr for low-resourced language pairs
CN113327597B (en) Speech recognition method, medium, device and computing equipment
US8805871B2 (en) Cross-lingual audio search
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
KR20230086737A (en) Cascade Encoders for Simplified Streaming and Non-Streaming Speech Recognition
JP5688761B2 (en) Acoustic model learning apparatus and acoustic model learning method
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
Tanaka et al. Neural speech-to-text language models for rescoring hypotheses of dnn-hmm hybrid automatic speech recognition systems
KR101424496B1 (en) Apparatus for learning Acoustic Model and computer recordable medium storing the method thereof
US7272560B2 (en) Methodology for performing a refinement procedure to implement a speech recognition dictionary
Zhou et al. UnitNet: A sequence-to-sequence acoustic model for concatenative speech synthesis
US20210174789A1 (en) Automatic speech recognition device and method
KR102299269B1 (en) Method and apparatus for building voice database by aligning voice and script

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant