CN113823265A - Voice recognition method and device and computer equipment - Google Patents

Voice recognition method and device and computer equipment

Info

Publication number
CN113823265A
Authority
CN
China
Prior art keywords
phoneme
target
voice
frame
speech
Prior art date
Legal status
Pending
Application number
CN202110815555.1A
Other languages
Chinese (zh)
Inventor
胡鹏飞
麻国栋
黄申
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110815555.1A
Publication of CN113823265A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The embodiment of the application discloses a voice recognition method, a voice recognition device, and computer equipment. The method and the device can acquire at least one speech feature frame of voice data in a target language; perform phoneme alignment on the at least one speech feature frame to obtain a target phoneme set of the voice data in the target language; perform word unit alignment on the at least one speech feature frame to obtain a target word set of the voice data in the target language, the target word set comprising a word unit corresponding to each speech feature frame; perform text mapping on the at least one speech feature frame to obtain an initial speech recognition text of the voice data in the target language; and adjust the initial speech recognition text according to the target phoneme set and the target word set to obtain and output the speech recognition text of the voice data, thereby improving the accuracy of speech recognition.

Description

Voice recognition method and device and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a speech recognition method, apparatus, and computer device.
Background
In recent years, with the rapid development of information science and technology, speech recognition technology has also developed rapidly and is gradually changing the way we live and work. For example, products such as voice-controlled dialing systems, voice-controlled intelligent toys, and intelligent household appliances make man-machine communication simple and easy.
However, there are many different languages, such as Chinese, English, Russian, and Arabic, and each language has its own features. For example, some languages exhibit a voice weakening phenomenon. Existing speech recognition systems generally model this phenomenon with a multi-pronunciation dictionary, but the weakening phenomenon cannot be exhausted during modeling. If an existing speech recognition system is used to recognize speech exhibiting the voice weakening phenomenon, the accuracy of speech recognition is reduced.
Disclosure of Invention
The embodiment of the application provides a voice recognition method, a voice recognition device and computer equipment, and the accuracy of voice recognition can be improved.
The embodiment of the application provides a voice recognition method, which comprises the following steps:
acquiring at least one speech feature frame of voice data in a target language;
performing phoneme alignment on the at least one speech feature frame to obtain a target phoneme set of the voice data in the target language;
performing word unit alignment on the at least one speech feature frame to obtain a target word set of the voice data in the target language, wherein the target word set comprises a word unit corresponding to each speech feature frame;
performing text mapping on the at least one speech feature frame to obtain an initial speech recognition text of the voice data in the target language;
and adjusting the initial speech recognition text according to the target phoneme set and the target word set to obtain and output the speech recognition text of the voice data.
Correspondingly, the embodiment of the present application further provides a speech recognition apparatus, including:
an acquisition unit, configured to acquire at least one speech feature frame of voice data in a target language;
a phoneme alignment unit, configured to perform phoneme alignment on the at least one speech feature frame to obtain a target phoneme set of the voice data in the target language;
a word unit alignment unit, configured to perform word unit alignment on the at least one speech feature frame to obtain a target word set of the voice data in the target language, where the target word set includes a word unit corresponding to each speech feature frame;
a text mapping unit, configured to perform text mapping on the at least one speech feature frame to obtain an initial speech recognition text of the voice data in the target language;
and an adjusting unit, configured to adjust the initial speech recognition text according to the target phoneme set and the target word set to obtain and output the speech recognition text of the voice data.
In one embodiment, the phoneme alignment unit includes:
the path searching subunit is used for performing path searching on each voice characteristic frame in a preset phoneme searching space to obtain at least one phoneme searching path;
a calculating subunit, configured to calculate an accumulated probability of the speech feature frame on each phoneme search path;
and the determining subunit is used for determining the target phoneme set of the voice data according to the cumulative probability.
In one embodiment, the path searching subunit includes:
the feature enhancement module is used for carrying out feature enhancement on the voice feature frame under the phoneme granularity to obtain the phoneme features of the voice feature frame;
the screening module is used for screening a target phoneme set from the multiple phoneme sets according to the phoneme characteristics;
and the phoneme searching module is used for searching phonemes in the target phoneme set according to the phoneme characteristics and generating at least one phoneme searching path according to a searching result.
In one embodiment, the phoneme search module includes:
a calculation submodule for calculating matching probabilities between the phoneme feature and the plurality of phoneme nodes, respectively;
a determining submodule for determining at least one target phoneme node among the plurality of phoneme nodes according to the matching probability;
and the association submodule is used for associating the target phoneme nodes of each phoneme feature to obtain at least one target search path.
In one embodiment, the word unit alignment unit includes:
the path searching subunit is used for performing path searching on each voice characteristic frame in a preset dictionary searching space to obtain at least one word unit searching path;
the calculating subunit is used for calculating the cumulative probability of the voice characteristic frame on each word unit searching path;
and the determining subunit is used for determining the target word set of the voice feature frame according to the cumulative probability.
In one embodiment, the text mapping unit includes:
an attention feature extraction subunit, configured to perform attention feature extraction on the speech feature frame in multiple attention dimensions, so as to obtain attention features of the speech feature frame in each attention dimension;
and the decoding subunit is used for decoding the attention features on the attention dimensions to obtain an initial speech recognition text of the speech data in the target language.
In an embodiment, the adjusting unit includes:
a recognition subunit, configured to recognize phoneme information of the initial speech recognition text;
the first adjusting subunit is configured to adjust the phoneme information by using the target phoneme set to obtain a phoneme-adjusted speech recognition text;
and the second adjusting subunit is used for adjusting the voice recognition text after the phoneme adjustment by using the target word set to obtain and output the voice recognition text of the voice data.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the method provided in the various alternatives of the above aspect.
Correspondingly, the embodiment of the present application further provides a storage medium, where the storage medium stores instructions, and the instructions, when executed by a processor, implement the speech recognition method provided in any of the embodiments of the present application.
The method and the device can acquire at least one speech feature frame of voice data in a target language; perform phoneme alignment on the at least one speech feature frame to obtain a target phoneme set of the voice data in the target language; perform word unit alignment on the at least one speech feature frame to obtain a target word set of the voice data in the target language, the target word set comprising a word unit corresponding to each speech feature frame; perform text mapping on the at least one speech feature frame to obtain an initial speech recognition text of the voice data in the target language; and adjust the initial speech recognition text according to the target phoneme set and the target word set to obtain and output the speech recognition text of the voice data, thereby improving the accuracy of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic scene diagram of a speech recognition method provided in an embodiment of the present application;
FIG. 2 is a flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a scenario for performing windowing sliding on voice data according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a preset speech recognition model provided in an embodiment of the present application;
fig. 5 is a schematic flowchart of generating a phoneme identification frame according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a scene for performing phoneme masking on a phoneme annotation frame according to an embodiment of the present application;
fig. 7 is a scene schematic diagram for training a preset speech recognition model to be trained according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a scene of a preset phoneme search space provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a scenario of a path search provided in an embodiment of the present application;
FIG. 10 is a schematic flow chart of a speech recognition method provided in the embodiments of the present application;
FIG. 11 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is apparent that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a voice recognition method, which can be executed by a speech recognition apparatus, and the speech recognition apparatus can be integrated in computer equipment. The computer device may include a terminal, a server, and the like.
The terminal may be a notebook Computer, a Personal Computer (PC), an on-board Computer, or the like.
The server may be an interworking server or a background server among a plurality of heterogeneous systems; an independent physical server; a server cluster or distributed system formed by a plurality of physical servers; or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data, and artificial intelligence platforms.
In an embodiment, as shown in fig. 1, the speech recognition apparatus may be integrated on a computer device such as a terminal or a server, so as to implement the speech recognition method provided in the embodiment of the present application. Specifically, the computer device may obtain at least one speech feature frame of the speech data in the target language; performing phoneme alignment on at least one voice characteristic frame to obtain a target phoneme set of the voice data in the target language; performing word unit alignment on at least one voice characteristic frame to obtain a target word set of voice data in a target language, wherein the target word set comprises a word unit corresponding to each voice characteristic frame; performing text mapping on at least one voice characteristic frame to obtain an initial voice recognition text of voice data in a target language; and adjusting the initial voice recognition text according to the target phoneme set and the target word set to obtain and output the voice recognition text of the voice data.
The following are detailed below, and it should be noted that the order of description of the following examples is not intended to limit the preferred order of the examples.
The embodiment of the present application will be described from the perspective of a speech recognition device, which may be integrated in a computer device, where the computer device may be a server or a terminal.
As shown in fig. 2, a speech recognition method is provided, and the specific process includes:
101. at least one speech feature frame of speech data in a target language is obtained.
The target language may include various languages in which a special phenomenon occurs in daily use. For example, the target language may include a language in which a voice weakening phenomenon exists in daily use. As another example, the target language may include a language in which a tremolo phenomenon exists in daily use. As another example, the target language may include a language in which a swallowing phenomenon exists in daily use, and the like.
Among them, the language in which the voice weakening phenomenon exists may include a language in which vowels and consonants have a weakening phenomenon.
In one embodiment, when the target language includes a language in which a voice weakening phenomenon exists in daily use, the target language may be German, French, Russian, Italian, or a language of the Altaic family.
The Altaic family, also called the Altai language family, is a group of languages classified by linguists according to the genetic classification method and includes more than 60 languages. The Altaic family comprises 3 branches: Mongolic, Turkic, and Tungusic. Its languages are spoken mainly in Central Asia, Western Asia, East Asia, Siberia, and some countries in the eastern part of Europe. They mainly include: Mongolian, Buryat, Kalmyk, Daur, Manchu, Xibe, Hezhen, Evenki, Oroqen, Dongxiang, Yugur, Turkmen, Azerbaijani, Uzbek, Kazakh, Kyrgyz, Tatar, and Moghol.
In one embodiment, the speech feature frames include frames that can identify speech data features.
For example, sound is actually a wave, and what the speech recognition task is faced with is a sequence of samples after several signal processes, also called a waveform. Wherein the waveform may be voice data. The speech feature frame is a data frame obtained after feature extraction is performed on the speech data.
In one embodiment, the speech feature frames have different expression forms according to different ways of feature extraction.
For example, when feature extraction is performed on voice data using Mel-scale Frequency Cepstral Coefficients (MFCCs), the speech feature frame may be an MFCC feature. For another example, when feature extraction is performed on voice data using the filter bank method (FBank), the speech feature frame may be an FBank feature. For another example, when feature extraction is performed on a voice data frame using Linear Prediction Coefficients (LPCs), the speech feature frame may be an LPC feature.
In one embodiment, when feature extraction is performed on voice data using MFCCs, the voice data may first be divided into frames by sliding windowing. For example, as shown in fig. 3, 001 in fig. 3 may be the voice data, which is divided into frames by the sliding-window method. When sliding windowing is applied, the frame length is usually 25 ms and the frame shift is 10 ms; this keeps the signal within each frame stable while making adjacent frames overlap, which improves reliability.
Next, a Fast Fourier Transform (FFT) may be performed on each frame and the power spectrum calculated. Then, a mel filter bank is applied to the power spectrum, and the logarithmic energy in each filter is obtained as a coefficient. Finally, a Discrete Cosine Transform (DCT) may be performed on the resulting vector of mel-filter logarithmic energies, thereby obtaining the speech feature frames.
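As a concrete illustration of the pipeline just described, the following is a minimal numpy/scipy sketch of MFCC extraction. The 16 kHz sample rate, FFT size, and filterbank sizes are illustrative assumptions; only the 25 ms frame length and 10 ms frame shift come from the passage above, and this is a sketch of the general technique rather than the application's exact extraction.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel-scale filters laid over the FFT bins."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frames(signal, sr=16000, frame_ms=25, shift_ms=10,
                n_fft=512, n_mels=26, n_ceps=13):
    """Sliding window -> FFT power spectrum -> mel filterbank -> log -> DCT."""
    frame_len = sr * frame_ms // 1000    # 25 ms frame length
    frame_shift = sr * shift_ms // 1000  # 10 ms frame shift, so frames overlap
    fb = mel_filterbank(n_mels, n_fft, sr)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft  # power spectrum
        log_energy = np.log(fb @ power + 1e-10)                 # log mel-filter energies
        feats.append(dct(log_energy, norm='ortho')[:n_ceps])    # DCT -> cepstral coeffs
    return np.stack(feats)  # one MFCC vector per speech feature frame
```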
In an embodiment, the speech recognition apparatus provided in the embodiments of the present application may be integrated into various computer devices. For example, the speech recognition device provided in the embodiment of the present application may be integrated into a mobile phone, so that when a person controls the mobile phone through a voice, the mobile phone may recognize corresponding text information in the voice through the speech recognition device. For another example, the speech recognition device provided by the embodiment of the present application may be integrated in various smart homes, so that when people control a smart home through sound, the smart home may recognize corresponding text information in the sound through the speech recognition device.
In an embodiment, in order to more conveniently implement the speech recognition method provided by the embodiment of the present application, a preset speech recognition model is provided by the embodiment of the present application. The preset voice recognition model can be an end-to-end voice recognition model, and voice data can be directly converted into voice recognition texts through the preset voice recognition model, so that the accuracy and the efficiency of voice recognition are improved.
In one embodiment, the model architecture of the predetermined speech recognition model may include an encoding layer, a phoneme alignment layer, a word unit alignment layer, a decoding layer, and an attention layer, for example, as shown in FIG. 4.
The coding layer can acquire voice data and perform feature extraction on the voice data.
The phoneme alignment layer may be configured to perform phoneme alignment on at least one speech feature frame to obtain a target phoneme set of the speech data in the target language.
The word unit alignment layer may be configured to perform word unit alignment on at least one speech feature frame to obtain a target word set of the speech data in the target language.
The attention layer and the decoding layer can perform text mapping on at least one voice feature frame to obtain an initial voice recognition text of the voice data in the target language.
In addition, the preset speech recognition model can also adjust the initial speech recognition text according to the target phoneme set and the target word set to obtain and output the speech recognition text of the speech data.
Wherein, the coding layer can be a machine learning network or a deep learning network. For example, the coding layer may be any one of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), De-Convolutional Networks (DNs), Deep Neural Networks (DNNs), Deep Convolutional Inverse Graphics Networks (DCIGNs), Region-based Convolutional Networks (RCNNs), Faster Region-based Convolutional Networks (Faster RCNNs), Bidirectional Encoder Representations from Transformers (BERT) models, and the like.
The decoding layer can also be a machine learning network or a deep learning network. For example, the decoding layer may be one of CNN, RNN, DN, DNN, etc. networks.
Wherein the attention layer may include a machine learning network or a deep learning network with an attention mechanism. Among them, the attention mechanism is derived from the study of human vision. In cognitive science, humans selectively focus on a portion of all information while ignoring other visible information due to bottlenecks in information processing. The above mechanism is commonly referred to as an attention mechanism. Different parts of the human retina have different degrees of information processing capabilities, i.e., acuity, with only the foveal part having the strongest acuity. In order to make reasonable use of limited visual information processing resources, a human needs to select a specific portion in a visual region and then focus on it. For example, when a person is reading, only a few words to be read are usually attended to and processed. In summary, the attention mechanism has two main aspects: deciding which part of the input needs to be focused on; limited information processing resources are allocated to the important parts.
The phoneme alignment layer may be a machine learning network or a deep learning network with a Connectionist Temporal Classification (CTC) algorithm. For example, the phoneme alignment layer may be a CTC-based modified RNN.
The word unit alignment layer may also be a machine learning network or a deep learning network with CTCs. For example, the word unit alignment layer may be a CTC-based modified RNN.
The phoneme alignment layer is used for performing phoneme alignment on the speech characteristic frames under the phoneme granularity, the word unit alignment layer is used for performing word unit alignment on the speech characteristic frames under the word granularity, and the emphasis points of the phoneme alignment layer and the word unit alignment layer are different.
In an embodiment, the preset speech recognition model provided in the embodiment of the present application forms an end-to-end hybrid speech recognition model by combining CTC with the attention layer, so that the preset speech recognition model can learn alignment information at more granularities. The model can thus better find the alignment from speech features to the various modeling unit sequences (in the embodiment of the present application, the modeling units are phoneme units and word units), which finally improves the performance of the speech recognition model.
In an embodiment, before performing speech recognition by using the preset speech recognition model, the speech recognition model to be trained may be trained, so as to obtain the preset speech recognition model. Specifically, the step of training the speech recognition model to be trained may include:
acquiring a plurality of phoneme identification frames and a speech recognition model to be trained;
performing phoneme masking processing on the plurality of phoneme identification frames to obtain masked phoneme identification frames;
and training the speech recognition model to be trained by utilizing the masked phoneme identification frame to obtain a preset speech recognition model.
The speech recognition model to be trained is a model that still needs to be trained and whose speech recognition performance is therefore still poor.
A phoneme identification frame is an audio frame carrying phoneme identification information; through the phoneme identification information in the phoneme identification frame, the phoneme corresponding to the frame can be known.
Training the speech recognition model to be trained with phoneme identification frames is a form of supervised learning, so that developers can control the training process of the model, which improves the reliability and robustness of training.
In one embodiment, since the phoneme identification frame has phoneme identification information, before the phoneme identification frame is acquired, training data needs to be acquired, and the phoneme identification frame is generated according to the training data. The process of generating the phoneme identification frame according to the training data may be as shown in fig. 5. Specifically, the training data is first processed by a hidden markov model-gaussian mixture model to obtain first processed data. And then, processing the first processed data by a hidden Markov model-deep neural network to obtain second processed data. Then, the second processed data is subjected to alignment processing to obtain first alignment data. In addition, the training data is also directly subjected to alignment processing, so that second alignment data is obtained. And finally, combining the first alignment data and the second alignment data to obtain a phoneme identification frame.
The training data may include, among other things, various speech data in the target language.
Among them, Hidden Markov Models (HMMs) are statistical models that describe a markov process with hidden unknown parameters, and determine the hidden parameters of the process from the observable parameters, and then use these parameters for further analysis. In the field of speech recognition, HMM models the process of converting a sound feature into a pronunciation unit as a probabilistic problem using two stochastic processes, namely, a state transition process and an observation sampling process, and trains parameters of a hidden markov model through existing speech data. During decoding, the probability of converting the input acoustic features into the specific pronunciation unit sequence is estimated by using the corresponding parameters, so that the probability of outputting specific characters is obtained, and the characters which most possibly represent a certain section of sound are selected.
Among them, a Gaussian Mixture Model (GMM) is a model that accurately quantizes an object using Gaussian probability density functions (normal distribution curves) and decomposes the object into a plurality of Gaussian components. In the speech recognition field, in a standard hidden Markov model, when an observation is output from a hidden pronunciation state, it is necessary to model the probability distribution of the output. In classical speech recognition systems based on hidden Markov models, this process is typically modeled with a Gaussian mixture model.
The HMM-GMM model is a classical speech recognition system that uses probability theory and statistical knowledge to convert input speech data into text information.
Among them, Deep Neural Networks (DNNs) are also one of the probability models.
In recent years, with the development of artificial intelligence technology, researchers have tried to combine machine learning or deep learning with the HMM-GMM model to improve the performance of speech recognition; the HMM-DNN model is such a variant of the HMM-GMM model.
The HMM-DNN model may also convert input voice data into text information. However, unlike the HMM-GMM, the HMM-DNN requires frame-level alignment information when processing voice data, whereas the HMM-GMM does not; moreover, the HMM-GMM can itself generate frame-level alignment information.
Therefore, in generating the phoneme identification frame, the HMM-GMM may be used to generate the phoneme identification information of the training data first, and then the HMM-DNN may generate the first phoneme identification frame of the training data based on the phoneme identification information.
In one example, to improve the reliability of the phoneme identification frame, the training data may be further subjected to phoneme alignment using a model with CTCs to obtain a second phoneme identification frame of the speech data. Then, the first phoneme identification frame and the second phoneme identification frame are combined, so that the phoneme identification frame is obtained.
In an embodiment, in order to improve the accuracy of speech recognition performed by preset speech recognition, a plurality of phoneme identification frames may be subjected to phoneme masking processing, so as to obtain masked phoneme identification frames. And then, training the speech recognition model to be trained by utilizing the masked phoneme identification frame. And the masked phoneme identification frame is used for training the speech recognition model to be trained, so that the speech recognition model to be trained can learn more context knowledge of speech autonomously in the training process, and the recognition performance of the preset speech recognition model is improved.
In an embodiment, when the phoneme masking process is performed on the phoneme identification frame, a conversion process may be performed on part of the phoneme information in the phoneme identification frame so that the part of the phoneme information of the phoneme identification frame is masked. Specifically, the step of "performing phoneme masking processing on the plurality of phoneme identification frames respectively to obtain masked phoneme identification frames" may include:
screening target phoneme information from the phoneme information of the phoneme identification frame;
carrying out information conversion processing on the target phoneme information to obtain converted phoneme information;
and adding the converted phoneme information into the phoneme identification frame to obtain the masked phoneme identification information.
Wherein the phoneme information includes the information constituting a phoneme identification frame. For example, as shown in fig. 3, frame n and frame n+1 in fig. 3 are phoneme identification frames, wherein the waveform in a phoneme identification frame may be the phoneme information.
In one embodiment, several pieces of phoneme information may be screened from the phoneme information as the target phoneme information. For example, as shown in fig. 6, the phoneme information in the phoneme identification frame includes "n", "a", "d", "i", "va", "s", "i", "y", "a", "w", "a", and "H". Then, "i" and "a" can be screened out from these pieces of phoneme information as the target phoneme information.
In an embodiment, after the target phoneme information is screened out, information conversion processing may be performed on the target phoneme information, so as to obtain converted phoneme information. For example, the information of the phonemes "i" and "a" may be set to 0. For example, the waveforms corresponding to the phonemes "i" and "a" may be set to 0. For another example, the information of the phonemes "i" and "a" may be added to obtain the converted phoneme information. For example, waveforms of the phonemes "i" and "a" may be superimposed, and the waveform obtained by the superimposition may be used as the converted phoneme information. For another example, the information of the phonemes "i" and "a" may be added and averaged to obtain the converted phoneme information, and so on.
In one embodiment, after obtaining the converted phoneme information, the converted phoneme information may be added to the phoneme identification frame to obtain the masked phoneme identification information.
For example, the converted phoneme information may replace the target phoneme information, thereby obtaining the masked phoneme identification information.
For example, when the information of the phonemes "i" and "a" is set to 0, the waveforms corresponding to the phonemes "i" and "a" in the phoneme identification frame may be replaced with no waveform, thereby obtaining the masked phoneme identification information.
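The masking options above can be made concrete with a short numpy sketch. The frame layout and the (start, end) span bookkeeping for the target phonemes are illustrative assumptions, not the application's actual data format.

```python
import numpy as np

def mask_phonemes(frame, spans, mode="zero"):
    """Mask target phoneme spans inside a phoneme identification frame.

    frame : 1-D array of waveform samples of the frame
    spans : list of (start, end) sample ranges of the target phonemes (e.g. "i", "a")
    """
    masked = frame.copy()
    if mode == "zero":                        # set the target phoneme information to 0
        for start, end in spans:
            masked[start:end] = 0.0
    elif mode == "average":                   # add the target spans and average them
        length = min(end - start for start, end in spans)
        avg = np.mean([frame[s:s + length] for s, _ in spans], axis=0)
        for start, _ in spans:
            masked[start:start + length] = avg  # converted information replaces the target
    return masked
```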
In an embodiment, after the masked phoneme identification frame is obtained, the masked phoneme identification frame may be used to train the to-be-trained speech recognition model, so as to obtain the preset speech recognition model. Specifically, the step of training the speech recognition model to be trained by using the masked phoneme identification frame to obtain the preset speech recognition model may include:
performing feature extraction on the masked phoneme identification frame by using a speech recognition model to be trained to obtain feature information of the masked phoneme identification frame;
respectively performing phoneme alignment and word unit alignment on the feature information by using a speech recognition model to be trained to obtain a target phoneme and a target word unit of the masked phoneme identification frame;
performing text mapping on the characteristic information by using a speech recognition model to be trained to obtain a speech recognition text of the masked phoneme identification frame;
performing joint operation on the target phoneme, the target word unit and the voice recognition text to obtain joint loss information;
and adjusting model parameters of the speech recognition model to be trained according to the joint loss information to obtain a preset speech recognition model.
The training of the model may include learning the model from massive data, so that the model may summarize a rule from the massive data, and may process data input into the model at will according to the rule.
The training of the speech recognition model to be trained may be to make the model perform speech recognition on the masked phoneme identification frames, so that the model learns how to convert voice data into text information.
In one embodiment, a flow diagram of training the speech recognition model to be trained may be as shown in fig. 7. First, the coding layer in the speech recognition model to be trained is used to extract the feature information of the masked phoneme identification frame.
In one embodiment, the training process for the coding layer is to let the coding layer learn knowledge about acoustics, i.e., learn which modeling unit the current masked phoneme identification frame most resembles, and represent this with a vector. Unlike conventional training methods, in the embodiment of the present application the coding layer is trained with masked phoneme identification frames, which encourages the coding layer to learn to predict the identity of the current masked phoneme identification frame from the surrounding speech frames.
In an embodiment, in an end-to-end hybrid speech recognition model, the role of CTC is generally to make model training converge faster and, additionally, to assist the coding layer, thereby improving recognition performance. In the preset speech recognition model proposed in the embodiment of the present application, the CTC can also guide the coding layer to learn more from the masked phoneme identification frames. Therefore, the phoneme alignment layer in the speech recognition model to be trained can be used to perform phoneme alignment on the feature information, and the word unit alignment layer to perform word unit alignment on the feature information, so as to obtain the target phoneme and the target word unit of the masked phoneme identification frame. In addition, text mapping can be performed on the feature information by utilizing the attention layer and the decoding layer in the speech recognition model to be trained, so as to obtain the speech recognition text of the masked phoneme identification frame.
Among these operations, the step of performing phoneme alignment and word unit alignment on the feature information to obtain the target phoneme and the target word unit of the masked phoneme identification frame, and the step of performing text mapping on the feature information to obtain the speech recognition text of the masked phoneme identification frame, may be performed in either order, or in parallel.
And then, performing joint operation on the target phoneme, the target word unit and the voice recognition text to obtain joint loss information, and performing model parameter adjustment on the to-be-trained voice recognition model according to the joint loss information to obtain a preset voice recognition model.
In an embodiment, when performing a joint operation on the target phoneme, the target word unit, and the speech recognition text to obtain joint loss information, alignment loss information of the target phoneme and the target word unit may be calculated, and text loss information of the speech recognition text and the preset identification text may be calculated. And then, fusing the alignment loss information and the text loss information to obtain joint loss information. Specifically, the step of performing a joint operation on the target phoneme, the target word unit, and the speech recognition text to obtain joint loss information may include:
calculating alignment loss information of a target phoneme and the target word unit;
calculating text loss information between the voice recognition text and a preset identification text;
and fusing the alignment loss information and the text loss information to obtain joint loss information.
The alignment loss information includes phoneme loss information between the target phoneme and the preset phoneme and word unit loss information between the target word unit and the preset word unit.
In one embodiment, since the phoneme identification frame has phoneme identification information, the speech recognition model to be trained may have a preset phoneme and a preset word unit of the phoneme identification frame. That is, through the phoneme identification information, the speech recognition model to be trained already knows which phonemes and which word units correspond to the phoneme identification frame. Therefore, when the alignment loss information is calculated, phoneme loss information between the target phoneme and the preset phoneme and word unit loss information between the target word unit and the preset word unit may be calculated, respectively.
Wherein the phoneme loss information between the target phoneme and the preset phoneme may be calculated using the CTC function. Similarly, word unit loss information between the target word unit and the preset word unit can also be calculated by using the CTC function.
In an embodiment, after obtaining the phoneme loss information and the word unit loss information, the phoneme loss information and the word unit loss information may be fused to obtain the alignment loss information.
For example, if the alignment Loss information is expressed as Loss_CTC, the phoneme Loss information as Loss_grapheme_CTC, and the word unit Loss information as Loss_word-piece_CTC, then the phoneme Loss information and the word unit Loss information may be fused according to the following formula, so as to obtain the alignment Loss information:
Loss_CTC = Loss_word-piece_CTC + Theta × Loss_grapheme_CTC
in one embodiment, in calculating the text loss information between the voice recognition text and the preset identification text, the text loss information between the voice recognition text and the preset identification text may be calculated using a Cross-entropy function (CE loss). Then, the alignment loss information and the text loss information may be fused, thereby obtaining joint loss information.
For example, the text Loss information may be represented as Loss_CE, and the Joint Loss information as Loss_Joint. Then, the text loss information and the alignment loss information may be fused according to the following formula, so as to obtain the joint loss information:
Loss_Joint = Loss_CE + Alpha × Loss_CTC
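The two fusion formulas can be combined in a few lines of PyTorch, as sketched below. The tensor shapes, the blank index, and the example values of Alpha and Theta are illustrative assumptions rather than values given in the application.

```python
import torch
import torch.nn.functional as F

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def joint_loss(phone_log_probs, word_log_probs, decoder_logits,
               phone_targets, word_targets, text_targets,
               input_lengths, phone_lengths, word_lengths,
               alpha=0.3, theta=0.5):
    """Loss_Joint = Loss_CE + Alpha * (Loss_word-piece_CTC + Theta * Loss_grapheme_CTC).

    phone_log_probs / word_log_probs: (T, N, C) log-softmax outputs of the two
    alignment layers; decoder_logits: (N, T_out, V) outputs of the decoding layer.
    """
    loss_grapheme_ctc = ctc_loss(phone_log_probs, phone_targets,
                                 input_lengths, phone_lengths)   # phoneme-level CTC loss
    loss_wordpiece_ctc = ctc_loss(word_log_probs, word_targets,
                                  input_lengths, word_lengths)   # word-unit CTC loss
    loss_ctc = loss_wordpiece_ctc + theta * loss_grapheme_ctc    # alignment loss (Loss_CTC)
    loss_ce = F.cross_entropy(decoder_logits.transpose(1, 2),
                              text_targets)                      # text loss (Loss_CE)
    return loss_ce + alpha * loss_ctc                            # joint loss (Loss_Joint)
```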
In an embodiment, after obtaining the joint loss information, the model parameters of the speech recognition model to be trained may be adjusted according to the joint loss information, so as to obtain the preset speech recognition model.
102. And performing phoneme alignment on at least one voice characteristic frame to obtain a target phoneme set of the voice data in the target language.
In an embodiment, after at least one speech feature frame of the speech data is acquired, the speech feature frame may be subjected to phoneme alignment, so as to obtain a target phoneme set of the speech data in the target language.
The phoneme includes the minimum unit in the speech, and is the smallest speech unit divided from the perspective of tone.
For example, there are 48 phonemes in English, including 20 vowel phonemes and 28 consonant phonemes. As another example, there are 32 phonemes in Chinese, including 10 vowels and 22 consonants.
For example, taking Chinese, the Chinese syllable ā has only one phoneme, ài (love) has two phonemes, and dài (generation) has three phonemes.
Where phoneme alignment refers to determining the phoneme corresponding to each speech feature frame.
The target phoneme set comprises a set of phonemes corresponding to each speech feature frame. And through the target phoneme set, the computer device can acquire what the pronunciation of the voice data is.
In an embodiment, when performing phoneme alignment on the speech feature frames, a path search may be performed on each speech feature frame in a preset phoneme search space to obtain at least one phoneme search path, and the target phoneme set is generated from the phoneme search paths. Specifically, the step of "performing phoneme alignment on at least one speech feature frame respectively to obtain a target phoneme set of the speech data in the target language" may include:
performing path search on each voice characteristic frame in a preset phoneme search space to obtain at least one phoneme search path;
calculating the cumulative probability of the voice characteristic frame on each phoneme searching path;
a target phone set of the speech data is determined based on the cumulative probabilities.
Wherein, the preset phoneme search space may include a space composed of acoustic knowledge in the target language. What features each phoneme has under the target language, and the relationship between the individual phonemes, etc. are defined in the preset phoneme search space.
For example, the MFCCs corresponding to each phoneme in the target language may be included in the preset phoneme search space. For another example, if some phonemes are always used together, the distance between the phonemes in the preset phoneme search space is relatively short. Conversely, if some phonemes are not used together, the distance between the phonemes in the preset phoneme search space is relatively far.
In one embodiment, the predetermined phoneme search space may have a variety of representations. For example, the preset phoneme search space may be a matrix. For another example, the preset phoneme search space may be a graph structure. As another example, the preset phoneme search space may be a tree structure, and so on.
In an embodiment, each phoneme feature frame may be subjected to a path search in a preset phoneme search space to obtain at least one phoneme search path. The phoneme search space comprises a plurality of phoneme sets, so that when each phoneme feature frame is subjected to path search in the preset phoneme search space, path search can be carried out on the phoneme features according to the phoneme sets. Specifically, the step of performing a path search on each speech feature frame in a preset phoneme search space to obtain at least one phoneme search path may include:
performing feature enhancement on the voice feature frame under the phoneme granularity to obtain the phoneme features of the voice feature frame;
screening a target phoneme set from the multiple phoneme sets according to the phoneme characteristics;
and performing phoneme search in the target phoneme set according to the phoneme characteristics, and generating at least one phoneme search path according to the search result.
In one embodiment, to improve efficiency, phonemes with similar phoneme characteristics may be grouped together to form a phoneme feature set. And a plurality of phoneme feature sets are collected together to form a preset phoneme search space.
In one embodiment, in order to obtain the phoneme search path more accurately, the speech feature frames may be subjected to feature enhancement at the phoneme granularity to obtain the phoneme features of the speech feature frames. For example, the speech feature frames may be spectrally enhanced to obtain phoneme features of the speech feature frames.
Next, a target phoneme set can be screened out from the multiple phoneme sets according to the phoneme characteristics. In an embodiment, each phoneme set may include at least one preset phoneme; therefore, the phoneme features of the preset phonemes may be normalized to serve as the identifying phoneme features of the phoneme set. The phoneme features can then be matched against the identifying phoneme features of each phoneme set, and the target phoneme set screened out according to the matching result.
When the matching degree of the phoneme feature and the preset phoneme features on the multiple phoneme sets is the same, the multiple phoneme sets can be used as the target phoneme set.
In an embodiment, after the target phoneme set is screened out, a phoneme search may be performed in the target phoneme set according to the phoneme characteristics, and at least one phoneme search path may be generated according to the search result. Specifically, the step of performing a phoneme search in the target phoneme set according to the phoneme characteristics and generating at least one phoneme search path according to the search result may include:
respectively calculating the matching probability between the phoneme characteristics and at least one preset phoneme;
determining a target phoneme in at least one preset phoneme according to the matching probability;
and associating the target phoneme of each phoneme feature to obtain a phoneme search path.
When the matching probability between the phoneme feature and each preset phoneme is calculated, various probability algorithms can be used for calculation. For example, a match probability between the phoneme feature and at least one preset phoneme may be calculated using a Maximum Likelihood Estimation (MLE) algorithm. Then, according to the matching probability, a target phoneme can be determined in at least one preset phoneme, and the target phoneme of each phoneme feature in each voice data is associated, so that a target search path is obtained.
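A minimal sketch of this step is given below, assuming each speech feature frame has already been scored against the preset phonemes so that a per-frame vector of matching probabilities is available; keeping the top-scoring phonemes of each frame and associating one choice per frame then yields the candidate search paths. The array layout and the top-k cutoff are assumptions made for illustration.

```python
import itertools
import numpy as np

def phoneme_search_paths(match_probs, phonemes, top_k=2):
    """match_probs: (num_frames, num_phonemes) matching probabilities per frame."""
    per_frame = []
    for frame_probs in match_probs:
        best = np.argsort(frame_probs)[-top_k:]  # target phonemes of this frame
        per_frame.append([(phonemes[i], float(frame_probs[i])) for i in best])
    # associating the per-frame target phonemes yields the phoneme search paths
    return [list(path) for path in itertools.product(*per_frame)]
```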
For example, as shown in fig. 8, 002 in fig. 8 may be a preset phoneme search space. Then, each column in the speech phoneme search space may be a phoneme set, for example 003 in fig. 8 may be a phoneme set. Then, the phoneme set comprises at least one preset phoneme. For example, 004 in fig. 8 may be a preset phoneme.
In one embodiment, as shown in fig. 9, at least one phoneme search path is obtained by performing a path search on each speech feature frame in the voice data. After obtaining the at least one phoneme search path, the cumulative probability of each speech feature frame on each phoneme search path may be calculated, and the target phoneme set of the voice data determined according to the cumulative probabilities.
For example, as shown in fig. 9, when the cumulative probability of the speech feature frame on the phoneme search path 005 is calculated, the matching probability between each target phoneme and the phoneme feature on the phoneme search path 005 may be cumulatively operated to obtain the cumulative probability.
Specifically, the step of "calculating the cumulative probability of the speech feature frame on each phoneme search path" may include:
obtaining the matching probability of each target phoneme on the phoneme search path;
And performing cumulative operation on the matching probability of each target phoneme to obtain the cumulative probability.
The matching probability of each target phoneme can be accumulated in various ways to obtain the accumulated probability. For example, the matching probabilities for each target phoneme may be added to obtain the cumulative probability. For another example, the matching probabilities of each target phoneme may be weighted and summed to obtain the cumulative probability.
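As a small sketch of the accumulation step, assume the matching probabilities along one path have been collected into a list; the optional weights correspond to the weighted-sum variant mentioned above.

```python
import numpy as np

def cumulative_probability(match_probs, weights=None):
    """Accumulate the matching probabilities of the target phonemes on one path."""
    probs = np.asarray(match_probs, dtype=float)
    if weights is None:
        return float(probs.sum())                      # plain addition
    return float(np.dot(np.asarray(weights), probs))   # weighted sum
```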
In one embodiment, after obtaining the cumulative probability of each phoneme search path, the target phoneme set of the speech data may be determined according to the cumulative probability. Specifically, the step of "determining a target phoneme set of the speech data according to the cumulative probability" may include:
comparing the cumulative probability on each phoneme search path to obtain a comparison result;
determining a target phoneme search path in at least one phoneme search path according to the comparison result;
and splicing the target phonemes on the target phoneme search path to obtain a target phoneme set.
For example, as shown in fig. 9, the cumulative probability of the phoneme search path 005 is the largest when the cumulative probabilities are compared, so the phoneme search path 005 can be determined as the target phoneme search path. Then, the target phonemes on the target phoneme search path may be spliced, so as to obtain the target phoneme set. For example, the target phonemes on the target phoneme search path are "-", "t", "-", "i", "-", "k", and "l", where "-" denotes a blank phoneme. Therefore, when the target phonemes are spliced, the blank phonemes can be deleted, so as to obtain the target phoneme set. For example, after the target phonemes in the figure are spliced, the target phoneme set "tikl" is obtained.
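Putting the comparison and the splicing together, a minimal sketch (with "-" standing for the blank phoneme, as in the example above):

```python
def splice_target_phonemes(paths_with_scores):
    """Pick the path with the largest cumulative probability, then splice out blanks.

    paths_with_scores: list of (phoneme_list, cumulative_probability) pairs.
    """
    phonemes, _ = max(paths_with_scores, key=lambda pair: pair[1])
    return "".join(p for p in phonemes if p != "-")  # delete blank phonemes, then splice

# e.g. a winning path ["-", "t", "-", "i", "-", "k", "l"] splices to "tikl"
```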
By performing path search on each voice feature frame in the preset phoneme search space, the obtained target phoneme set can sufficiently represent phonemes possibly included in the voice data, and the accuracy of performing phoneme alignment on the voice feature frames is improved.
103. And performing word unit alignment on at least one voice characteristic frame to obtain a target word set of the voice data in the target language, wherein the target word set comprises a word unit corresponding to each voice characteristic frame.
In an embodiment, word unit alignment may be performed on at least one speech feature frame, so as to obtain a target word set of the speech data in the target language.
Where a word unit may be the smallest unit that constitutes a word in the target language. For example, in English, the word units of the word "hello" may include "he" and "llo". As another example, the word units of the word "loving" may include "lov" and "ing".
In one embodiment, word units in the target language may be obtained by a subword algorithm, that is, an algorithm that can convert a word into word units. For example, the subword algorithm may include the Byte Pair Encoding (BPE) algorithm, the word-piece algorithm, and so on.
For example, a word-piece algorithm may be provided in a word unit alignment layer in the preset speech recognition model provided in the embodiment of the present application, so that the word unit alignment layer may perform word unit alignment on a speech recognition frame through the word-piece algorithm, thereby improving the accuracy of word unit alignment.
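As an illustration of how word units might be produced, the sketch below segments a word by greedy longest-match against a toy subword vocabulary. Real word-piece or BPE vocabularies are learned from corpus statistics, so both the vocabulary and the greedy strategy here are simplifying assumptions.

```python
def word_pieces(word, vocab):
    """Greedy longest-match segmentation of a word into word units."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])         # fall back to a single character
            i += 1
    return pieces

toy_vocab = {"he", "llo", "lov", "ing"}
print(word_pieces("hello", toy_vocab))   # ['he', 'llo']
print(word_pieces("loving", toy_vocab))  # ['lov', 'ing']
```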
In one embodiment, a similar method for phoneme alignment of the speech feature frames may be used when performing word unit alignment of at least one speech feature frame. The difference is that word unit alignment aligns the speech feature frames at the word granularity, and phoneme alignment aligns the speech feature frames at the phoneme granularity. Specifically, the step "respectively perform word unit alignment on the at least one speech feature frame to obtain a target word set of the speech data in the target language" may include:
performing path search on each voice characteristic frame in a preset dictionary search space to obtain at least one word unit search path;
calculating the cumulative probability of the voice characteristic frame on each word unit search path;
and determining a target word set of the voice characteristic frame according to the cumulative probability.
Wherein the preset dictionary search space may include a space made up of word knowledge in the target language. Phonemes corresponding to each word unit in the target language, what characteristics the phonemes corresponding to each word unit have, the relationship between the word units, and the like are defined in the preset dictionary search space.
In an embodiment, the predetermined dictionary search space may also include a plurality of word unit sets, and the step of performing a path search on each of the speech feature frames in the predetermined dictionary search space may refer to the step of performing phoneme alignment on the speech feature frames. Therefore, the step of performing a path search on each speech feature frame in a preset dictionary search space to obtain at least one word unit search path may include:
performing feature enhancement on the voice feature frame under the word unit granularity to obtain word unit features of the voice feature frame;
screening out a target word unit set from the word unit sets according to the word unit characteristics;
and performing word search in the target word unit set according to the word unit characteristics, and generating at least one word unit search path according to the search result.
In one embodiment, to improve efficiency, word units with similar word unit characteristics may be grouped together to form a word unit set, and a plurality of word unit sets together constitute the preset dictionary search space.
In one embodiment, to obtain the word unit search path more accurately, the speech feature frames may be feature enhanced at the word unit granularity, so as to obtain the word unit features of the speech feature frames. For example, the speech feature frame may be subjected to feature extraction again, so as to obtain word unit features.
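As one hedged possibility for this "feature extraction again", the sketch below re-projects each speech feature frame with a linear layer and a ReLU to obtain word unit features; the layer shapes and the helper name are assumptions, not details given in the embodiment.

    import numpy as np

    def enhance_at_word_unit_granularity(frames, W, b):
        """frames: (T, d_in) speech feature frames -> (T, d_out) word unit features."""
        return np.maximum(frames @ W + b, 0.0)  # linear projection + ReLU

    rng = np.random.default_rng(0)
    frames = rng.standard_normal((4, 8))              # 4 speech feature frames
    W, b = rng.standard_normal((8, 16)), np.zeros(16)
    word_unit_features = enhance_at_word_unit_granularity(frames, W, b)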
Then, a target word unit set can be screened out from the plurality of word unit sets according to the word unit characteristics. This step may refer to the step of "screening out the target phoneme set from the plurality of phoneme sets", which is not repeated here.
In one embodiment, after the target word unit set is screened out, a word search may be performed in the target word unit set according to the word unit characteristics, and at least one word unit search path may be generated according to the search result (a minimal sketch follows the list of steps below). Specifically, the step of "performing word search in the target word unit set according to the word unit features and generating at least one word unit search path according to the search result" may include:
respectively calculating the matching probability between the word unit characteristics and at least one preset word unit;
determining a target word unit in at least one preset word unit according to the matching probability;
and associating the target word units of each word unit characteristic to obtain a word unit search path.
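The sketch below illustrates these three sub-steps for a single best path; the embedding-based matching probability is an assumption, and a practical decoder would keep several candidate units per frame (for example, by beam search) so that at least one word unit search path is generated.

    import numpy as np

    def word_unit_search_path(unit_features, preset_unit_embeddings, unit_names):
        """unit_features: (T, d); preset_unit_embeddings: (U, d); unit_names: U names."""
        path = []
        for feature in unit_features:
            scores = preset_unit_embeddings @ feature        # one score per preset word unit
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                             # matching probabilities
            path.append(unit_names[int(np.argmax(probs))])   # target word unit for this frame
        return path                                          # frame-ordered association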
By performing phoneme alignment and word unit alignment on the speech feature frames, a target phoneme set and a target word set of the speech data can be obtained. The initial speech recognition text can then be adjusted using the target word set and the target phoneme set, thereby improving the accuracy of the speech recognition text.
104. And performing text mapping on at least one voice characteristic frame to obtain an initial voice recognition text of the voice data in the target language.
In an embodiment, the embodiment of the present application may further perform text mapping on at least one speech feature frame, so as to obtain an initial speech recognition text of the speech data in the target language. Specifically, the step of performing text mapping on at least one speech feature frame to obtain an initial speech recognition text of the speech data in the target language may include:
performing attention feature extraction on the voice feature frame on a plurality of attention dimensions to obtain attention features of the voice feature frame on each attention dimension;
and decoding the attention features on each attention dimension to obtain an initial speech recognition text of the speech data in the target language.
In one embodiment, in order to improve the accuracy of speech recognition, attention feature extraction may be performed on speech feature frames in multiple attention dimensions, so as to obtain attention features of the speech feature frames in each attention dimension.
The attention feature extraction may be performed on the speech feature frame by using a multi-head attention mechanism, where each attention head corresponds to one attention dimension.
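A minimal numpy sketch of such a multi-head attention mechanism follows; the projection matrices are random stand-ins for learned parameters, and the head count and shapes are assumptions for illustration.

    import numpy as np

    def multi_head_attention(frames, num_heads, rng):
        """frames: (T, d_model); each head plays the role of one attention dimension."""
        d_model = frames.shape[1]
        d_head = d_model // num_heads
        heads = []
        for _ in range(num_heads):
            # Per-head projections (random here only because this is a sketch).
            Wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
            Wk = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
            Wv = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
            Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv
            scores = Q @ K.T / np.sqrt(d_head)               # scaled dot-product attention
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the frames
            heads.append(weights @ V)                        # attention feature of this head
        return np.concatenate(heads, axis=-1)                # (T, d_model)

    rng = np.random.default_rng(0)
    attention_features = multi_head_attention(rng.standard_normal((5, 16)), 4, rng)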
In one embodiment, after obtaining the attention feature, the attention feature in each attention dimension may be decoded to obtain an initial speech recognition text of the speech data in the target language.
For example, a feature distribution of the attention feature in a preset text mapping space may be calculated, and then text information corresponding to the attention feature may be determined according to the feature distribution.
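A greedy sketch of this decoding step is shown below: each attention feature is projected into a hypothetical text mapping space, a softmax gives the feature distribution, and the most probable entry per frame is read off. A production decoder would typically use beam search rather than per-frame argmax.

    import numpy as np

    def decode(attention_features, W_vocab, vocab):
        """attention_features: (T, d); W_vocab: (d, len(vocab)); vocab: text units."""
        logits = attention_features @ W_vocab
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)  # feature distribution per frame
        ids = probs.argmax(axis=-1)                 # most probable text unit per frame
        return "".join(vocab[i] for i in ids)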
In an embodiment, it should be noted that there is no required order among the step of "performing phoneme alignment on at least one speech feature frame respectively to obtain a target phoneme set of the speech data in the target language", the step of "performing word unit alignment on at least one speech feature frame respectively to obtain a target word set of the speech data in the target language", and the step of "performing text mapping on at least one speech feature frame respectively to obtain an initial speech recognition text of the speech data in the target language". For example, these steps may be performed sequentially or in parallel.
105. And adjusting the initial voice recognition text according to the target phoneme set and the target word set to obtain and output the voice recognition text of the voice data.
In an embodiment, after the target phoneme set, the target word set and the initial speech recognition text are obtained, the initial speech recognition text may be adjusted according to the target phoneme set and the target word set to obtain and output a speech recognition text of the speech data. Specifically, the step of adjusting the initial speech recognition text according to the target phoneme set and the target word set to obtain and output the speech recognition text of the speech data may include:
recognizing phoneme information of an initial speech recognition text;
adjusting the phoneme information by using the target phoneme set to obtain a phoneme-adjusted speech recognition text;
and adjusting the voice recognition text after the phoneme adjustment by using the target word set to obtain and output the voice recognition text of the voice data.
For example, a matching degree between the target phoneme set and the phoneme information of the initial speech recognition text may be calculated. When the matching degree reaches a preset threshold, the initial speech recognition text is used as the phoneme-adjusted speech recognition text; when the matching degree does not reach the preset threshold, the unmatched phonemes in the initial speech recognition text can be replaced with phonemes in the target phoneme set, so as to obtain the phoneme-adjusted speech recognition text.
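A minimal sketch of this phoneme adjustment, assuming the two phoneme sequences are already aligned one-to-one and taking 0.9 as an illustrative threshold:

    def adjust_phonemes(text_phonemes, target_phonemes, threshold=0.9):
        matches = sum(a == b for a, b in zip(text_phonemes, target_phonemes))
        matching_degree = matches / max(len(text_phonemes), 1)
        if matching_degree >= threshold:
            # Matching degree reaches the threshold: keep the initial text's phonemes.
            return list(text_phonemes)
        # Otherwise replace unmatched phonemes with those of the target phoneme set.
        return [t if s != t else s for s, t in zip(text_phonemes, target_phonemes)]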
After the phoneme-adjusted speech recognition text is obtained, the phoneme-adjusted speech recognition text may be adjusted by using the target word set, so as to obtain and output the speech recognition text of the speech data. For example, the text similarity between the target word set and the phoneme-adjusted speech recognition text may be calculated using a cosine distance, an edit distance, a twin network, a word vector algorithm, or the like. When the text similarity between the two reaches a preset threshold, the phoneme-adjusted speech recognition text is output.
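Of the similarity measures named above, the edit distance is the simplest to sketch; the normalization into a similarity score between 0 and 1 is an assumption for illustration.

    def edit_distance(a, b):
        """Classic single-row dynamic-programming edit (Levenshtein) distance."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                         dp[j - 1] + 1,      # insertion
                                         prev + (ca != cb))  # substitution
        return dp[-1]

    def text_similarity(a, b):
        return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)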
The embodiment of the application provides a voice recognition method, which comprises the steps of obtaining at least one voice characteristic frame of voice data in a target language; respectively carrying out phoneme alignment on at least one voice characteristic frame to obtain a target phoneme set of the voice data in the target language; respectively carrying out word unit alignment on at least one voice characteristic frame to obtain a target word set of voice data in a target language, wherein the target word set comprises a word unit corresponding to each voice characteristic frame; respectively carrying out text mapping on at least one voice characteristic frame to obtain an initial voice recognition text of the voice data in the target language; and adjusting the initial voice recognition text according to the target phoneme set and the target word set to obtain and output the voice recognition text of the voice data. In the embodiment of the application, the speech feature frames can be aligned in two dimensions of phoneme and word units, and the initial speech recognition text is adjusted by using the target phoneme set and the target word set, so that the matching degree between the speech recognition text and the speech data can be improved, and the accuracy of speech recognition can be improved.
In addition, the embodiment of the application correspondingly provides a preset speech recognition model, and the preset speech recognition model is obtained by training on masked phoneme features. Through the masked phoneme features, the preset speech recognition model learns to predict the masked part from more of the surrounding information, which improves the robustness of the preset speech recognition model. Moreover, the preset speech recognition model includes a phoneme alignment layer and a word unit alignment layer, so that the model can learn alignment information at multiple granularities, which improves the performance of the speech recognition model.
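The phoneme masking used in this training can be sketched as follows; the masking ratio and mask value are assumptions, and the sketch only shows screening out some phoneme information, converting it, and filling it back into the frames, as described in the embodiments.

    import numpy as np

    def mask_phoneme_frames(frames, mask_ratio=0.15, mask_value=0.0, seed=0):
        """frames: (N, d) phoneme identification frames."""
        rng = np.random.default_rng(seed)
        picks = rng.random(len(frames)) < mask_ratio  # screen out target phoneme information
        masked = frames.copy()
        masked[picks] = mask_value                    # convert it and fill it back in
        return masked, picks                          # picks marks positions to predict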
The method described in the above examples is further illustrated in detail below by way of example.
The method of the embodiment of the present application will be described by taking an example in which a speech recognition method is integrated on a server.
In an embodiment, as shown in fig. 9, a speech recognition method includes the following specific processes:
201. The computer device obtains at least one speech feature frame of speech data in a target language.
202. The computer device performs phoneme alignment on the at least one speech feature frame respectively to obtain a target phoneme set of the speech data in the target language.
203. The computer device performs word unit alignment on the at least one speech feature frame respectively to obtain a target word set of the speech data in the target language, where the target word set includes a word unit corresponding to each speech feature frame.
204. The computer device performs text mapping on the at least one speech feature frame respectively to obtain an initial speech recognition text of the speech data in the target language.
205. The computer device adjusts the initial speech recognition text according to the target phoneme set and the target word set to obtain and output a speech recognition text of the speech data.
In the embodiment of the application, the computer device can acquire at least one speech feature frame of speech data in a target language; the computer device can respectively perform phoneme alignment on the at least one speech feature frame to obtain a target phoneme set of the speech data in the target language; the computer device can respectively perform word unit alignment on the at least one speech feature frame to obtain a target word set of the speech data in the target language, where the target word set includes a word unit corresponding to each speech feature frame; the computer device can respectively perform text mapping on the at least one speech feature frame to obtain an initial speech recognition text of the speech data in the target language; and the computer device can adjust the initial speech recognition text according to the target phoneme set and the target word set to obtain and output a speech recognition text of the speech data. In the embodiment of the present application, the computer device may align the speech feature frames in the two dimensions of phonemes and word units, and adjust the initial speech recognition text by using the target phoneme set and the target word set, so that the matching degree between the speech recognition text and the speech data may be improved, thereby improving the accuracy of speech recognition.
In order to better implement the speech recognition method provided by the embodiments of the present application, an embodiment further provides a speech recognition based apparatus, which may be integrated into a computer device. The terms used here have the same meanings as in the speech recognition method above, and for specific implementation details, reference may be made to the description in the method embodiments.
In an embodiment, there is provided a speech recognition based apparatus, which may be specifically integrated in a computer device, as shown in fig. 11, the speech recognition based apparatus includes: the obtaining unit 301, the phoneme aligning unit 302, the word unit aligning unit 303, the text mapping unit 304, and the adjusting unit 305 are specifically as follows:
an obtaining unit 301, configured to obtain at least one speech feature frame of speech data in a target language;
a phoneme aligning unit 302, configured to perform phoneme alignment on the at least one speech feature frame respectively to obtain a target phoneme set of the speech data in the target language;
a word unit alignment unit 303, configured to perform word unit alignment on the at least one speech feature frame respectively to obtain a target word set of the speech data in the target language, where the target word set includes a word unit corresponding to each speech feature frame;
a text mapping unit 304, configured to perform text mapping on the at least one speech feature frame, respectively, to obtain an initial speech recognition text of the speech data in a target language;
an adjusting unit 305, configured to adjust the initial speech recognition text according to the target phoneme set and the target word set, so as to obtain and output a speech recognition text of the speech data.
In one embodiment, the phoneme alignment unit includes:
the path searching subunit is used for performing path searching on each voice characteristic frame in a preset phoneme searching space to obtain at least one phoneme searching path;
a calculating subunit, configured to calculate an accumulated probability of the speech feature frame on each phoneme search path;
and the determining subunit is used for determining the target phoneme set of the voice data according to the cumulative probability.
In one embodiment, the path searching subunit includes:
the feature enhancement module is used for carrying out feature enhancement on the voice feature frame under the phoneme granularity to obtain the phoneme features of the voice feature frame;
the screening module is used for screening a target phoneme set from the multiple phoneme sets according to the phoneme characteristics;
and the phoneme searching module is used for searching phonemes in the target phoneme set according to the phoneme characteristics and generating at least one phoneme searching path according to a searching result.
In one embodiment, the phoneme search module includes:
a calculation submodule for calculating matching probabilities between the phoneme feature and the plurality of phoneme nodes, respectively;
a determining submodule for determining at least one target phoneme node among the plurality of phoneme nodes according to the matching probability;
and the association submodule is used for associating the target phoneme nodes of each phoneme feature to obtain at least one target search path.
In one embodiment, the word unit alignment unit includes:
the path searching subunit is used for performing path searching on each voice characteristic frame in a preset dictionary searching space to obtain at least one word unit searching path;
the calculating subunit is used for calculating the cumulative probability of the voice characteristic frame on each word unit searching path;
and the determining subunit is used for determining the target word set of the voice feature frame according to the cumulative probability.
In one embodiment, the text mapping unit includes:
an attention feature extraction subunit, configured to perform attention feature extraction on the speech feature frame in multiple attention dimensions, so as to obtain attention features of the speech feature frame in each attention dimension;
and the decoding subunit is used for decoding the attention features on the attention dimensions to obtain an initial speech recognition text of the speech data in the target language.
In an embodiment, the adjusting unit includes:
a recognition subunit, configured to recognize phoneme information of the initial speech recognition text;
the first adjusting subunit is configured to adjust the phoneme information by using the target phoneme set to obtain a phoneme-adjusted speech recognition text;
and the second adjusting subunit is used for adjusting the voice recognition text after the phoneme adjustment by using the target word set to obtain and output the voice recognition text of the voice data.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
The above speech recognition based apparatus can improve the accuracy of speech recognition.
The embodiment of the present application further provides a computer device, where the computer device may include a terminal or a server, for example, the computer device may be a terminal based on voice recognition, and the terminal may be a mobile phone, a tablet computer, or the like; also for example, the computer device may be a server, such as a speech recognition based server. As shown in fig. 12, it shows a schematic structural diagram of a terminal according to an embodiment of the present application, specifically:
the computer device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 12 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles operating systems, user pages, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the computer device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The computer device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 via a power management system, so that functions of managing charging, discharging, and power consumption are implemented via the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The computer device may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions as follows:
acquiring at least one voice characteristic frame of voice data in a target language;
respectively carrying out phoneme alignment on the at least one voice characteristic frame to obtain a target phoneme set of the voice data in the target language;
respectively carrying out word unit alignment on the at least one voice characteristic frame to obtain a target word set of the voice data in the target language, wherein the target word set comprises a word unit corresponding to each voice characteristic frame;
respectively carrying out text mapping on the at least one voice characteristic frame to obtain an initial voice recognition text of the voice data in a target language;
and adjusting the initial voice recognition text according to the target phoneme set and the target word set to obtain and output the voice recognition text of the voice data.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the above embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the computer program.
To this end, the present application further provides a storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute the steps in any one of the speech recognition methods provided in the embodiments of the present application. For example, the computer program may perform the steps of:
acquiring at least one voice characteristic frame of voice data in a target language;
respectively carrying out phoneme alignment on the at least one voice characteristic frame to obtain a target phoneme set of the voice data in the target language;
respectively carrying out word unit alignment on the at least one voice characteristic frame to obtain a target word set of the voice data in the target language, wherein the target word set comprises a word unit corresponding to each voice characteristic frame;
respectively carrying out text mapping on the at least one voice characteristic frame to obtain an initial voice recognition text of the voice data in a target language;
and adjusting the initial voice recognition text according to the target phoneme set and the target word set to obtain and output the voice recognition text of the voice data.
Since the computer program stored in the storage medium can execute the steps in any of the speech recognition methods provided in the embodiments of the present application, the beneficial effects that can be achieved by any of the speech recognition methods provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described again here.
The foregoing describes in detail a speech recognition method, apparatus, computer device, and storage medium provided by embodiments of the present application, and specific examples are applied herein to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (15)

1. A speech recognition method, comprising:
acquiring at least one voice characteristic frame of voice data in a target language;
performing phoneme alignment on the at least one voice characteristic frame to obtain a target phoneme set of the voice data in the target language;
performing word unit alignment on the at least one voice characteristic frame to obtain a target word set of the voice data in the target language, wherein the target word set comprises a word unit corresponding to each voice characteristic frame;
performing text mapping on the at least one voice characteristic frame to obtain an initial voice recognition text of the voice data in a target language;
and adjusting the initial voice recognition text according to the target phoneme set and the target word set to obtain and output the voice recognition text of the voice data.
2. The method of claim 1, wherein the phoneme aligning the at least one speech feature frame to obtain a target phoneme set of the speech data in the target language comprises:
performing path search on each voice characteristic frame in a preset phoneme search space to obtain at least one phoneme search path;
calculating the cumulative probability of the voice characteristic frame on each phoneme searching path;
and determining a target phoneme set of the voice data according to the cumulative probability.
3. The method of claim 2, wherein the preset phoneme search space includes a plurality of phoneme sets; the performing path search on each voice feature frame in a preset phoneme search space to obtain at least one phoneme search path includes:
performing feature enhancement on the voice feature frame under the phoneme granularity to obtain the phoneme features of the voice feature frame;
screening out a target phoneme set from the multiple phoneme sets according to the phoneme characteristics;
and searching phonemes in the target phoneme set according to the phoneme characteristics, and generating at least one phoneme search path according to a search result.
4. The method of claim 3, wherein the phoneme set comprises at least one preset phoneme; the performing phoneme search in the target phoneme set according to the phoneme characteristics and generating at least one phoneme search path according to a search result includes:
respectively calculating the matching probability between the phoneme characteristics and the at least one preset phoneme;
determining a target phoneme in the at least one preset phoneme according to the matching probability;
and associating the target phoneme of each phoneme feature to obtain a target search path.
5. The method of claim 1, wherein said word unit aligning said at least one speech feature frame to obtain a target word set of said speech data in said target language comprises:
performing path search on each voice characteristic frame in a preset dictionary search space to obtain at least one word unit search path;
calculating the cumulative probability of the voice characteristic frame on each word unit search path;
and determining a target word set of the voice feature frame according to the cumulative probability.
6. The method of claim 1, wherein the text mapping the at least one speech feature frame to obtain an initial speech recognition text of the speech data in a target language comprises:
performing attention feature extraction on the voice feature frame on a plurality of attention dimensions to obtain attention features of the voice feature frame on each attention dimension;
and decoding the attention features on the attention dimensions to obtain an initial speech recognition text of the speech data in the target language.
7. The method of claim 1, wherein the adjusting the initial speech recognition text according to the target phoneme set and the target word set to obtain and output the speech recognition text of the speech data comprises:
recognizing phoneme information of the initial speech recognition text;
adjusting the phoneme information by using the target phoneme set to obtain a phoneme-adjusted speech recognition text;
and adjusting the voice recognition text after the phoneme adjustment by using the target word set to obtain and output the voice recognition text of the voice data.
8. The method of claim 1, wherein the phoneme aligning the at least one speech feature frame to obtain a target phoneme set of the speech data in the target language comprises:
performing phoneme alignment on the at least one voice characteristic frame by using a preset voice recognition model to obtain a target phoneme set of the voice data in the target language;
performing word unit alignment on the at least one voice feature frame to obtain a target word set of the voice data in the target language, including:
performing word unit alignment on the at least one voice characteristic frame by using a preset voice recognition model to obtain a target word set of the voice data in the target language;
the performing text mapping on the at least one voice feature frame to obtain an initial voice recognition text of the voice data in the target language includes:
performing text mapping on the at least one voice characteristic frame by using a preset voice recognition model to obtain an initial voice recognition text of the voice data in a target language;
the adjusting the initial speech recognition text according to the target phoneme set and the target word set to obtain and output the speech recognition text of the speech data includes:
and adjusting the initial voice recognition text according to the target phoneme set and the target word set by using a preset voice recognition model to obtain and output the voice recognition text of the voice data.
9. The method of claim 8, wherein before the performing phoneme alignment on the at least one speech feature frame by using a preset speech recognition model to obtain the target phoneme set of the speech data in the target language, the method comprises:
acquiring a plurality of phoneme identification frames and a speech recognition model to be trained;
performing phoneme masking processing on the plurality of phoneme identification frames to obtain masked phoneme identification frames;
and training the speech recognition model to be trained by utilizing the masked phoneme identification frame to obtain the preset speech recognition model.
10. The method of claim 9 wherein said phoneme identification frame includes phoneme information; the performing phoneme masking processing on the plurality of phoneme identification frames to obtain masked phoneme identification frames includes:
screening target phoneme information from the phoneme information of the phoneme identification frame;
performing information conversion processing on the target phoneme information to obtain converted phoneme information;
and filling the converted phoneme information into the phoneme identification frame to obtain the masked phoneme identification frame.
11. The method of claim 9, wherein the training the speech recognition model to be trained by using the masked phoneme identification frame to obtain a preset speech recognition model comprises:
performing feature extraction on the masked phoneme identification frame by using the to-be-trained speech recognition model to obtain feature information of the masked phoneme identification frame;
respectively performing phoneme alignment and word unit alignment on the feature information by using the to-be-trained speech recognition model to obtain a target phoneme and a target word unit of the masked phoneme identification frame;
performing text mapping on the feature information by using the to-be-trained speech recognition model to obtain a speech recognition text of the masked phoneme identification frame;
performing joint operation on the target phoneme, the target word unit and the voice recognition text to obtain joint loss information;
and adjusting model parameters of the voice recognition model to be trained according to the joint loss information to obtain the preset voice recognition model.
12. The method of claim 11, wherein said performing a joint operation on said target phonemes, said target word units and said speech recognition text to obtain joint loss information comprises:
calculating alignment loss information of the target phoneme and the target word unit;
calculating text loss information between the voice recognition text and a preset identification text;
and fusing the alignment loss information and the text loss information to obtain joint loss information.
13. A speech recognition apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring at least one voice characteristic frame of voice data in a target language;
a phoneme aligning unit, configured to perform phoneme alignment on the at least one speech feature frame to obtain a target phoneme set of the speech data in the target language;
a word unit alignment unit, configured to perform word unit alignment on the at least one voice feature frame to obtain a target word set of the voice data in the target language, where the target word set includes a word unit corresponding to each voice feature frame;
the text mapping unit is used for performing text mapping on the at least one voice characteristic frame to obtain an initial voice recognition text of the voice data in the target language;
and the adjusting unit is used for adjusting the initial voice recognition text according to the target phoneme set and the target word set to obtain and output the voice recognition text of the voice data.
14. A computer device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the operations of the speech recognition method according to any one of claims 1 to 12.
15. A computer-readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the speech recognition method according to any one of claims 1 to 12.
CN202110815555.1A 2021-07-19 2021-07-19 Voice recognition method and device and computer equipment Pending CN113823265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110815555.1A CN113823265A (en) 2021-07-19 2021-07-19 Voice recognition method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110815555.1A CN113823265A (en) 2021-07-19 2021-07-19 Voice recognition method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN113823265A true CN113823265A (en) 2021-12-21

Family

ID=78912702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110815555.1A Pending CN113823265A (en) 2021-07-19 2021-07-19 Voice recognition method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN113823265A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724544A (en) * 2022-04-13 2022-07-08 北京百度网讯科技有限公司 Voice chip, voice recognition method, device and equipment and intelligent automobile
CN114783419A (en) * 2022-06-21 2022-07-22 深圳市友杰智新科技有限公司 Text recognition method and device combined with priori knowledge and computer equipment
CN114783419B (en) * 2022-06-21 2022-09-27 深圳市友杰智新科技有限公司 Text recognition method and device combined with priori knowledge and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination