CN114999463B - Voice recognition method, device, equipment and medium - Google Patents

Voice recognition method, device, equipment and medium Download PDF

Info

Publication number
CN114999463B
CN114999463B (application CN202210916222.2A)
Authority
CN
China
Prior art keywords
acoustic
voice
model
recognition
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210916222.2A
Other languages
Chinese (zh)
Other versions
CN114999463A
Inventor
古树桦
林余楚
周艺恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyi Information Technology Zhuhai Co ltd
Original Assignee
Shenyi Information Technology Zhuhai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyi Information Technology Zhuhai Co ltd filed Critical Shenyi Information Technology Zhuhai Co ltd
Priority to CN202210916222.2A priority Critical patent/CN114999463B/en
Publication of CN114999463A publication Critical patent/CN114999463A/en
Application granted granted Critical
Publication of CN114999463B publication Critical patent/CN114999463B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/04 Segmentation; Word boundary detection
                    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 Training
                    • G10L 15/08 Speech classification or search
                        • G10L 15/16 Speech classification or search using artificial neural networks
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 40/00 Handling natural language data
                    • G06F 40/30 Semantic analysis
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/08 Learning methods
                            • G06N 3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech recognition method, apparatus, device and medium. The method comprises: performing acoustic preprocessing on a speech signal to be recognized to obtain a plurality of human voice audio segments; performing transfer learning on an initial acoustic recognition model using the human voice audio segments to obtain an adaptive acoustic recognition model, the initial acoustic recognition model being an acoustic recognition model of a high-resource language; recognizing the plurality of human voice audio segments with the adaptive acoustic recognition model to obtain acoustic recognition features; and performing semantic recognition on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result. The method improves the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device and medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, device, and medium.
Background
Speech recognition technology has gradually matured and has reached a practical level in specific fields and environments. However, under conditions such as free speaking, high noise, simultaneous speakers and far-field acoustics, machine recognition performance is still far from satisfactory.
Against these uncertain factors, existing speech recognition technology suffers the following drawbacks: noise, speaker variability and personal accent in free-speaking scenarios still degrade recognition when such audio is used as input; and for low-resource languages and dialects the available data is too scarce to compensate, so the recognition of the speaker's speech and semantics remains poor.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method, a voice recognition device, computer equipment and a storage medium, which are used for improving the accuracy of semantic recognition.
In order to solve the foregoing technical problem, an embodiment of the present application provides a speech recognition method, which is applied to a distributed cluster, where the distributed cluster includes a plurality of node servers, and the speech recognition method includes:
carrying out acoustic preprocessing on a voice signal to be recognized to obtain a plurality of voice audio segments;
adopting the human voice audio clip to perform transfer learning on an initial acoustic recognition model to obtain a self-adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
recognizing the plurality of voice audio segments according to the self-adaptive acoustic recognition model to obtain acoustic recognition characteristics;
and performing semantic recognition on the acoustic recognition features based on a context neural language model to obtain a voice recognition result.
Optionally, the acoustic preprocessing is performed on the speech signal to be recognized to obtain a plurality of human voice audio segments, including:
carrying out voice extraction on the voice signal to be recognized to obtain a first voice signal;
carrying out recognition training on the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and based on the CTC end-to-end network, segmenting and aligning the second voice signal to obtain a plurality of voice audio segments.
Optionally, the voice extraction is performed by means of decibel filtering and/or frequency spectrum detection.
Optionally, the end-to-end CTC-based network segmenting and aligning the second voice signal to obtain a plurality of voice audio segments includes:
performing framing processing on the second voice signal to obtain a voice frame signal;
for each voice frame signal, determining, based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal;
determining a segmentation position according to the alignment mode;
and segmenting the second voice signal through the segmentation position to obtain a plurality of voice audio signals.
Optionally, the determining, for each voice frame signal and based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal includes:
identifying the voice frame signal based on the end-to-end network of the CTC to obtain the character posterior probability corresponding to the voice frame signal;
calculating the posterior probabilities of all the characters by adopting a dynamic programming mode to obtain the maximum joint probability;
and determining the alignment mode of the speech frame signal and the transcribed text corresponding to the second human voice signal based on the maximum joint probability.
Optionally, the initial acoustic recognition model is a hidden markov speech recognition model.
Optionally, the performing transfer learning on the initial acoustic recognition model by using the human voice audio segment to obtain an adaptive acoustic recognition model includes:
inputting the human voice audio clip into the initial acoustic recognition model for acoustic feature extraction to obtain a first acoustic feature;
and performing migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and a gradient inversion layer to obtain the self-adaptive acoustic recognition model, wherein the gradient inversion layer comprises a DNN structural feature extractor, a multi-element phoneme classifier and a domain classifier based on a learning acoustic model.
Optionally, the performing migration countermeasure training on the initial acoustic recognition model by using a second acoustic feature, the first acoustic feature, and a gradient inversion layer corresponding to the high-resource language corpus to obtain the adaptive acoustic recognition model includes:
a DNN structural feature extractor based on a learning acoustic model is used for respectively extracting the first acoustic feature and the second acoustic feature to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
aiming at the first structural feature and the second structural feature, adopting the multi-element phoneme classifier to classify phonemes, and generating phoneme labels according to a classification result;
for the first structural feature and the second structural feature, performing domain classification by using the domain classifier, and generating a domain label according to a classification result;
and performing migration countermeasure training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the self-adaptive acoustic recognition model.
Optionally, the domain label includes a source domain and a target domain, and the performing migration countermeasure training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature, and the second structural feature to obtain the adaptive acoustic recognition model includes:
performing public feature extraction and private feature extraction on first structural features and second structural features of tags in different fields by adopting an encoder of a domain separation network to obtain public features and private features corresponding to each field tag, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier;
decoding a sample consisting of the private characteristics and the public characteristics through a shared decoder of a domain separation network, and performing classification training on a source domain sample after decoding by adopting a classifier to obtain a trained classifier;
classifying the target domain data by adopting the trained classifier to obtain a target classification;
and performing migration countermeasure training by adopting the maximum mean difference as a loss function according to the common characteristics corresponding to the target classification to obtain the self-adaptive acoustic recognition model.
Optionally, the semantic recognition of the acoustic recognition features by the context-based neural language model to obtain a speech recognition result includes:
inputting the acoustic recognition features as an input sequence into a long-short term memory network;
modeling the probability of each vocabulary of the input sequence by adopting a long-short term memory network to obtain the word-level probability of each vocabulary, and determining the fixed size embedding of each vocabulary as a fixed embedding vector based on the word-level probability;
generating a word-by-word embedding vector on the context representation by a context-embedded standard word embedding module with a context-carried neural language model, and embedding by adopting the fixed embedding vector to obtain an enhanced sequence corresponding to the input sequence;
and performing semantic recognition on the enhanced sequence through a mask language model to obtain the voice recognition result.
Optionally, the fusion layer of the masked language model is a Transformer-XL architecture, and the decoder includes a multi-head attention layer with residual connections and fully connected layers.
In order to solve the above technical problem, an embodiment of the present application further provides a speech recognition apparatus, including:
the acoustic preprocessing module is used for performing acoustic preprocessing on a voice signal to be recognized to obtain a plurality of voice audio segments;
the transfer learning module is used for carrying out transfer learning on an initial acoustic recognition model by adopting the human voice audio clip to obtain a self-adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
the acoustic recognition module is used for recognizing the plurality of human voice audio frequency segments according to the self-adaptive acoustic recognition model to obtain acoustic recognition characteristics;
and the semantic recognition module is used for carrying out semantic recognition on the acoustic recognition characteristics based on the context neural language model to obtain a voice recognition result.
Optionally, the acoustic preprocessing module comprises:
the voice extraction submodule is used for extracting voice of the voice signal to be recognized to obtain a first voice signal;
the recognition training sub-module is used for recognizing and training the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and the segmentation alignment submodule is used for segmenting and aligning the second voice signal based on an end-to-end network of CTC to obtain a plurality of voice audio segments.
Optionally, the segmentation alignment sub-module includes:
a signal framing unit, configured to perform framing processing on the second voice signal to obtain a voice frame signal;
a signal alignment unit, configured to determine, for each voice frame signal and based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal;
the segmentation position determining unit is used for determining segmentation positions according to the alignment mode;
and the signal dividing unit is used for dividing the second voice signal through the dividing position to obtain a plurality of voice audio signals.
Optionally, the segmentation position determination unit includes:
a signal identification subunit, configured to identify the voice frame signal based on the end-to-end network of the CTC, so as to obtain a character posterior probability corresponding to the voice frame signal;
the dynamic programming subunit is used for calculating the posterior probabilities of all the characters in a dynamic programming mode to obtain the maximum joint probability;
and the signal alignment subunit is used for determining the alignment mode of the transcribed text corresponding to the speech frame signal and the second human voice signal based on the maximum joint probability.
Optionally, the migration learning module includes:
the acoustic feature extraction sub-module is used for inputting the human voice audio clip into the initial acoustic recognition model to perform acoustic feature extraction to obtain a first acoustic feature;
and the migration countermeasure training sub-module is used for performing migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and a gradient inversion layer to obtain the self-adaptive acoustic recognition model, wherein the gradient inversion layer comprises a DNN structural feature extractor, a multi-element phoneme classifier and a domain classifier based on a learning acoustic model.
Optionally, the migration countermeasure training sub-module comprises:
a feature extraction unit, configured to perform feature extraction on the first acoustic feature and the second acoustic feature respectively based on a DNN structural feature extractor of a learned acoustic model, to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
a first label generating unit, configured to classify phonemes by using the multi-element phoneme classifier according to the first structural feature and the second structural feature, and generate a phoneme label according to a classification result;
a second label generating unit, configured to perform domain classification by using the domain classifier according to the first structural feature and the second structural feature, and generate a domain label according to a classification result;
and the model training unit is used for carrying out migration countermeasure training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the self-adaptive acoustic recognition model.
Optionally, the domain label includes a source domain and a target domain, and the model training unit includes:
the public feature extraction subunit is used for performing public feature extraction and private feature extraction on the first structural features and the second structural features of the tags in different fields by adopting an encoder of a domain separation network to obtain public features and private features corresponding to each field tag, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier;
the classifier training subunit is used for decoding a sample consisting of private features and public features through a shared decoder of a domain separation network, and performing classification training on a source domain sample after decoding by adopting a classifier to obtain a trained classifier, wherein each private feature and each public feature comprise different phoneme labels;
the data classification subunit is used for classifying the target domain data by adopting the trained classifier to obtain a target classification;
and the migration countermeasure training subunit is used for performing migration countermeasure training by adopting the maximum mean difference as a loss function according to the common characteristics corresponding to the target classification to obtain the self-adaptive acoustic recognition model.
Optionally, the semantic recognition module includes:
the characteristic input submodule is used for inputting the acoustic recognition characteristics into the long-term and short-term memory network as an input sequence;
the vector calculation submodule is used for modeling the probability of each vocabulary of the input sequence by adopting a long-term and short-term memory network to obtain the word-level probability of each vocabulary, and determining the fixed size embedding of each vocabulary as a fixed embedding vector based on the word-level probability;
the sequence enhancement submodule is used for generating word-by-word embedded vectors on the context representation based on a context embedded standard word embedding module with a context-carried neural language model, and embedding by adopting the fixed embedded vectors to obtain an enhancement sequence corresponding to the input sequence;
and the semantic recognition submodule is used for carrying out semantic recognition on the enhanced sequence through a mask language model to obtain the voice recognition result.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above speech recognition method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above speech recognition method.
According to the speech recognition method, the speech recognition apparatus, the computer device and the storage medium described above, acoustic preprocessing is performed on a speech signal to be recognized to obtain a plurality of human voice audio segments; the human voice audio segments are used to perform transfer learning on an initial acoustic recognition model to obtain an adaptive acoustic recognition model, the initial acoustic recognition model being an acoustic recognition model of a high-resource language; the plurality of human voice audio segments are recognized with the adaptive acoustic recognition model to obtain acoustic recognition features; and semantic recognition is performed on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result. Speech information of a low-resource language is thereby extracted from a complex environment and semantically recognized, which improves the accuracy of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow diagram of one embodiment of a speech recognition method of the present application;
FIG. 2 is a schematic block diagram of one embodiment of a speech recognition apparatus according to the present application;
FIG. 3 is a block diagram of one embodiment of a computer device according to the present application;
FIG. 4 is a block diagram of an unsupervised domain adaptive baseline scheme with a gradient inversion layer for the speech recognition method of the present application;
fig. 5 is a diagram of a domain separation network structure of an embodiment of a speech recognition method of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 shows a speech recognition method according to an embodiment of the present invention, which is detailed as follows:
s201: and performing acoustic preprocessing on the voice signal to be recognized to obtain a plurality of voice audio segments.
Specifically, noise is almost always present in speech recognition scenarios. Wherever speech recognition is deployed, differences in hardware and environment mean that white noise is inevitably picked up by the microphone. The final recognition result is further affected by the fact that speakers do not articulate according to rules a machine finds easy to understand, for example because of accents and voice differences between speakers. In this embodiment, acoustic preprocessing is performed on the speech signal to be recognized to obtain a plurality of human voice audio segments.
The acoustic preprocessing includes, but is not limited to, applying a filter to the audio for threshold limiting and peak smoothing; a filter suited to the language characteristics may be chosen, and its role is to reduce the white noise introduced by heterogeneous hardware. The cleaner audio output by the filter is then passed to an acoustic pre-training model for the human voice extraction task. Preferably, in this embodiment the Spleeter model is used as the acoustic pre-training model to separate human voice from music.
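By way of a hedged illustration only (not the claimed implementation), the preprocessing stage described above could be prototyped in Python roughly as follows; the decibel threshold, the use of librosa for loading and RMS gating, and the Spleeter "2stems" configuration are assumptions of this sketch rather than requirements of the method.

```python
# Sketch of the acoustic preprocessing stage: decibel gating as a stand-in for
# the threshold-limiting / smoothing filter, followed by vocal separation with
# Spleeter. Parameter values are illustrative assumptions.
import numpy as np
import librosa
from spleeter.separator import Separator

def db_gate(y: np.ndarray, threshold_db: float = -40.0,
            frame_length: int = 2048, hop_length: int = 512) -> np.ndarray:
    """Suppress frames whose RMS energy falls below a decibel threshold."""
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    keep = np.repeat(db > threshold_db, hop_length)
    keep = np.pad(keep, (0, max(0, len(y) - len(keep))))[: len(y)]
    return y * keep.astype(y.dtype)

def extract_vocals(path: str, sr: int = 44100) -> np.ndarray:
    """Load audio, apply the decibel gate, then separate the vocal stem with Spleeter."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    y = db_gate(y)
    separator = Separator("spleeter:2stems")               # vocals + accompaniment
    stems = separator.separate(np.stack([y, y], axis=1))   # Spleeter expects (samples, channels)
    return stems["vocals"].mean(axis=1)                    # mono vocal track

# vocals = extract_vocals("utterance.wav")
```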
Optionally, the acoustic preprocessing is performed on the speech signal to be recognized to obtain a plurality of human voice audio segments, including:
carrying out voice extraction on a voice signal to be recognized to obtain a first voice signal;
carrying out recognition training on the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and based on the CTC end-to-end network, segmenting and aligning the second voice signal to obtain a plurality of voice audio segments.
Optionally, the voice extraction is performed by means of decibel filtering and/or spectrum detection.
Further, the step of segmenting and aligning the second voice signal based on the end-to-end CTC network to obtain a plurality of voice audio segments includes:
framing the second voice signal to obtain a voice frame signal;
for each voice frame signal, determining, based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal;
determining a segmentation position according to an alignment mode;
and segmenting the second voice signal through the segmentation position to obtain a plurality of voice audio signals.
Further, determining, for each voice frame signal and based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal includes:
identifying the voice frame signals based on an end-to-end network of CTC to obtain the character posterior probability corresponding to the voice frame signals;
calculating the posterior probability of all characters by adopting a dynamic programming mode to obtain the maximum joint probability;
and determining the alignment mode of the speech frame signal and the transcribed text corresponding to the second human voice signal based on the maximum joint probability.
Specifically, once the acoustic pre-training model module has finished, the resulting audio is essentially pure human voice, and processing enters the dialogue-act cutting module stage.
The dialogue-act cutting module carries a CTC cutting algorithm: speaking segments are extracted from the speech corpus and the CTC algorithm is applied to produce segmented audio pieces aligned with the corresponding text. Training a speech recognition system requires an alignment between audio and text in order to reduce the complexity of training.
The CTC-based end-to-end cutting and alignment algorithm extracts correct audio-text alignments even when unknown speech fragments exist in the whole audio. It uses a CTC-based end-to-end network that has been trained in advance on data with audio-character alignment. For a given audio, the CTC network generates frame-based character posterior probabilities p_t(c), i.e. the probability that character c is emitted at frame t.
From these posterior probabilities, the maximum joint probability over all possible alignments is computed by dynamic programming, so that each character index j of the transcribed text is aligned to a frame t. The maximum joint probability of a time segment is obtained by taking, in the propagation over the audio timeline, the more probable of two hypotheses: the time point consumes only a blank symbol, or the time point consumes the next character. The transition cost of staying at the first character is set to zero, so that the transcription start point can be aligned to an arbitrary point of the speech file. The character alignment is then calculated by backtracking, starting from the most likely time position of the last character of the transcription; at each step the propagation transition with the highest probability for the time segment is assumed, which determines, for the audio segment, the audio frames and the alignment of the corresponding characters in the text.
Since the algorithm produces, for each audio frame, the probability of being aligned in the given way, a confidence score can be derived for each audio paragraph in order to detect utterances whose speech deviates from the corresponding text. It is computed as follows: the audio frames corresponding to a given pronunciation of the audio training set are first divided into parts of equal length; for each part, the mean of the frame-by-frame alignment probabilities is calculated; and the total probability score of the given utterance is defined as the minimum of these per-part means. For each character position j, only the audio frames within an interval around it are considered, and the cutting position of the speech paragraph is taken as the audio position proportional to the given character position and frame number.
The algorithm uses a CTC-based end-to-end neural network to extract speech segments with precise time alignment, and after cutting out the exact audio position, it has the effect of deriving the natural passage of the natural human utterance, determining the exact start and end positions of a sentence.
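The following Python sketch illustrates, under simplifying assumptions, the kind of frame-level dynamic programming and confidence scoring described above; it assumes the frame-wise character posteriors have already been produced by a CTC network, and it is a simplified monotonic alignment rather than the exact formulation of the embodiment.

```python
# Simplified sketch of CTC-style forced alignment by dynamic programming.
# `log_probs` is a (T, V) array of frame-wise log character posteriors from a
# CTC network; `tokens` is the transcription as vocabulary indices.
import numpy as np

def ctc_align(log_probs: np.ndarray, tokens: list[int], blank: int = 0):
    T, _ = log_probs.shape
    J = len(tokens)
    NEG = -1e30
    # dp[t, j]: best log joint probability of having consumed j characters after frame t.
    dp = np.full((T + 1, J + 1), NEG)
    dp[0, 0] = 0.0
    back = np.zeros((T + 1, J + 1), dtype=np.int8)   # 0 = emitted blank, 1 = consumed next char
    for t in range(1, T + 1):
        for j in range(0, J + 1):
            stay = dp[t - 1, j] + log_probs[t - 1, blank]
            take = dp[t - 1, j - 1] + log_probs[t - 1, tokens[j - 1]] if j > 0 else NEG
            if take > stay:
                dp[t, j], back[t, j] = take, 1
            else:
                dp[t, j], back[t, j] = stay, 0
    # Backtrack from the most likely end frame of the last character.
    frames = [0] * J
    j = J
    t = int(np.argmax(dp[:, J]))
    while j > 0 and t > 0:
        if back[t, j] == 1:
            frames[j - 1] = t - 1
            j -= 1
        t -= 1
    return frames, dp   # frame index assigned to each character, plus the DP table

def confidence(frame_log_probs: np.ndarray, part_len: int = 30) -> float:
    """Confidence of an aligned paragraph: minimum over parts of the mean
    frame-wise alignment probability, as described above."""
    parts = [frame_log_probs[i:i + part_len]
             for i in range(0, len(frame_log_probs), part_len)]
    return float(min(np.exp(p).mean() for p in parts))
```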
After the whole preprocessing module is completed, a plurality of audio clips with time information are obtained.
S202: and (3) carrying out transfer learning on the initial acoustic recognition model by adopting the human voice audio clip to obtain the self-adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of high-resource language.
Specifically, the purpose of unsupervised domain adaptation of the acoustic model is to have the low-resource language share the same acoustic space as the high-resource language. A high-resource language usually has sufficient resources to support a robust machine learning system, and through transfer learning that system can be adapted to the low-resource language.
The method presupposes that the acoustic space spanned by the low-resource language is kept as a subspace of that of the high-resource language; this requires that the phone set of the low-resource language be a subset of that of the high-resource language, that the high-resource language side have sufficient text and transcription data, and that the low-resource language side have limited resources.
Optionally, the initial acoustic recognition model is a hidden markov speech recognition model.
Optionally, migration learning is performed on the initial acoustic recognition model by using a human voice audio clip, so as to obtain an adaptive acoustic recognition model, including:
inputting the voice audio clip into an initial acoustic recognition model for acoustic feature extraction to obtain a first acoustic feature;
carrying out migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature, a first acoustic feature and a gradient inversion layer corresponding to the high-resource language corpus to obtain a self-adaptive acoustic recognition model, wherein the gradient inversion layer comprises a DNN structural feature extractor, a multi-element phoneme classifier and a domain classifier based on the learning acoustic model.
Further, carrying out migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature, a first acoustic feature and a gradient inversion layer corresponding to the high-resource language corpus to obtain an adaptive acoustic recognition model, comprising:
a DNN structural feature extractor based on a learning acoustic model is used for respectively extracting the first acoustic feature and the second acoustic feature to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
aiming at the first structural feature and the second structural feature, adopting a multi-element phoneme classifier to classify phonemes, and generating phoneme labels according to a classification result;
performing domain classification by using a domain classifier aiming at the first structural feature and the second structural feature, and generating a domain label according to a classification result;
and carrying out migration countermeasure training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the self-adaptive acoustic recognition model.
Further, the domain label includes a source domain and a target domain, and the migration countermeasure training is performed on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature, and the obtaining of the adaptive acoustic recognition model includes:
performing public feature extraction and private feature extraction on first structural features and second structural features of tags in different fields by adopting an encoder of a domain separation network to obtain public features and private features corresponding to the tags in each field, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier, and each private feature and each public feature comprise different phoneme tags;
decoding a sample consisting of the private characteristics and the public characteristics through a shared decoder of a domain separation network, and performing classification training on a source domain sample after decoding by adopting a classifier to obtain a trained classifier;
classifying the target domain data by adopting the trained classifier to obtain a target classification;
and performing migration countermeasure training by adopting the maximum mean difference as a loss function according to the common characteristics corresponding to the target classification to obtain the self-adaptive acoustic recognition model.
Specifically, the unsupervised domain-adaptive acoustic modeling method requires constructing a Deep Neural Network (DNN) hidden Markov model and applying it to the speech recognition system, together with an unsupervised domain adaptation framework built around a gradient reversal layer (GRL). After the hidden Markov speech recognition model is constructed, the GRL is used for adversarial training.
In the GRL-based adversarial training, the method learns a domain-invariant feature representation. The network comprises a DNN structural feature extractor based on the learned acoustic model, a senone classifier and a domain classifier, as shown in the block diagram of the unsupervised domain-adaptive baseline scheme with a gradient reversal layer (GRL) in FIG. 4.
Here θ_f, θ_y and θ_d represent the parameters of the feature extractor, the senone classifier and the domain classifier, respectively. The feature extractor M_f maps the input acoustic features x to an internal representation f = M_f(x; θ_f). The senone classifier M_y maps the output of the feature extractor to senone labels, while the domain classifier M_d maps it to domain labels. By optimizing the feature extractor M_f and the senone classifier M_y, the network is trained to minimize the senone classification loss during the training phase; this enables the network to find a function that can distinguish the senone labels. In order to make the features domain-invariant, the parameters of the feature extractor are optimized to maximize the domain classification loss, while at the same time the domain classifier M_d is optimized to minimize that classification loss. This is achieved by introducing the GRL between the feature extractor M_f and the domain classifier M_d: the GRL acts as the identity in the forward pass and reverses the gradient coming from the subsequent layers in the backward pass. During inference, the domain classifier and the GRL are ignored; the acoustic feature vector passes through the feature extractor and the senone classifier, which predicts the senone label.
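By way of illustration, a minimal PyTorch sketch of a gradient reversal layer together with the feature extractor, senone classifier and domain classifier heads is given below; layer sizes, the reversal coefficient and the class counts are assumptions of the sketch, not values prescribed by the embodiment.

```python
# Minimal PyTorch sketch of a gradient reversal layer (GRL) and the
# feature-extractor / senone-classifier / domain-classifier heads.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)                              # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None            # reversed gradient in the backward pass

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

class AdversarialAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, n_senones=3000, n_domains=2):
        super().__init__()
        self.extractor = nn.Sequential(                  # DNN structural feature extractor M_f
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.senone_head = nn.Linear(hidden, n_senones)  # senone classifier M_y
        self.domain_head = nn.Linear(hidden, n_domains)  # domain classifier M_d

    def forward(self, x, lambd: float = 1.0):
        f = self.extractor(x)
        senone_logits = self.senone_head(f)
        domain_logits = self.domain_head(grad_reverse(f, lambd))  # GRL before M_d
        return senone_logits, domain_logits

# Training step sketch: senone loss on labeled source frames, domain loss on both domains:
# loss = ce(senone_logits_src, senone_labels) + ce(domain_logits, domain_labels)
```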
All domains have both shared (public) features and private features; if the private features of each domain are also transferred, negative transfer occurs.
To address this, the method employs a Domain Separation Network (DSN). The underlying assumption is that of unsupervised transfer learning, that is, the target-domain data is unlabeled while the source-domain data is labeled. The main work of the domain separation network is twofold: separating the private features of each domain from the shared ones, and extracting the features common to the different domains and performing the transfer with these common features. The overall network structure is shown in fig. 5, which is the structural diagram of the domain separation network of this embodiment.
The main structure of the domain separation network resembles an autoencoder. The whole network can be divided into the following five parts:
Private Target Encoder E_p^t: the target-domain private encoder, used to extract the private features of the target domain.
Private Source Encoder E_p^s: the source-domain private encoder, used to extract the private features of the source domain.
Shared Encoder E_c: the shared encoder, used to extract the features common to the source domain and the target domain.
Shared Decoder D: the shared decoder, used to decode samples composed of the private features and the public features.
Classifier G: the classifier, used to classify the source-domain samples during training; after training it can be applied directly to classify the target-domain data.
Starting from the encoder side, a source-domain sample x^s first enters the private source encoder E_p^s and the shared encoder E_c, and the two encoders output h_p^s and h_c^s respectively; these two outputs correspond to the private and the public features of the source-domain data. Similarly, h_p^t and h_c^t correspond to the private and the public features of the target-domain data. From the principle above it follows that, for the domain separation network to work well, the public part must consist of features genuinely common to the source domain and the target domain, and the public and private parts must be completely separated in order to avoid negative transfer. Two loss functions are used to constrain E_p^t, E_p^s and E_c.
The first is the difference loss L_difference, which constrains each encoder so that the private and public representations remain separated. In addition, the domain of a sample should not be predictable from its shared representation: while the parameters of the domain classifier are trained to minimize the domain classification loss, the shared encoder is trained through the GRL module to maximize it. Once this objective is satisfied, the transfer can be carried out, that is, h_c^s and h_c^t are matched and their similarity is increased.
The second loss function is the Maximum Mean Discrepancy (MMD), which is frequently used in transfer learning; it measures the magnitude of the difference between h_c^s and h_c^t. While ensuring that h_p^s and h_p^t are completely separated from h_c^s and h_c^t, and that the distributions of h_c^s and h_c^t become approximately equal, the integrity of the source-domain and target-domain data must also be preserved. The encoding-decoding structure guarantees both the extraction of the public and private features and the completeness and validity of those features, through a reconstruction loss L_recon computed as a mean squared error between each input sample and its reconstruction:
L_recon = (1/N) Σ_i ||x_i − x̂_i||²
For a source-domain sample, the Shared Decoder D decodes the combined features to obtain the reconstruction x̂^s, and the reconstruction loss is minimized during training. As the source-domain sample passes into the encoders and out of the decoder, the two encoders extract the private and the public features respectively, while the decoder requires the combination of public and private features to reassemble a complete source-domain sample, which guarantees the integrity of the features. The same holds when a target-domain sample enters the network.
Finally, this encoding-decoding unsupervised procedure extracts the public and private features of the source and target domains, but the samples still have to be classified, so a classifier G is added. In the preceding steps the autoencoder has already extracted the part common to the source and target domains and brought the two close in distribution by minimizing the similarity loss. A classifier that is valid on the common part of the source domain is therefore also valid on the common part of the target domain, so the classifier only needs to be trained with the labeled source-domain data, and once training is complete it can be applied directly to the target domain.
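A compressed PyTorch sketch of how the loss terms described above (source-domain classification loss, MMD similarity loss, difference loss and reconstruction loss) might be combined is given below; the linear-kernel MMD, the orthogonality-based difference loss and the weighting coefficients are illustrative assumptions rather than details fixed by the embodiment.

```python
# Sketch of the domain separation network loss terms described above.
# Encoders, decoder and classifier are assumed to be defined elsewhere;
# weights alpha/beta/gamma and the linear-kernel MMD are illustrative.
import torch
import torch.nn.functional as F

def difference_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    """Encourage orthogonality between shared and private representations."""
    s = F.normalize(shared, dim=1)
    p = F.normalize(private, dim=1)
    return (s.t() @ p).pow(2).sum()

def mmd_loss(h_src: torch.Tensor, h_tgt: torch.Tensor) -> torch.Tensor:
    """Maximum mean discrepancy with a linear kernel (illustrative choice)."""
    return (h_src.mean(0) - h_tgt.mean(0)).pow(2).sum()

def dsn_loss(x_src, y_src, x_tgt, enc_shared, enc_priv_src, enc_priv_tgt,
             decoder, classifier, alpha=0.1, beta=0.1, gamma=1.0):
    hc_s, hc_t = enc_shared(x_src), enc_shared(x_tgt)        # shared (public) features
    hp_s, hp_t = enc_priv_src(x_src), enc_priv_tgt(x_tgt)    # private features

    # 1. classification loss on labeled source-domain samples only
    cls = F.cross_entropy(classifier(hc_s), y_src)
    # 2. similarity loss: align shared source/target features (MMD)
    sim = mmd_loss(hc_s, hc_t)
    # 3. difference loss: keep shared and private parts separated
    diff = difference_loss(hc_s, hp_s) + difference_loss(hc_t, hp_t)
    # 4. reconstruction loss: shared + private must rebuild the input (MSE)
    rec = F.mse_loss(decoder(hc_s + hp_s), x_src) + F.mse_loss(decoder(hc_t + hp_t), x_tgt)

    return cls + alpha * sim + beta * diff + gamma * rec
```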
S203: and identifying a plurality of human voice audio frequency segments according to the self-adaptive acoustic identification model to obtain acoustic identification characteristics.
S204: and performing semantic recognition on the acoustic recognition features based on the context neural language model to obtain a voice recognition result.
Optionally, semantically recognizing the acoustic recognition features by the context-based neural language model, and obtaining a speech recognition result includes:
inputting the acoustic recognition features as an input sequence into a long-term and short-term memory network;
modeling the probability of each vocabulary of the input sequence by adopting a long-term and short-term memory network to obtain the word-level probability of each vocabulary, and determining the fixed size embedding of each vocabulary as a fixed embedding vector based on the word-level probability;
generating a word-by-word embedding vector on the context representation by a context-embedded standard word embedding module with a context-carried neural language model, and embedding by adopting a fixed embedding vector to obtain an enhanced sequence corresponding to an input sequence;
and performing semantic recognition on the enhanced sequence through a mask language model to obtain a voice recognition result.
Optionally, the fusion layer of the masked language model is a Transformer-XL architecture, and the decoder includes a multi-head attention layer with residual connections and fully connected layers.
Specifically, the probability of each word of the input sequence in the speech recognition system is first modeled by an LSTM-based neural language model. The end result is a fixed-size embedding for each word: the LSTM output is projected to word-level outputs, which are then converted into the final word-level probabilities by a softmax layer.
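A minimal PyTorch sketch of such an LSTM word-level language model is shown below; the vocabulary size and layer dimensions are assumed values for illustration only.

```python
# Minimal LSTM word-level language model sketch; sizes are illustrative.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # fixed-size embedding per word
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)      # project LSTM output to word level

    def forward(self, tokens: torch.Tensor):
        h, _ = self.lstm(self.embed(tokens))               # (batch, seq, hidden)
        return torch.log_softmax(self.proj(h), dim=-1)     # word-level probabilities

# log_probs = LSTMLanguageModel()(torch.randint(0, 10000, (2, 16)))
```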
Next, feature enhancement based on context embeddings is performed. A neural language model with Context Carrying (CCO) built on context embeddings tends either to concentrate too heavily on the nearby context or to understand the distant context only vaguely; the method therefore enhances the input features so that several kinds of context embedding can be used. First, average word embedding is performed: a standard word embedding module generates word-by-word embedding vectors over the context representation. Then a fixed-length embedding is obtained by averaging these embedding vectors over the whole context sequence.
Next, training is performed with a Transformer-XL model architecture carrying a masked language model (MLM) fusion layer. When a new segment is processed, the previous segment is reused as extended context. The final model consists of a stack of decoder modules, each composed of a multi-head attention (MHA) layer with residual connections and a set of fully connected (FC) layers, followed by linear and softmax layers. The MHA layer uses an upper-triangular mask to guarantee unidirectionality during decoding. The MLM is then fused with the Transformer-XL.
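A rough sketch of one such decoder module, with multi-head attention, residual connections, fully connected layers and an upper-triangular causal mask, is given below; the dimensions are assumed, and the Transformer-XL segment-level recurrence is omitted for brevity.

```python
# Sketch of a single decoder block with causal (upper-triangular) masking;
# the Transformer-XL segment recurrence is omitted for brevity.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        attn, _ = self.mha(x, x, x, attn_mask=causal)   # upper-triangular mask: unidirectional
        x = self.norm1(x + attn)                         # residual connection
        return self.norm2(x + self.ff(x))                # residual + fully connected layers
```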
The fusion of the semantic embedding derived from the masked language model (MLM) with the Transformer-XL can be divided into early fusion and late fusion. Early Fusion combines the input modalities by joining the input word embedding and the BERT-derived embedding at each time frame; in this fused form, the hidden states from the last layer of the Transformer-XL decoder are connected to the BERT-derived embedding through a connection layer. Late Fusion training regulates the flow of information between the Transformer-XL and the BERT-derived embeddings; in this embedding domain, the output of the fusion layer is passed to a linear layer, which in turn predicts the next word in the sequence through a softmax.
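The late-fusion variant could be sketched as follows; the sigmoid gating used to regulate the information flow and the dimensions are assumptions of this example, not details stated in the embodiment.

```python
# Late-fusion sketch: fuse the decoder hidden state with an MLM/BERT-derived
# embedding, then project to the vocabulary. The gating mechanism is an
# illustrative choice, not prescribed by the description above.
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, d_model=512, d_bert=768, vocab_size=10000):
        super().__init__()
        self.fuse = nn.Linear(d_model + d_bert, d_model)   # fusion layer
        self.gate = nn.Linear(d_model + d_bert, d_model)   # regulates information flow
        self.out = nn.Linear(d_model, vocab_size)          # linear layer before softmax

    def forward(self, hidden: torch.Tensor, bert_embed: torch.Tensor):
        joint = torch.cat([hidden, bert_embed], dim=-1)
        g = torch.sigmoid(self.gate(joint))
        fused = g * torch.tanh(self.fuse(joint)) + (1 - g) * hidden
        return torch.log_softmax(self.out(fused), dim=-1)  # next-word probabilities
```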
The effect of this model is that it adapts effectively to the rich semantic information concentrated in the BERT embeddings generated from the speech modality provided by the user, and, by combining context-carried information with these features, it can significantly improve the neural language model in a task-oriented dialog system.
In this embodiment, acoustic preprocessing is performed on the speech signal to be recognized to obtain a plurality of human voice audio segments; the human voice audio segments are used to perform transfer learning on an initial acoustic recognition model of a high-resource language to obtain an adaptive acoustic recognition model; the plurality of human voice audio segments are recognized with the adaptive acoustic recognition model to obtain acoustic recognition features; and semantic recognition is performed on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result. Speech information of a low-resource language is thereby extracted from a complex environment and semantically recognized, improving the accuracy of speech recognition.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 2 shows a schematic block diagram of a speech recognition apparatus in one-to-one correspondence with the speech recognition methods of the above-described embodiments. As shown in fig. 2, the speech recognition apparatus includes an acoustic preprocessing module 31, a migration learning module 32, an acoustic recognition module 33, and a semantic recognition module 34. The detailed description of each functional module is as follows:
the acoustic preprocessing module 31 is configured to perform acoustic preprocessing on a voice signal to be recognized to obtain a plurality of voice audio segments;
the transfer learning module 32 is configured to perform transfer learning on the initial acoustic recognition model by using the human voice audio clip to obtain an adaptive acoustic recognition model, where the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
the acoustic recognition module 33 is configured to recognize the plurality of human voice audio segments according to the adaptive acoustic recognition model to obtain acoustic recognition features;
and the semantic recognition module 34 is configured to perform semantic recognition on the acoustic recognition features based on the contextual neural language model to obtain a speech recognition result.
Optionally, the acoustic preprocessing module 31 includes:
the voice extraction submodule is used for extracting voice of the voice signal to be recognized to obtain a first voice signal;
the recognition training submodule is used for recognizing and training the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and the segmentation alignment submodule is used for segmenting and aligning the second voice signal based on the CTC end-to-end network to obtain a plurality of voice audio segments.
Optionally, the partitioning alignment sub-module comprises:
the signal framing unit is used for framing the second voice signal to obtain a voice frame signal;
the signal alignment unit is used for determining, for each voice frame signal and based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal;
a division position determining unit for determining a division position according to the alignment mode;
and the signal dividing unit is used for dividing the second voice signal through the dividing position to obtain a plurality of voice audio signals.
Optionally, the segmentation position determination unit includes:
the signal identification subunit is used for identifying the voice frame signals based on an end-to-end network of CTC to obtain the character posterior probability corresponding to the voice frame signals;
the dynamic programming subunit is used for calculating the posterior probabilities of all the characters in a dynamic programming mode to obtain the maximum joint probability;
and the signal alignment subunit is used for determining the alignment mode of the speech frame signal and the transcribed text corresponding to the second human voice signal based on the maximum joint probability.
Optionally, the migration learning module 32 includes:
the acoustic feature extraction submodule is used for inputting the human voice audio clip into the initial acoustic recognition model to carry out acoustic feature extraction so as to obtain a first acoustic feature;
and the migration countermeasure training sub-module is used for performing migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature, a first acoustic feature and a gradient inversion layer corresponding to the high-resource language corpus to obtain a self-adaptive acoustic recognition model, wherein the gradient inversion layer comprises a DNN structural feature extractor, a multi-element phoneme classifier and a domain classifier based on the learning acoustic model.
Optionally, the transfer adversarial training submodule comprises:
the feature extraction unit is used for performing feature extraction on the first acoustic feature and the second acoustic feature respectively by using the DNN structural feature extractor based on the learned acoustic model to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
the first label generating unit is used for performing phoneme classification on the first structural feature and the second structural feature by adopting the multi-element phoneme classifier, and generating phoneme labels according to the classification result;
the second label generating unit is used for performing domain classification on the first structural feature and the second structural feature by adopting the domain classifier, and generating a domain label according to the classification result;
and the model training unit is used for performing transfer adversarial training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the adaptive acoustic recognition model, as illustrated in the sketch following this list.
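For intuition only, the following PyTorch sketch shows how a gradient reversal layer can couple a DNN structural feature extractor with a multi-element phoneme classifier and a domain classifier for transfer adversarial training. The layer sizes, the equal loss weighting and the helper names (GradReverse, AdversarialAcousticModel, train_step) are assumptions made for this illustration and are not taken from the patent.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_phones=100, lam=0.1):
        super().__init__()
        self.lam = lam
        self.extractor = nn.Sequential(                # DNN structural feature extractor
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.phone_clf = nn.Linear(hidden, n_phones)   # multi-element phoneme classifier
        self.domain_clf = nn.Linear(hidden, 2)         # source domain vs. target domain

    def forward(self, x):
        h = self.extractor(x)
        return self.phone_clf(h), self.domain_clf(GradReverse.apply(h, self.lam))

def train_step(model, optimizer, src_feat, src_phones, tgt_feat):
    """One adversarial update: phoneme loss on source data, domain loss on both domains."""
    ce = nn.CrossEntropyLoss()
    phone_logits_s, dom_logits_s = model(src_feat)   # second acoustic features (high-resource source)
    _, dom_logits_t = model(tgt_feat)                # first acoustic features (target segments)
    dom_labels = torch.cat([torch.zeros(len(src_feat), dtype=torch.long),
                            torch.ones(len(tgt_feat), dtype=torch.long)])
    loss = ce(phone_logits_s, src_phones) + ce(torch.cat([dom_logits_s, dom_logits_t]), dom_labels)
    optimizer.zero_grad()
    loss.backward()   # the reversal layer flips the domain gradient seen by the extractor
    optimizer.step()
    return loss.item()
```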
Optionally, the domain label includes a source domain and a target domain, and the model training unit includes:
the common feature extraction subunit is used for performing common feature extraction and private feature extraction on the first structural features and the second structural features with different domain labels by adopting the encoders of a domain separation network to obtain the common features and private features corresponding to each domain label, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier;
the classifier training subunit is used for decoding samples consisting of the private features and the common features through the shared decoder of the domain separation network, and performing classification training on the decoded source domain samples by adopting the classifier to obtain a trained classifier, wherein each private feature and each common feature comprise different phoneme labels;
the data classification subunit is used for classifying the target domain data by adopting the trained classifier to obtain a target classification;
and the transfer adversarial training subunit is used for performing transfer adversarial training on the common features corresponding to the target classification by adopting the maximum mean discrepancy as the loss function to obtain the adaptive acoustic recognition model, as illustrated in the sketch below.
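The domain separation and maximum mean discrepancy (MMD) ideas can likewise be pictured with a compact, assumption-laden sketch: shared and private encoders produce common and private features, a shared decoder reconstructs the inputs, a classifier is applied on the source side, and an MMD term pulls the source and target common features together. To keep the example short, the classifier here operates on the source common features directly rather than on decoded samples; that simplification, together with the dimensions, kernel bandwidths and class names, is an illustrative choice and not the patented implementation.

```python
import torch
import torch.nn as nn

def mmd_loss(x, y, bandwidths=(1.0, 2.0, 4.0)):
    """A simple (biased) multi-kernel MMD estimate between two batches of features."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2 * bw ** 2)) for bw in bandwidths)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

class DomainSeparationNet(nn.Module):
    def __init__(self, dim=256, hidden=128, n_phones=100):
        super().__init__()
        self.shared_enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())   # common features
        self.src_private = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())  # source-private encoder
        self.tgt_private = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())  # target-private encoder
        self.decoder = nn.Linear(2 * hidden, dim)                            # shared decoder
        self.classifier = nn.Linear(hidden, n_phones)                        # phoneme-label classifier

    def forward(self, src_x, tgt_x):
        src_c, tgt_c = self.shared_enc(src_x), self.shared_enc(tgt_x)
        src_p, tgt_p = self.src_private(src_x), self.tgt_private(tgt_x)
        recon_src = self.decoder(torch.cat([src_c, src_p], dim=-1))   # reconstruct source samples
        recon_tgt = self.decoder(torch.cat([tgt_c, tgt_p], dim=-1))   # reconstruct target samples
        logits_src = self.classifier(src_c)                           # classification on the source side
        align = mmd_loss(src_c, tgt_c)                                # MMD on the common features
        return logits_src, recon_src, recon_tgt, align
```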
Optionally, the semantic recognition module 34 includes:
the feature input submodule is used for inputting the acoustic recognition features as an input sequence into a long short-term memory network;
the vector calculation submodule is used for modeling the probability of each word in the input sequence by adopting the long short-term memory network to obtain the word-level probability of each word, and determining a fixed-size embedding of each word as a fixed embedding vector based on the word-level probability;
the sequence enhancement submodule is used for generating word-by-word embedding vectors over the context representation through the context-embedding standard word embedding module of the contextual neural language model, and performing embedding with the fixed embedding vectors to obtain an enhanced sequence corresponding to the input sequence;
and the semantic recognition submodule is used for performing semantic recognition on the enhanced sequence through the masked language model to obtain the speech recognition result (an illustrative sketch follows).
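To make this flow concrete, the sketch below uses an LSTM to model word-level probabilities over the acoustic hypothesis, concatenates a fixed word embedding with the word-by-word contextual state to form the enhanced sequence, and rescores it with a stand-in fusion layer built from multi-head attention with a residual connection (loosely echoing the Transformer-XL style decoder of claim 10). The vocabulary size, dimensions and module names are assumptions for illustration only, not the patented model.

```python
import torch
import torch.nn as nn

class ContextualRescorer(nn.Module):
    def __init__(self, vocab=1000, fixed_dim=64, ctx_dim=64, heads=4):
        super().__init__()
        self.fixed_emb = nn.Embedding(vocab, fixed_dim)            # fixed-size word embeddings
        self.lstm = nn.LSTM(fixed_dim, ctx_dim, batch_first=True)  # word-level probability model
        self.word_proj = nn.Linear(ctx_dim, vocab)
        # Stand-in fusion layer: multi-head attention with a residual connection.
        self.attn = nn.MultiheadAttention(fixed_dim + ctx_dim, heads, batch_first=True)
        self.out = nn.Linear(fixed_dim + ctx_dim, vocab)

    def forward(self, token_ids):
        fixed = self.fixed_emb(token_ids)                  # (B, T, fixed_dim) fixed embedding vectors
        ctx, _ = self.lstm(fixed)                          # word-by-word contextual states
        word_probs = torch.softmax(self.word_proj(ctx), dim=-1)   # word-level probabilities
        enhanced = torch.cat([fixed, ctx], dim=-1)         # enhanced sequence
        fused, _ = self.attn(enhanced, enhanced, enhanced)
        fused = fused + enhanced                           # residual connection
        return word_probs, torch.log_softmax(self.out(fused), dim=-1)

if __name__ == "__main__":
    model = ContextualRescorer()
    hyp = torch.randint(0, 1000, (1, 7))                   # a 7-token acoustic hypothesis
    word_probs, rescored = model(hyp)
    print(word_probs.shape, rescored.shape)
```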
For the specific limitations of the speech recognition apparatus, reference may be made to the above limitations of the speech recognition method, which are not repeated here. Each module in the above speech recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the above modules.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Referring specifically to fig. 3, fig. 3 is a block diagram of the basic structure of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43, which are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the components memory 41, processor 42 and network interface 43 is shown, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device or the like.
The memory 41 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., an SD or DX memory card), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or an internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the computer device 4. Of course, the memory 41 may also include both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is generally used for storing the operating system installed on the computer device 4 and various types of application software, such as the program code of the speech recognition method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be, in some embodiments, a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program code stored in the memory 41 or to process data, for example to run the program code of the speech recognition method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing a computer program, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the speech recognition method as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and the appended drawings show preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of the features therein. All equivalent structures made by using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (13)

1. A speech recognition method, characterized in that the speech recognition method comprises:
performing acoustic preprocessing on a voice signal to be recognized to obtain a plurality of human voice audio segments;
performing transfer learning on an initial acoustic recognition model by adopting the human voice audio segments to obtain an adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
recognizing the plurality of human voice audio segments according to the adaptive acoustic recognition model to obtain acoustic recognition features;
performing semantic recognition on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result;
wherein the performing transfer learning on the initial acoustic recognition model by adopting the human voice audio segments to obtain the adaptive acoustic recognition model comprises:
inputting the human voice audio segments into the initial acoustic recognition model for acoustic feature extraction to obtain a first acoustic feature;
and performing transfer adversarial training on the initial acoustic recognition model by adopting a second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and a gradient reversal layer to obtain the adaptive acoustic recognition model, wherein the gradient reversal layer comprises a DNN structural feature extractor based on a learned acoustic model, a multi-element phoneme classifier and a domain classifier.
2. The speech recognition method of claim 1, wherein the performing acoustic preprocessing on the voice signal to be recognized to obtain a plurality of human voice audio segments comprises:
performing human voice extraction on the voice signal to be recognized to obtain a first human voice signal;
performing recognition training on the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and segmenting and aligning the second human voice signal based on the CTC end-to-end network to obtain the plurality of human voice audio segments.
3. The speech recognition method of claim 2, wherein the human voice extraction is performed by means of decibel filtering and/or spectral detection.
4. The speech recognition method of claim 2, wherein the segmenting and aligning the second human voice signal based on the CTC end-to-end network to obtain the plurality of human voice audio segments comprises:
performing framing processing on the second human voice signal to obtain voice frame signals;
for each voice frame signal, determining an alignment mode between the voice frame signal and the transcribed text corresponding to the second human voice signal based on the CTC end-to-end network;
determining a segmentation position according to the alignment mode;
and segmenting the second human voice signal at the segmentation position to obtain the plurality of human voice audio segments.
5. The speech recognition method of claim 4, wherein the determining, for each voice frame signal, the alignment mode between the voice frame signal and the transcribed text corresponding to the second human voice signal based on the CTC end-to-end network comprises:
recognizing the voice frame signal based on the CTC end-to-end network to obtain the character posterior probabilities corresponding to the voice frame signal;
calculating the maximum joint probability over all the character posterior probabilities by dynamic programming;
and determining the alignment mode between the voice frame signal and the transcribed text corresponding to the second human voice signal based on the maximum joint probability.
6. A speech recognition method according to any one of claims 1 to 5, wherein the initial acoustic recognition model is a hidden Markov speech recognition model.
7. The speech recognition method of claim 1, wherein the performing transfer adversarial training on the initial acoustic recognition model by adopting the second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and the gradient reversal layer to obtain the adaptive acoustic recognition model comprises:
performing feature extraction on the first acoustic feature and the second acoustic feature respectively by using the DNN structural feature extractor based on the learned acoustic model to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
performing phoneme classification on the first structural feature and the second structural feature by adopting the multi-element phoneme classifier, and generating phoneme labels according to the classification result;
performing domain classification on the first structural feature and the second structural feature by adopting the domain classifier, and generating a domain label according to the classification result;
and performing transfer adversarial training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the adaptive acoustic recognition model.
8. The speech recognition method of claim 7, wherein the domain labels comprise a source domain and a target domain, and the performing transfer adversarial training on the initial acoustic recognition model based on the domain labels, the phoneme labels, the first structural features and the second structural features to obtain the adaptive acoustic recognition model comprises:
performing common feature extraction and private feature extraction on the first structural features and the second structural features with different domain labels by adopting the encoders of a domain separation network to obtain the common features and private features corresponding to each domain label, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier, and each private feature and each common feature comprise different phoneme labels;
decoding samples consisting of the private features and the common features through the shared decoder of the domain separation network, and performing classification training on the decoded source domain samples by adopting the classifier to obtain a trained classifier;
classifying the target domain data by adopting the trained classifier to obtain a target classification;
and performing transfer adversarial training on the common features corresponding to the target classification by adopting the maximum mean discrepancy as the loss function to obtain the adaptive acoustic recognition model.
9. The speech recognition method of any one of claims 1 to 5, wherein the performing semantic recognition on the acoustic recognition features based on the contextual neural language model to obtain the speech recognition result comprises:
inputting the acoustic recognition features as an input sequence into a long short-term memory network;
modeling the probability of each word in the input sequence by adopting the long short-term memory network to obtain the word-level probability of each word, and determining a fixed-size embedding of each word as a fixed embedding vector based on the word-level probability;
generating word-by-word embedding vectors over the context representation through the context-embedding standard word embedding module of the contextual neural language model, and performing embedding with the fixed embedding vectors to obtain an enhanced sequence corresponding to the input sequence;
and performing semantic recognition on the enhanced sequence through a masked language model to obtain the speech recognition result.
10. The speech recognition method of claim 9, wherein the fusion layer of the masked language model adopts a Transformer-XL architecture, and the decoder comprises a multi-head attention layer with residual connections and fully connected layers.
11. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises:
the acoustic preprocessing module is used for performing acoustic preprocessing on a voice signal to be recognized to obtain a plurality of human voice audio segments;
the transfer learning module is used for performing transfer learning on an initial acoustic recognition model by adopting the human voice audio segments to obtain an adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
the acoustic recognition module is used for recognizing the plurality of human voice audio segments according to the adaptive acoustic recognition model to obtain acoustic recognition features;
the semantic recognition module is used for performing semantic recognition on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result;
wherein the transfer learning module comprises:
the acoustic feature extraction submodule is used for inputting the human voice audio segments into the initial acoustic recognition model to perform acoustic feature extraction to obtain a first acoustic feature;
and the transfer adversarial training submodule is used for performing transfer adversarial training on the initial acoustic recognition model by adopting a second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and a gradient reversal layer to obtain the adaptive acoustic recognition model, wherein the gradient reversal layer comprises a DNN structural feature extractor based on a learned acoustic model, a multi-element phoneme classifier and a domain classifier.
12. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech recognition method according to any one of claims 1 to 10 when executing the computer program.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the speech recognition method according to any one of claims 1 to 10.
CN202210916222.2A 2022-08-01 2022-08-01 Voice recognition method, device, equipment and medium Active CN114999463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210916222.2A CN114999463B (en) 2022-08-01 2022-08-01 Voice recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210916222.2A CN114999463B (en) 2022-08-01 2022-08-01 Voice recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114999463A CN114999463A (en) 2022-09-02
CN114999463B true CN114999463B (en) 2022-11-15

Family

ID=83021277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210916222.2A Active CN114999463B (en) 2022-08-01 2022-08-01 Voice recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114999463B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233540B (en) * 2023-03-10 2024-04-02 北京富通亚讯网络信息技术有限公司 Parallel signal processing method and system based on video image recognition

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107481717A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of acoustic training model method and system
CN108682417A (en) * 2018-05-14 2018-10-19 中国科学院自动化研究所 Small data Speech acoustics modeling method in speech recognition
CN109003601A (en) * 2018-08-31 2018-12-14 北京工商大学 A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN109326282A (en) * 2018-10-10 2019-02-12 内蒙古工业大学 A kind of small-scale corpus DNN-HMM acoustics training structure
WO2021183229A1 (en) * 2020-03-13 2021-09-16 Microsoft Technology Licensing, Llc Cross-speaker style transfer speech synthesis
CN113574595A (en) * 2019-03-25 2021-10-29 三菱电机株式会社 System and method for end-to-end speech recognition with triggered attention

Also Published As

Publication number Publication date
CN114999463A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN107305768B (en) Error-prone character calibration method in voice interaction
US10074363B2 (en) Method and apparatus for keyword speech recognition
CN106683677B (en) Voice recognition method and device
CN110706690A (en) Speech recognition method and device
WO2017127296A1 (en) Analyzing textual data
CN110634469B (en) Speech signal processing method and device based on artificial intelligence and storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN110853628A (en) Model training method and device, electronic equipment and storage medium
JP2023545988A (en) Transformer transducer: One model that combines streaming and non-streaming speech recognition
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN112562640A (en) Multi-language speech recognition method, device, system and computer readable storage medium
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN112331229A (en) Voice detection method, device, medium and computing equipment
CN114999463B (en) Voice recognition method, device, equipment and medium
CN112331207A (en) Service content monitoring method and device, electronic equipment and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN114254587A (en) Topic paragraph dividing method and device, electronic equipment and storage medium
Chadha et al. Current Challenges and Application of Speech Recognition Process using Natural Language Processing: A Survey
CN116052655A (en) Audio processing method, device, electronic equipment and readable storage medium
CN112037772B (en) Response obligation detection method, system and device based on multiple modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant