CN114999463B - Voice recognition method, device, equipment and medium - Google Patents

Voice recognition method, device, equipment and medium Download PDF

Info

Publication number
CN114999463B
CN114999463B (application CN202210916222.2A)
Authority
CN
China
Prior art keywords
acoustic
voice
model
recognition
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210916222.2A
Other languages
Chinese (zh)
Other versions
CN114999463A
Inventor
古树桦
林余楚
周艺恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyi Information Technology Zhuhai Co ltd
Original Assignee
Shenyi Information Technology Zhuhai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyi Information Technology Zhuhai Co ltd filed Critical Shenyi Information Technology Zhuhai Co ltd
Priority to CN202210916222.2A priority Critical patent/CN114999463B/en
Publication of CN114999463A publication Critical patent/CN114999463A/en
Application granted granted Critical
Publication of CN114999463B publication Critical patent/CN114999463B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/04 Segmentation; Word boundary detection
                    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 Training
                    • G10L 15/08 Speech classification or search
                        • G10L 15/16 Speech classification or search using artificial neural networks
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 40/00 Handling natural language data
                    • G06F 40/30 Semantic analysis
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/08 Learning methods
                            • G06N 3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech recognition method, apparatus, device and medium. The method comprises: performing acoustic preprocessing on a speech signal to be recognized to obtain a plurality of human voice audio segments; performing transfer learning on an initial acoustic recognition model using the human voice audio segments to obtain an adaptive acoustic recognition model, the initial acoustic recognition model being an acoustic recognition model of a high-resource language; recognizing the plurality of human voice audio segments with the adaptive acoustic recognition model to obtain acoustic recognition features; and performing semantic recognition on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result. The method improves the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device and medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, device, and medium.
Background
Speech recognition technology has gradually matured and has reached a practical level in specific fields and environments. However, under conditions such as free speaking, high noise, simultaneous speakers and far-field acoustics, machine recognition performance is still far from satisfactory.
Against these uncertain factors, existing speech recognition technology suffers the following drawbacks: noise, speaker variability and personal accent in free-speaking scenarios still degrade recognition when such audio is used as input; and for low-resource languages and dialects the available data is too scarce to compensate, so the recognition of the speaker's speech and semantics remains poor.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method, a voice recognition device, computer equipment and a storage medium, which are used for improving the accuracy of semantic recognition.
In order to solve the foregoing technical problem, an embodiment of the present application provides a speech recognition method, which is applied to a distributed cluster, where the distributed cluster includes a plurality of node servers, and the speech recognition method includes:
carrying out acoustic preprocessing on a voice signal to be recognized to obtain a plurality of voice audio segments;
adopting the human voice audio clip to perform transfer learning on an initial acoustic recognition model to obtain a self-adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
recognizing the plurality of voice audio segments according to the self-adaptive acoustic recognition model to obtain acoustic recognition characteristics;
and performing semantic recognition on the acoustic recognition features based on a context neural language model to obtain a voice recognition result.
Optionally, the acoustic preprocessing is performed on the speech signal to be recognized to obtain a plurality of human voice audio segments, including:
carrying out voice extraction on the voice signal to be recognized to obtain a first voice signal;
carrying out recognition training on the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and based on the CTC end-to-end network, segmenting and aligning the second voice signal to obtain a plurality of voice audio segments.
Optionally, the voice extraction is performed by means of decibel filtering and/or frequency spectrum detection.
Optionally, the end-to-end CTC-based network segmenting and aligning the second voice signal to obtain a plurality of voice audio segments includes:
performing framing processing on the second voice signal to obtain a voice frame signal;
for each voice frame signal, determining, based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal;
determining a segmentation position according to the alignment mode;
and segmenting the second voice signal through the segmentation position to obtain a plurality of voice audio signals.
Optionally, the determining, for each voice frame signal and based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal includes:
identifying the voice frame signal based on the end-to-end network of the CTC to obtain the character posterior probability corresponding to the voice frame signal;
calculating the posterior probabilities of all the characters by adopting a dynamic programming mode to obtain the maximum joint probability;
and determining the alignment mode of the speech frame signal and the transcribed text corresponding to the second human voice signal based on the maximum joint probability.
Optionally, the initial acoustic recognition model is a hidden markov speech recognition model.
Optionally, the performing transfer learning on the initial acoustic recognition model by using the human voice audio segment to obtain an adaptive acoustic recognition model includes:
inputting the human voice audio clip into the initial acoustic recognition model for acoustic feature extraction to obtain a first acoustic feature;
and performing migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and a gradient inversion layer to obtain the self-adaptive acoustic recognition model, wherein the gradient inversion layer comprises a DNN structural feature extractor, a multi-element phoneme classifier and a domain classifier based on a learning acoustic model.
Optionally, the performing migration countermeasure training on the initial acoustic recognition model by using a second acoustic feature, the first acoustic feature, and a gradient inversion layer corresponding to the high-resource language corpus to obtain the adaptive acoustic recognition model includes:
a DNN structural feature extractor based on a learning acoustic model is used for respectively extracting the first acoustic feature and the second acoustic feature to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
aiming at the first structural feature and the second structural feature, adopting the multi-element phoneme classifier to classify phonemes, and generating phoneme labels according to a classification result;
for the first structural feature and the second structural feature, performing domain classification by using the domain classifier, and generating a domain label according to a classification result;
and performing migration countermeasure training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the self-adaptive acoustic recognition model.
Optionally, the domain label includes a source domain and a target domain, and the performing migration countermeasure training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature, and the second structural feature to obtain the adaptive acoustic recognition model includes:
performing public feature extraction and private feature extraction on first structural features and second structural features of tags in different fields by adopting an encoder of a domain separation network to obtain public features and private features corresponding to each field tag, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier;
decoding a sample consisting of the private characteristics and the public characteristics through a shared decoder of a domain separation network, and performing classification training on a source domain sample after decoding by adopting a classifier to obtain a trained classifier;
classifying the target domain data by adopting the trained classifier to obtain a target classification;
and performing migration countermeasure training by adopting the maximum mean difference as a loss function according to the common characteristics corresponding to the target classification to obtain the self-adaptive acoustic recognition model.
Optionally, the semantic recognition of the acoustic recognition features by the context-based neural language model to obtain a speech recognition result includes:
inputting the acoustic recognition features as an input sequence into a long-short term memory network;
modeling the probability of each vocabulary of the input sequence by adopting a long-short term memory network to obtain the word-level probability of each vocabulary, and determining the fixed size embedding of each vocabulary as a fixed embedding vector based on the word-level probability;
generating a word-by-word embedding vector on the context representation by a context-embedded standard word embedding module with a context-carried neural language model, and embedding by adopting the fixed embedding vector to obtain an enhanced sequence corresponding to the input sequence;
and performing semantic recognition on the enhanced sequence through a mask language model to obtain the voice recognition result.
Optionally, the fusion layer of the masked language model is a Transformer-XL architecture, and the decoder includes a multi-head attention layer with residual connections and fully connected layers.
In order to solve the above technical problem, an embodiment of the present application further provides a speech recognition apparatus, including:
the acoustic preprocessing module is used for performing acoustic preprocessing on a voice signal to be recognized to obtain a plurality of voice audio segments;
the transfer learning module is used for carrying out transfer learning on an initial acoustic recognition model by adopting the human voice audio clip to obtain a self-adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
the acoustic recognition module is used for recognizing the plurality of human voice audio frequency segments according to the self-adaptive acoustic recognition model to obtain acoustic recognition characteristics;
and the semantic recognition module is used for carrying out semantic recognition on the acoustic recognition characteristics based on the context neural language model to obtain a voice recognition result.
Optionally, the acoustic preprocessing module comprises:
the voice extraction submodule is used for extracting voice of the voice signal to be recognized to obtain a first voice signal;
the recognition training sub-module is used for recognizing and training the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and the segmentation alignment submodule is used for segmenting and aligning the second voice signal based on an end-to-end network of CTC to obtain a plurality of voice audio segments.
Optionally, the segmentation alignment sub-module includes:
a signal framing unit, configured to perform framing processing on the second voice signal to obtain a voice frame signal;
a signal alignment unit, configured to determine, for each voice frame signal and based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal;
the segmentation position determining unit is used for determining segmentation positions according to the alignment mode;
and the signal dividing unit is used for dividing the second voice signal through the dividing position to obtain a plurality of voice audio signals.
Optionally, the segmentation position determination unit includes:
a signal identification subunit, configured to identify the voice frame signal based on the end-to-end network of the CTC, so as to obtain a character posterior probability corresponding to the voice frame signal;
the dynamic programming subunit is used for calculating the posterior probabilities of all the characters in a dynamic programming mode to obtain the maximum joint probability;
and the signal alignment subunit is used for determining the alignment mode of the transcribed text corresponding to the speech frame signal and the second human voice signal based on the maximum joint probability.
Optionally, the migration learning module includes:
the acoustic feature extraction sub-module is used for inputting the human voice audio clip into the initial acoustic recognition model to perform acoustic feature extraction to obtain a first acoustic feature;
and the migration countermeasure training sub-module is used for performing migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and a gradient inversion layer to obtain the self-adaptive acoustic recognition model, wherein the gradient inversion layer comprises a DNN structural feature extractor, a multi-element phoneme classifier and a domain classifier based on a learning acoustic model.
Optionally, the migration countermeasure training sub-module comprises:
a feature extraction unit, configured to perform feature extraction on the first acoustic feature and the second acoustic feature respectively based on a DNN structural feature extractor of a learned acoustic model, to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
a first label generating unit, configured to classify phonemes by using the multi-element phoneme classifier according to the first structural feature and the second structural feature, and generate a phoneme label according to a classification result;
a second label generating unit, configured to perform domain classification by using the domain classifier according to the first structural feature and the second structural feature, and generate a domain label according to a classification result;
and the model training unit is used for carrying out migration countermeasure training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the self-adaptive acoustic recognition model.
Optionally, the domain label includes a source domain and a target domain, and the model training unit includes:
the public feature extraction subunit is used for performing public feature extraction and private feature extraction on the first structural features and the second structural features of the tags in different fields by adopting an encoder of a domain separation network to obtain public features and private features corresponding to each field tag, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier;
the classifier training subunit is used for decoding a sample consisting of private features and public features through a shared decoder of a domain separation network, and performing classification training on a source domain sample after decoding by adopting a classifier to obtain a trained classifier, wherein each private feature and each public feature comprise different phoneme labels;
the data classification subunit is used for classifying the target domain data by adopting the trained classifier to obtain a target classification;
and the migration countermeasure training subunit is used for performing migration countermeasure training by adopting the maximum mean difference as a loss function according to the common characteristics corresponding to the target classification to obtain the self-adaptive acoustic recognition model.
Optionally, the semantic recognition module includes:
the characteristic input submodule is used for inputting the acoustic recognition characteristics into the long-term and short-term memory network as an input sequence;
the vector calculation submodule is used for modeling the probability of each vocabulary of the input sequence by adopting a long-term and short-term memory network to obtain the word-level probability of each vocabulary, and determining the fixed size embedding of each vocabulary as a fixed embedding vector based on the word-level probability;
the sequence enhancement submodule is used for generating word-by-word embedded vectors on the context representation based on a context embedded standard word embedding module with a context-carried neural language model, and embedding by adopting the fixed embedded vectors to obtain an enhancement sequence corresponding to the input sequence;
and the semantic recognition submodule is used for carrying out semantic recognition on the enhanced sequence through a mask language model to obtain the voice recognition result.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above speech recognition method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above speech recognition method.
According to the speech recognition method, the speech recognition apparatus, the computer device and the storage medium described above, acoustic preprocessing is performed on a speech signal to be recognized to obtain a plurality of human voice audio segments; the human voice audio segments are used to perform transfer learning on an initial acoustic recognition model to obtain an adaptive acoustic recognition model, the initial acoustic recognition model being an acoustic recognition model of a high-resource language; the plurality of human voice audio segments are recognized with the adaptive acoustic recognition model to obtain acoustic recognition features; and semantic recognition is performed on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result. Speech information of a low-resource language is thereby extracted from a complex environment and semantically recognized, which improves the accuracy of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow diagram of one embodiment of a speech recognition method of the present application;
FIG. 2 is a schematic block diagram of one embodiment of a speech recognition apparatus according to the present application;
FIG. 3 is a block diagram of one embodiment of a computer device according to the present application;
FIG. 4 is a block diagram of an unsupervised domain adaptive baseline scheme with a gradient inversion layer for the speech recognition method of the present application;
fig. 5 is a diagram of a domain separation network structure of an embodiment of a speech recognition method of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 shows a speech recognition method according to an embodiment of the present invention, which is detailed as follows:
s201: and performing acoustic preprocessing on the voice signal to be recognized to obtain a plurality of voice audio segments.
Specifically, noise is almost always present in speech recognition scenarios. Wherever speech recognition is deployed, differences in hardware and environment mean that white noise is inevitably picked up by the microphone. The final recognition result is further affected by the fact that speakers do not articulate according to rules a machine finds easy to understand, for example because of accents and voice differences between speakers. In this embodiment, acoustic preprocessing is performed on the speech signal to be recognized to obtain a plurality of human voice audio segments.
The acoustic preprocessing includes, but is not limited to, applying a filter to the audio for threshold limiting and peak smoothing; a filter suited to the language characteristics may be chosen, and its role is to reduce the white noise introduced by heterogeneous hardware. The cleaner audio output by the filter is then passed to an acoustic pre-training model for the human voice extraction task. Preferably, in this embodiment the Spleeter model is used as the acoustic pre-training model to separate human voice from music.
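By way of a hedged illustration only (not the claimed implementation), the preprocessing stage described above could be prototyped in Python roughly as follows; the decibel threshold, the use of librosa for loading and RMS gating, and the Spleeter "2stems" configuration are assumptions of this sketch rather than requirements of the method.

```python
# Sketch of the acoustic preprocessing stage: decibel gating as a stand-in for
# the threshold-limiting / smoothing filter, followed by vocal separation with
# Spleeter. Parameter values are illustrative assumptions.
import numpy as np
import librosa
from spleeter.separator import Separator

def db_gate(y: np.ndarray, threshold_db: float = -40.0,
            frame_length: int = 2048, hop_length: int = 512) -> np.ndarray:
    """Suppress frames whose RMS energy falls below a decibel threshold."""
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    keep = np.repeat(db > threshold_db, hop_length)
    keep = np.pad(keep, (0, max(0, len(y) - len(keep))))[: len(y)]
    return y * keep.astype(y.dtype)

def extract_vocals(path: str, sr: int = 44100) -> np.ndarray:
    """Load audio, apply the decibel gate, then separate the vocal stem with Spleeter."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    y = db_gate(y)
    separator = Separator("spleeter:2stems")               # vocals + accompaniment
    stems = separator.separate(np.stack([y, y], axis=1))   # Spleeter expects (samples, channels)
    return stems["vocals"].mean(axis=1)                    # mono vocal track

# vocals = extract_vocals("utterance.wav")
```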
Optionally, the acoustic preprocessing is performed on the speech signal to be recognized to obtain a plurality of human voice audio segments, including:
carrying out voice extraction on a voice signal to be recognized to obtain a first voice signal;
carrying out recognition training on the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and based on the CTC end-to-end network, segmenting and aligning the second voice signal to obtain a plurality of voice audio segments.
Optionally, the voice extraction is performed by means of decibel filtering and/or spectrum detection.
Further, the step of segmenting and aligning the second voice signal based on the end-to-end CTC network to obtain a plurality of voice audio segments includes:
framing the second voice signal to obtain a voice frame signal;
for each voice frame signal, determining, based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal;
determining a segmentation position according to an alignment mode;
and segmenting the second voice signal through the segmentation position to obtain a plurality of voice audio signals.
Further, determining, for each voice frame signal and based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal includes:
identifying the voice frame signals based on an end-to-end network of CTC to obtain the character posterior probability corresponding to the voice frame signals;
calculating the posterior probability of all characters by adopting a dynamic programming mode to obtain the maximum joint probability;
and determining the alignment mode of the speech frame signal and the transcribed text corresponding to the second human voice signal based on the maximum joint probability.
Specifically, once the acoustic pre-training model module has finished, the resulting audio is essentially pure human voice, and processing enters the dialogue-act cutting module stage.
The dialogue-act cutting module carries a CTC cutting algorithm: speaking segments are extracted from the speech corpus and the CTC algorithm is applied to produce segmented audio pieces aligned with the corresponding text. Training a speech recognition system requires an alignment between audio and text in order to reduce the complexity of training.
The CTC-based end-to-end cutting and alignment algorithm extracts correct audio-text alignments even when unknown speech fragments exist in the whole audio. It uses a CTC-based end-to-end network that has been trained in advance on data with audio-character alignment. For a given audio, the CTC network generates frame-based character posterior probabilities p_t(c), i.e. the probability that character c is emitted at frame t.
From these posterior probabilities, the maximum joint probability over all possible alignments is computed by dynamic programming, so that each character index j of the transcribed text is aligned to a frame t. The maximum joint probability of a time segment is obtained by taking, in the propagation over the audio timeline, the more probable of two hypotheses: the time point consumes only a blank symbol, or the time point consumes the next character. The transition cost of staying at the first character is set to zero, so that the transcription start point can be aligned to an arbitrary point of the speech file. The character alignment is then calculated by backtracking, starting from the most likely time position of the last character of the transcription; at each step the propagation transition with the highest probability for the time segment is assumed, which determines, for the audio segment, the audio frames and the alignment of the corresponding characters in the text.
Since the algorithm produces, for each audio frame, the probability of being aligned in the given way, a confidence score can be derived for each audio paragraph in order to detect utterances whose speech deviates from the corresponding text. It is computed as follows: the audio frames corresponding to a given pronunciation of the audio training set are first divided into parts of equal length; for each part, the mean of the frame-by-frame alignment probabilities is calculated; and the total probability score of the given utterance is defined as the minimum of these per-part means. For each character position j, only the audio frames within an interval around it are considered, and the cutting position of the speech paragraph is taken as the audio position proportional to the given character position and frame number.
The algorithm uses a CTC-based end-to-end neural network to extract speech segments with precise time alignment, and after cutting out the exact audio position, it has the effect of deriving the natural passage of the natural human utterance, determining the exact start and end positions of a sentence.
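The following Python sketch illustrates, under simplifying assumptions, the kind of frame-level dynamic programming and confidence scoring described above; it assumes the frame-wise character posteriors have already been produced by a CTC network, and it is a simplified monotonic alignment rather than the exact formulation of the embodiment.

```python
# Simplified sketch of CTC-style forced alignment by dynamic programming.
# `log_probs` is a (T, V) array of frame-wise log character posteriors from a
# CTC network; `tokens` is the transcription as vocabulary indices.
import numpy as np

def ctc_align(log_probs: np.ndarray, tokens: list[int], blank: int = 0):
    T, _ = log_probs.shape
    J = len(tokens)
    NEG = -1e30
    # dp[t, j]: best log joint probability of having consumed j characters after frame t.
    dp = np.full((T + 1, J + 1), NEG)
    dp[0, 0] = 0.0
    back = np.zeros((T + 1, J + 1), dtype=np.int8)   # 0 = emitted blank, 1 = consumed next char
    for t in range(1, T + 1):
        for j in range(0, J + 1):
            stay = dp[t - 1, j] + log_probs[t - 1, blank]
            take = dp[t - 1, j - 1] + log_probs[t - 1, tokens[j - 1]] if j > 0 else NEG
            if take > stay:
                dp[t, j], back[t, j] = take, 1
            else:
                dp[t, j], back[t, j] = stay, 0
    # Backtrack from the most likely end frame of the last character.
    frames = [0] * J
    j = J
    t = int(np.argmax(dp[:, J]))
    while j > 0 and t > 0:
        if back[t, j] == 1:
            frames[j - 1] = t - 1
            j -= 1
        t -= 1
    return frames, dp   # frame index assigned to each character, plus the DP table

def confidence(frame_log_probs: np.ndarray, part_len: int = 30) -> float:
    """Confidence of an aligned paragraph: minimum over parts of the mean
    frame-wise alignment probability, as described above."""
    parts = [frame_log_probs[i:i + part_len]
             for i in range(0, len(frame_log_probs), part_len)]
    return float(min(np.exp(p).mean() for p in parts))
```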
After the whole preprocessing module is completed, a plurality of audio clips with time information are obtained.
S202: and (3) carrying out transfer learning on the initial acoustic recognition model by adopting the human voice audio clip to obtain the self-adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of high-resource language.
Specifically, the purpose of unsupervised domain adaptation of the acoustic model is to have the low-resource language share the same acoustic space as the high-resource language. A high-resource language usually has sufficient resources to support a robust machine learning system, and through transfer learning that system can be adapted to the low-resource language.
The method presupposes that the acoustic space spanned by the low-resource language is kept as a subspace of that of the high-resource language; this requires that the phone set of the low-resource language be a subset of that of the high-resource language, that the high-resource language side have sufficient text and transcription data, and that the low-resource language side have limited resources.
Optionally, the initial acoustic recognition model is a hidden markov speech recognition model.
Optionally, migration learning is performed on the initial acoustic recognition model by using a human voice audio clip, so as to obtain an adaptive acoustic recognition model, including:
inputting the voice audio clip into an initial acoustic recognition model for acoustic feature extraction to obtain a first acoustic feature;
carrying out migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature, a first acoustic feature and a gradient inversion layer corresponding to the high-resource language corpus to obtain a self-adaptive acoustic recognition model, wherein the gradient inversion layer comprises a DNN structural feature extractor, a multi-element phoneme classifier and a domain classifier based on the learning acoustic model.
Further, carrying out migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature, a first acoustic feature and a gradient inversion layer corresponding to the high-resource language corpus to obtain an adaptive acoustic recognition model, comprising:
a DNN structural feature extractor based on a learning acoustic model is used for respectively extracting the first acoustic feature and the second acoustic feature to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
aiming at the first structural feature and the second structural feature, adopting a multi-element phoneme classifier to classify phonemes, and generating phoneme labels according to a classification result;
performing domain classification by using a domain classifier aiming at the first structural feature and the second structural feature, and generating a domain label according to a classification result;
and carrying out migration countermeasure training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the self-adaptive acoustic recognition model.
Further, the domain label includes a source domain and a target domain, and the migration countermeasure training is performed on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature, and the obtaining of the adaptive acoustic recognition model includes:
performing public feature extraction and private feature extraction on first structural features and second structural features of tags in different fields by adopting an encoder of a domain separation network to obtain public features and private features corresponding to the tags in each field, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier, and each private feature and each public feature comprise different phoneme tags;
decoding a sample consisting of the private characteristics and the public characteristics through a shared decoder of a domain separation network, and performing classification training on a source domain sample after decoding by adopting a classifier to obtain a trained classifier;
classifying the target domain data by adopting the trained classifier to obtain a target classification;
and performing migration countermeasure training by adopting the maximum mean difference as a loss function according to the common characteristics corresponding to the target classification to obtain the self-adaptive acoustic recognition model.
Specifically, the unsupervised domain-adaptive acoustic modeling method requires constructing a Deep Neural Network (DNN) hidden Markov model and applying it to the speech recognition system, together with an unsupervised domain adaptation framework built around a gradient reversal layer (GRL). After the hidden Markov speech recognition model is constructed, the GRL is used for adversarial training.
In the GRL-based adversarial training, the method learns a domain-invariant feature representation. The network comprises a DNN structural feature extractor based on the learned acoustic model, a senone classifier and a domain classifier, as shown in the block diagram of the unsupervised domain-adaptive baseline scheme with a gradient reversal layer (GRL) in FIG. 4.
Here θ_f, θ_y and θ_d represent the parameters of the feature extractor, the senone classifier and the domain classifier, respectively. The feature extractor M_f maps the input acoustic features x to an internal representation f = M_f(x; θ_f). The senone classifier M_y maps the output of the feature extractor to senone labels, while the domain classifier M_d maps it to domain labels. By optimizing the feature extractor M_f and the senone classifier M_y, the network is trained to minimize the senone classification loss during the training phase; this enables the network to find a function that can distinguish the senone labels. In order to make the features domain-invariant, the parameters of the feature extractor are optimized to maximize the domain classification loss, while at the same time the domain classifier M_d is optimized to minimize that classification loss. This is achieved by introducing the GRL between the feature extractor M_f and the domain classifier M_d: the GRL acts as the identity in the forward pass and reverses the gradient coming from the subsequent layers in the backward pass. During inference, the domain classifier and the GRL are ignored; the acoustic feature vector passes through the feature extractor and the senone classifier, which predicts the senone label.
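By way of illustration, a minimal PyTorch sketch of a gradient reversal layer together with the feature extractor, senone classifier and domain classifier heads is given below; layer sizes, the reversal coefficient and the class counts are assumptions of the sketch, not values prescribed by the embodiment.

```python
# Minimal PyTorch sketch of a gradient reversal layer (GRL) and the
# feature-extractor / senone-classifier / domain-classifier heads.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)                              # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None            # reversed gradient in the backward pass

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

class AdversarialAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, n_senones=3000, n_domains=2):
        super().__init__()
        self.extractor = nn.Sequential(                  # DNN structural feature extractor M_f
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.senone_head = nn.Linear(hidden, n_senones)  # senone classifier M_y
        self.domain_head = nn.Linear(hidden, n_domains)  # domain classifier M_d

    def forward(self, x, lambd: float = 1.0):
        f = self.extractor(x)
        senone_logits = self.senone_head(f)
        domain_logits = self.domain_head(grad_reverse(f, lambd))  # GRL before M_d
        return senone_logits, domain_logits

# Training step sketch: senone loss on labeled source frames, domain loss on both domains:
# loss = ce(senone_logits_src, senone_labels) + ce(domain_logits, domain_labels)
```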
All domains have both shared (public) features and private features; if the private features of each domain are also transferred, negative transfer occurs.
To address this, the method employs a Domain Separation Network (DSN). The underlying assumption is that of unsupervised transfer learning, that is, the target-domain data is unlabeled while the source-domain data is labeled. The main work of the domain separation network is twofold: separating the private features of each domain from the shared ones, and extracting the features common to the different domains and performing the transfer with these common features. The overall network structure is shown in fig. 5, which is the structural diagram of the domain separation network of this embodiment.
The main structure of the domain separation network resembles an autoencoder. The whole network can be divided into the following five parts:
Private Target Encoder E_p^t: the target-domain private encoder, used to extract the private features of the target domain.
Private Source Encoder E_p^s: the source-domain private encoder, used to extract the private features of the source domain.
Shared Encoder E_c: the shared encoder, used to extract the features common to the source domain and the target domain.
Shared Decoder D: the shared decoder, used to decode samples composed of the private features and the public features.
Classifier G: the classifier, used to classify the source-domain samples during training; after training it can be applied directly to classify the target-domain data.
Starting from the encoder side, a source-domain sample x^s first enters the private source encoder E_p^s and the shared encoder E_c, and the two encoders output h_p^s and h_c^s respectively; these two outputs correspond to the private and the public features of the source-domain data. Similarly, h_p^t and h_c^t correspond to the private and the public features of the target-domain data. From the principle above it follows that, for the domain separation network to work well, the public part must consist of features genuinely common to the source domain and the target domain, and the public and private parts must be completely separated in order to avoid negative transfer. Two loss functions are used to constrain E_p^t, E_p^s and E_c.
The first is the difference loss L_difference, which constrains each encoder so that the private and public representations remain separated. In addition, the domain of a sample should not be predictable from its shared representation: while the parameters of the domain classifier are trained to minimize the domain classification loss, the shared encoder is trained through the GRL module to maximize it. Once this objective is satisfied, the transfer can be carried out, that is, h_c^s and h_c^t are matched and their similarity is increased.
The second loss function is the Maximum Mean Discrepancy (MMD), which is frequently used in transfer learning; it measures the magnitude of the difference between h_c^s and h_c^t. While ensuring that h_p^s and h_p^t are completely separated from h_c^s and h_c^t, and that the distributions of h_c^s and h_c^t become approximately equal, the integrity of the source-domain and target-domain data must also be preserved. The encoding-decoding structure guarantees both the extraction of the public and private features and the completeness and validity of those features, through a reconstruction loss L_recon computed as a mean squared error between each input sample and its reconstruction:
L_recon = (1/N) Σ_i ||x_i − x̂_i||²
For a source-domain sample, the Shared Decoder D decodes the combined features to obtain the reconstruction x̂^s, and the reconstruction loss is minimized during training. As the source-domain sample passes into the encoders and out of the decoder, the two encoders extract the private and the public features respectively, while the decoder requires the combination of public and private features to reassemble a complete source-domain sample, which guarantees the integrity of the features. The same holds when a target-domain sample enters the network.
Finally, this encoding-decoding unsupervised procedure extracts the public and private features of the source and target domains, but the samples still have to be classified, so a classifier G is added. In the preceding steps the autoencoder has already extracted the part common to the source and target domains and brought the two close in distribution by minimizing the similarity loss. A classifier that is valid on the common part of the source domain is therefore also valid on the common part of the target domain, so the classifier only needs to be trained with the labeled source-domain data, and once training is complete it can be applied directly to the target domain.
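A compressed PyTorch sketch of how the loss terms described above (source-domain classification loss, MMD similarity loss, difference loss and reconstruction loss) might be combined is given below; the linear-kernel MMD, the orthogonality-based difference loss and the weighting coefficients are illustrative assumptions rather than details fixed by the embodiment.

```python
# Sketch of the domain separation network loss terms described above.
# Encoders, decoder and classifier are assumed to be defined elsewhere;
# weights alpha/beta/gamma and the linear-kernel MMD are illustrative.
import torch
import torch.nn.functional as F

def difference_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    """Encourage orthogonality between shared and private representations."""
    s = F.normalize(shared, dim=1)
    p = F.normalize(private, dim=1)
    return (s.t() @ p).pow(2).sum()

def mmd_loss(h_src: torch.Tensor, h_tgt: torch.Tensor) -> torch.Tensor:
    """Maximum mean discrepancy with a linear kernel (illustrative choice)."""
    return (h_src.mean(0) - h_tgt.mean(0)).pow(2).sum()

def dsn_loss(x_src, y_src, x_tgt, enc_shared, enc_priv_src, enc_priv_tgt,
             decoder, classifier, alpha=0.1, beta=0.1, gamma=1.0):
    hc_s, hc_t = enc_shared(x_src), enc_shared(x_tgt)        # shared (public) features
    hp_s, hp_t = enc_priv_src(x_src), enc_priv_tgt(x_tgt)    # private features

    # 1. classification loss on labeled source-domain samples only
    cls = F.cross_entropy(classifier(hc_s), y_src)
    # 2. similarity loss: align shared source/target features (MMD)
    sim = mmd_loss(hc_s, hc_t)
    # 3. difference loss: keep shared and private parts separated
    diff = difference_loss(hc_s, hp_s) + difference_loss(hc_t, hp_t)
    # 4. reconstruction loss: shared + private must rebuild the input (MSE)
    rec = F.mse_loss(decoder(hc_s + hp_s), x_src) + F.mse_loss(decoder(hc_t + hp_t), x_tgt)

    return cls + alpha * sim + beta * diff + gamma * rec
```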
S203: and identifying a plurality of human voice audio frequency segments according to the self-adaptive acoustic identification model to obtain acoustic identification characteristics.
S204: and performing semantic recognition on the acoustic recognition features based on the context neural language model to obtain a voice recognition result.
Optionally, semantically recognizing the acoustic recognition features by the context-based neural language model, and obtaining a speech recognition result includes:
inputting the acoustic recognition features as an input sequence into a long-term and short-term memory network;
modeling the probability of each vocabulary of the input sequence by adopting a long-term and short-term memory network to obtain the word-level probability of each vocabulary, and determining the fixed size embedding of each vocabulary as a fixed embedding vector based on the word-level probability;
generating a word-by-word embedding vector on the context representation by a context-embedded standard word embedding module with a context-carried neural language model, and embedding by adopting a fixed embedding vector to obtain an enhanced sequence corresponding to an input sequence;
and performing semantic recognition on the enhanced sequence through a mask language model to obtain a voice recognition result.
Optionally, the fusion layer of the masked language model is a Transformer-XL architecture, and the decoder includes a multi-head attention layer with residual connections and fully connected layers.
Specifically, the probability of each word of the input sequence in the speech recognition system is first modeled by an LSTM-based neural language model. The end result is a fixed-size embedding for each word: the LSTM output is projected to word-level outputs, which are then converted into the final word-level probabilities by a softmax layer.
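A minimal PyTorch sketch of such an LSTM word-level language model is shown below; the vocabulary size and layer dimensions are assumed values for illustration only.

```python
# Minimal LSTM word-level language model sketch; sizes are illustrative.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # fixed-size embedding per word
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)      # project LSTM output to word level

    def forward(self, tokens: torch.Tensor):
        h, _ = self.lstm(self.embed(tokens))               # (batch, seq, hidden)
        return torch.log_softmax(self.proj(h), dim=-1)     # word-level probabilities

# log_probs = LSTMLanguageModel()(torch.randint(0, 10000, (2, 16)))
```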
Next, feature enhancement based on context embeddings is performed. A neural language model with Context Carrying (CCO) built on context embeddings tends either to concentrate too heavily on the nearby context or to understand the distant context only vaguely; the method therefore enhances the input features so that several kinds of context embedding can be used. First, average word embedding is performed: a standard word embedding module generates word-by-word embedding vectors over the context representation. Then a fixed-length embedding is obtained by averaging these embedding vectors over the whole context sequence.
Next, training is performed with a Transformer-XL model architecture carrying a masked language model (MLM) fusion layer. When a new segment is processed, the previous segment is reused as extended context. The final model consists of a stack of decoder modules, each composed of a multi-head attention (MHA) layer with residual connections and a set of fully connected (FC) layers, followed by linear and softmax layers. The MHA layer uses an upper-triangular mask to guarantee unidirectionality during decoding. The MLM is then fused with the Transformer-XL.
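A rough sketch of one such decoder module, with multi-head attention, residual connections, fully connected layers and an upper-triangular causal mask, is given below; the dimensions are assumed, and the Transformer-XL segment-level recurrence is omitted for brevity.

```python
# Sketch of a single decoder block with causal (upper-triangular) masking;
# the Transformer-XL segment recurrence is omitted for brevity.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        attn, _ = self.mha(x, x, x, attn_mask=causal)   # upper-triangular mask: unidirectional
        x = self.norm1(x + attn)                         # residual connection
        return self.norm2(x + self.ff(x))                # residual + fully connected layers
```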
The fusion of the semantic embedding derived from the masked language model (MLM) with the Transformer-XL can be divided into early fusion and late fusion. Early Fusion combines the input modalities by joining the input word embedding and the BERT-derived embedding at each time frame; in this fused form, the hidden states from the last layer of the Transformer-XL decoder are connected to the BERT-derived embedding through a connection layer. Late Fusion training regulates the flow of information between the Transformer-XL and the BERT-derived embeddings; in this embedding domain, the output of the fusion layer is passed to a linear layer, which in turn predicts the next word in the sequence through a softmax.
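The late-fusion variant could be sketched as follows; the sigmoid gating used to regulate the information flow and the dimensions are assumptions of this example, not details stated in the embodiment.

```python
# Late-fusion sketch: fuse the decoder hidden state with an MLM/BERT-derived
# embedding, then project to the vocabulary. The gating mechanism is an
# illustrative choice, not prescribed by the description above.
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, d_model=512, d_bert=768, vocab_size=10000):
        super().__init__()
        self.fuse = nn.Linear(d_model + d_bert, d_model)   # fusion layer
        self.gate = nn.Linear(d_model + d_bert, d_model)   # regulates information flow
        self.out = nn.Linear(d_model, vocab_size)          # linear layer before softmax

    def forward(self, hidden: torch.Tensor, bert_embed: torch.Tensor):
        joint = torch.cat([hidden, bert_embed], dim=-1)
        g = torch.sigmoid(self.gate(joint))
        fused = g * torch.tanh(self.fuse(joint)) + (1 - g) * hidden
        return torch.log_softmax(self.out(fused), dim=-1)  # next-word probabilities
```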
The effect of this model is that it adapts effectively to the rich semantic information concentrated in the BERT embeddings generated from the speech modality provided by the user, and, by combining context-carried information with these features, it can significantly improve the neural language model in a task-oriented dialog system.
In this embodiment, acoustic preprocessing is performed on the speech signal to be recognized to obtain a plurality of human voice audio segments; the human voice audio segments are used to perform transfer learning on an initial acoustic recognition model of a high-resource language to obtain an adaptive acoustic recognition model; the plurality of human voice audio segments are recognized with the adaptive acoustic recognition model to obtain acoustic recognition features; and semantic recognition is performed on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result. Speech information of a low-resource language is thereby extracted from a complex environment and semantically recognized, improving the accuracy of speech recognition.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 2 shows a schematic block diagram of a speech recognition apparatus in one-to-one correspondence with the speech recognition methods of the above-described embodiments. As shown in fig. 2, the speech recognition apparatus includes an acoustic preprocessing module 31, a migration learning module 32, an acoustic recognition module 33, and a semantic recognition module 34. The detailed description of each functional module is as follows:
the acoustic preprocessing module 31 is configured to perform acoustic preprocessing on a voice signal to be recognized to obtain a plurality of voice audio segments;
the transfer learning module 32 is configured to perform transfer learning on the initial acoustic recognition model by using the human voice audio clip to obtain an adaptive acoustic recognition model, where the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
the acoustic recognition module 33 is configured to recognize the plurality of human voice audio segments according to the adaptive acoustic recognition model to obtain acoustic recognition features;
and the semantic recognition module 34 is configured to perform semantic recognition on the acoustic recognition features based on the contextual neural language model to obtain a speech recognition result.
Optionally, the acoustic preprocessing module 31 includes:
the voice extraction submodule is used for extracting voice of the voice signal to be recognized to obtain a first voice signal;
the recognition training submodule is used for recognizing and training the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and the segmentation alignment submodule is used for segmenting and aligning the second voice signal based on the CTC end-to-end network to obtain a plurality of voice audio segments.
Optionally, the partitioning alignment sub-module comprises:
the signal framing unit is used for framing the second voice signal to obtain a voice frame signal;
the signal alignment unit is used for determining, for each voice frame signal and based on the CTC end-to-end network, the alignment between the voice frame signal and the transcribed text corresponding to the second human voice signal;
a division position determining unit for determining a division position according to the alignment mode;
and the signal dividing unit is used for dividing the second voice signal through the dividing position to obtain a plurality of voice audio signals.
Optionally, the segmentation position determination unit includes:
the signal identification subunit is used for identifying the voice frame signals based on an end-to-end network of CTC to obtain the character posterior probability corresponding to the voice frame signals;
the dynamic programming subunit is used for calculating the posterior probabilities of all the characters in a dynamic programming mode to obtain the maximum joint probability;
and the signal alignment subunit is used for determining the alignment mode of the speech frame signal and the transcribed text corresponding to the second human voice signal based on the maximum joint probability.
Optionally, the migration learning module 32 includes:
the acoustic feature extraction submodule is used for inputting the human voice audio clip into the initial acoustic recognition model to carry out acoustic feature extraction so as to obtain a first acoustic feature;
and the migration countermeasure training sub-module is used for performing migration countermeasure training on the initial acoustic recognition model by adopting a second acoustic feature, a first acoustic feature and a gradient inversion layer corresponding to the high-resource language corpus to obtain a self-adaptive acoustic recognition model, wherein the gradient inversion layer comprises a DNN structural feature extractor, a multi-element phoneme classifier and a domain classifier based on the learning acoustic model.
Optionally, the transfer adversarial training submodule comprises:
the feature extraction unit is used for performing feature extraction on the first acoustic feature and the second acoustic feature respectively by using the DNN structural feature extractor based on the learned acoustic model to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
the first label generating unit is used for performing phoneme classification on the first structural feature and the second structural feature by adopting the multi-element phoneme classifier, and generating phoneme labels according to the classification result;
the second label generating unit is used for performing domain classification on the first structural feature and the second structural feature by adopting the domain classifier, and generating a domain label according to the classification result;
and the model training unit is used for performing transfer adversarial training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the adaptive acoustic recognition model, as illustrated in the sketch following this list.
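For intuition only, the following PyTorch sketch shows how a gradient reversal layer can couple a DNN structural feature extractor with a multi-element phoneme classifier and a domain classifier for transfer adversarial training. The layer sizes, the equal loss weighting and the helper names (GradReverse, AdversarialAcousticModel, train_step) are assumptions made for this illustration and are not taken from the patent.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_phones=100, lam=0.1):
        super().__init__()
        self.lam = lam
        self.extractor = nn.Sequential(                # DNN structural feature extractor
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.phone_clf = nn.Linear(hidden, n_phones)   # multi-element phoneme classifier
        self.domain_clf = nn.Linear(hidden, 2)         # source domain vs. target domain

    def forward(self, x):
        h = self.extractor(x)
        return self.phone_clf(h), self.domain_clf(GradReverse.apply(h, self.lam))

def train_step(model, optimizer, src_feat, src_phones, tgt_feat):
    """One adversarial update: phoneme loss on source data, domain loss on both domains."""
    ce = nn.CrossEntropyLoss()
    phone_logits_s, dom_logits_s = model(src_feat)   # second acoustic features (high-resource source)
    _, dom_logits_t = model(tgt_feat)                # first acoustic features (target segments)
    dom_labels = torch.cat([torch.zeros(len(src_feat), dtype=torch.long),
                            torch.ones(len(tgt_feat), dtype=torch.long)])
    loss = ce(phone_logits_s, src_phones) + ce(torch.cat([dom_logits_s, dom_logits_t]), dom_labels)
    optimizer.zero_grad()
    loss.backward()   # the reversal layer flips the domain gradient seen by the extractor
    optimizer.step()
    return loss.item()
```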
Optionally, the domain label includes a source domain and a target domain, and the model training unit includes:
the common feature extraction subunit is used for performing common feature extraction and private feature extraction on the first structural features and the second structural features with different domain labels by adopting the encoders of a domain separation network to obtain the common features and private features corresponding to each domain label, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier;
the classifier training subunit is used for decoding samples consisting of the private features and the common features through the shared decoder of the domain separation network, and performing classification training on the decoded source domain samples by adopting the classifier to obtain a trained classifier, wherein each private feature and each common feature comprise different phoneme labels;
the data classification subunit is used for classifying the target domain data by adopting the trained classifier to obtain a target classification;
and the transfer adversarial training subunit is used for performing transfer adversarial training on the common features corresponding to the target classification by adopting the maximum mean discrepancy as the loss function to obtain the adaptive acoustic recognition model, as illustrated in the sketch below.
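The domain separation and maximum mean discrepancy (MMD) ideas can likewise be pictured with a compact, assumption-laden sketch: shared and private encoders produce common and private features, a shared decoder reconstructs the inputs, a classifier is applied on the source side, and an MMD term pulls the source and target common features together. To keep the example short, the classifier here operates on the source common features directly rather than on decoded samples; that simplification, together with the dimensions, kernel bandwidths and class names, is an illustrative choice and not the patented implementation.

```python
import torch
import torch.nn as nn

def mmd_loss(x, y, bandwidths=(1.0, 2.0, 4.0)):
    """A simple (biased) multi-kernel MMD estimate between two batches of features."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2 * bw ** 2)) for bw in bandwidths)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

class DomainSeparationNet(nn.Module):
    def __init__(self, dim=256, hidden=128, n_phones=100):
        super().__init__()
        self.shared_enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())   # common features
        self.src_private = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())  # source-private encoder
        self.tgt_private = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())  # target-private encoder
        self.decoder = nn.Linear(2 * hidden, dim)                            # shared decoder
        self.classifier = nn.Linear(hidden, n_phones)                        # phoneme-label classifier

    def forward(self, src_x, tgt_x):
        src_c, tgt_c = self.shared_enc(src_x), self.shared_enc(tgt_x)
        src_p, tgt_p = self.src_private(src_x), self.tgt_private(tgt_x)
        recon_src = self.decoder(torch.cat([src_c, src_p], dim=-1))   # reconstruct source samples
        recon_tgt = self.decoder(torch.cat([tgt_c, tgt_p], dim=-1))   # reconstruct target samples
        logits_src = self.classifier(src_c)                           # classification on the source side
        align = mmd_loss(src_c, tgt_c)                                # MMD on the common features
        return logits_src, recon_src, recon_tgt, align
```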
Optionally, the semantic recognition module 34 includes:
the feature input submodule is used for inputting the acoustic recognition features as an input sequence into a long short-term memory network;
the vector calculation submodule is used for modeling the probability of each word in the input sequence by adopting the long short-term memory network to obtain the word-level probability of each word, and determining a fixed-size embedding of each word as a fixed embedding vector based on the word-level probability;
the sequence enhancement submodule is used for generating word-by-word embedding vectors over the context representation through the context-embedding standard word embedding module of the contextual neural language model, and performing embedding with the fixed embedding vectors to obtain an enhanced sequence corresponding to the input sequence;
and the semantic recognition submodule is used for performing semantic recognition on the enhanced sequence through the masked language model to obtain the speech recognition result (an illustrative sketch follows).
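To make this flow concrete, the sketch below uses an LSTM to model word-level probabilities over the acoustic hypothesis, concatenates a fixed word embedding with the word-by-word contextual state to form the enhanced sequence, and rescores it with a stand-in fusion layer built from multi-head attention with a residual connection (loosely echoing the Transformer-XL style decoder of claim 10). The vocabulary size, dimensions and module names are assumptions for illustration only, not the patented model.

```python
import torch
import torch.nn as nn

class ContextualRescorer(nn.Module):
    def __init__(self, vocab=1000, fixed_dim=64, ctx_dim=64, heads=4):
        super().__init__()
        self.fixed_emb = nn.Embedding(vocab, fixed_dim)            # fixed-size word embeddings
        self.lstm = nn.LSTM(fixed_dim, ctx_dim, batch_first=True)  # word-level probability model
        self.word_proj = nn.Linear(ctx_dim, vocab)
        # Stand-in fusion layer: multi-head attention with a residual connection.
        self.attn = nn.MultiheadAttention(fixed_dim + ctx_dim, heads, batch_first=True)
        self.out = nn.Linear(fixed_dim + ctx_dim, vocab)

    def forward(self, token_ids):
        fixed = self.fixed_emb(token_ids)                  # (B, T, fixed_dim) fixed embedding vectors
        ctx, _ = self.lstm(fixed)                          # word-by-word contextual states
        word_probs = torch.softmax(self.word_proj(ctx), dim=-1)   # word-level probabilities
        enhanced = torch.cat([fixed, ctx], dim=-1)         # enhanced sequence
        fused, _ = self.attn(enhanced, enhanced, enhanced)
        fused = fused + enhanced                           # residual connection
        return word_probs, torch.log_softmax(self.out(fused), dim=-1)

if __name__ == "__main__":
    model = ContextualRescorer()
    hyp = torch.randint(0, 1000, (1, 7))                   # a 7-token acoustic hypothesis
    word_probs, rescored = model(hyp)
    print(word_probs.shape, rescored.shape)
```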
For the specific limitations of the speech recognition apparatus, reference may be made to the above limitations of the speech recognition method, which are not repeated here. Each module in the above speech recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the above modules.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Referring specifically to fig. 3, fig. 3 is a block diagram of the basic structure of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43, which are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the components memory 41, processor 42 and network interface 43 is shown, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device or the like.
The memory 41 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., an SD or DX memory card), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or an internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the computer device 4. Of course, the memory 41 may also include both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is generally used for storing the operating system installed on the computer device 4 and various types of application software, such as the program code of the speech recognition method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be, in some embodiments, a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program code stored in the memory 41 or to process data, for example to run the program code of the speech recognition method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing a computer program, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the speech recognition method as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and the appended drawings show preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of the features therein. All equivalent structures made by using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (13)

1. A speech recognition method, characterized in that the speech recognition method comprises:
performing acoustic preprocessing on a voice signal to be recognized to obtain a plurality of human voice audio segments;
performing transfer learning on an initial acoustic recognition model by adopting the human voice audio segments to obtain an adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
recognizing the plurality of human voice audio segments according to the adaptive acoustic recognition model to obtain acoustic recognition features;
performing semantic recognition on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result;
wherein the performing transfer learning on the initial acoustic recognition model by adopting the human voice audio segments to obtain the adaptive acoustic recognition model comprises:
inputting the human voice audio segments into the initial acoustic recognition model for acoustic feature extraction to obtain a first acoustic feature;
and performing transfer adversarial training on the initial acoustic recognition model by adopting a second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and a gradient reversal layer to obtain the adaptive acoustic recognition model, wherein the gradient reversal layer comprises a DNN structural feature extractor based on a learned acoustic model, a multi-element phoneme classifier and a domain classifier.
2. The speech recognition method of claim 1, wherein the performing acoustic preprocessing on the voice signal to be recognized to obtain a plurality of human voice audio segments comprises:
performing human voice extraction on the voice signal to be recognized to obtain a first human voice signal;
performing recognition training on the first human voice signal by adopting an acoustic pre-training model to obtain a second human voice signal;
and segmenting and aligning the second human voice signal based on the CTC end-to-end network to obtain the plurality of human voice audio segments.
3. The speech recognition method of claim 2, wherein the human voice extraction is performed by means of decibel filtering and/or spectral detection.
4. The speech recognition method of claim 2, wherein the segmenting and aligning the second human voice signal based on the CTC end-to-end network to obtain the plurality of human voice audio segments comprises:
performing framing processing on the second human voice signal to obtain voice frame signals;
for each voice frame signal, determining an alignment mode between the voice frame signal and the transcribed text corresponding to the second human voice signal based on the CTC end-to-end network;
determining a segmentation position according to the alignment mode;
and segmenting the second human voice signal at the segmentation position to obtain the plurality of human voice audio segments.
5. The speech recognition method of claim 4, wherein the determining, for each voice frame signal, the alignment mode between the voice frame signal and the transcribed text corresponding to the second human voice signal based on the CTC end-to-end network comprises:
recognizing the voice frame signal based on the CTC end-to-end network to obtain the character posterior probabilities corresponding to the voice frame signal;
calculating the maximum joint probability over all the character posterior probabilities by dynamic programming;
and determining the alignment mode between the voice frame signal and the transcribed text corresponding to the second human voice signal based on the maximum joint probability.
6. A speech recognition method according to any one of claims 1 to 5, wherein the initial acoustic recognition model is a hidden Markov speech recognition model.
7. The speech recognition method of claim 1, wherein the performing transfer adversarial training on the initial acoustic recognition model by adopting the second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and the gradient reversal layer to obtain the adaptive acoustic recognition model comprises:
performing feature extraction on the first acoustic feature and the second acoustic feature respectively by using the DNN structural feature extractor based on the learned acoustic model to obtain a first structural feature corresponding to the first acoustic feature and a second structural feature corresponding to the second acoustic feature;
performing phoneme classification on the first structural feature and the second structural feature by adopting the multi-element phoneme classifier, and generating phoneme labels according to the classification result;
performing domain classification on the first structural feature and the second structural feature by adopting the domain classifier, and generating a domain label according to the classification result;
and performing transfer adversarial training on the initial acoustic recognition model based on the domain label, the phoneme label, the first structural feature and the second structural feature to obtain the adaptive acoustic recognition model.
8. The speech recognition method of claim 7, wherein the domain labels comprise a source domain and a target domain, and the performing transfer adversarial training on the initial acoustic recognition model based on the domain labels, the phoneme labels, the first structural features and the second structural features to obtain the adaptive acoustic recognition model comprises:
performing common feature extraction and private feature extraction on the first structural features and the second structural features with different domain labels by adopting the encoders of a domain separation network to obtain the common features and private features corresponding to each domain label, wherein the domain separation network comprises a target domain private encoder, a source domain private encoder, a shared decoder and a classifier, and each private feature and each common feature comprise different phoneme labels;
decoding samples consisting of the private features and the common features through the shared decoder of the domain separation network, and performing classification training on the decoded source domain samples by adopting the classifier to obtain a trained classifier;
classifying the target domain data by adopting the trained classifier to obtain a target classification;
and performing transfer adversarial training on the common features corresponding to the target classification by adopting the maximum mean discrepancy as the loss function to obtain the adaptive acoustic recognition model.
9. The speech recognition method of any one of claims 1 to 5, wherein the performing semantic recognition on the acoustic recognition features based on the contextual neural language model to obtain the speech recognition result comprises:
inputting the acoustic recognition features as an input sequence into a long short-term memory network;
modeling the probability of each word in the input sequence by adopting the long short-term memory network to obtain the word-level probability of each word, and determining a fixed-size embedding of each word as a fixed embedding vector based on the word-level probability;
generating word-by-word embedding vectors over the context representation through the context-embedding standard word embedding module of the contextual neural language model, and performing embedding with the fixed embedding vectors to obtain an enhanced sequence corresponding to the input sequence;
and performing semantic recognition on the enhanced sequence through a masked language model to obtain the speech recognition result.
10. The speech recognition method of claim 9, wherein the fusion layer of the masked language model adopts a Transformer-XL architecture, and the decoder comprises a multi-head attention layer with residual connections and fully connected layers.
11. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises:
the acoustic preprocessing module is used for performing acoustic preprocessing on a voice signal to be recognized to obtain a plurality of human voice audio segments;
the transfer learning module is used for performing transfer learning on an initial acoustic recognition model by adopting the human voice audio segments to obtain an adaptive acoustic recognition model, wherein the initial acoustic recognition model is an acoustic recognition model of a high-resource language;
the acoustic recognition module is used for recognizing the plurality of human voice audio segments according to the adaptive acoustic recognition model to obtain acoustic recognition features;
the semantic recognition module is used for performing semantic recognition on the acoustic recognition features based on a contextual neural language model to obtain a speech recognition result;
wherein the transfer learning module comprises:
the acoustic feature extraction submodule is used for inputting the human voice audio segments into the initial acoustic recognition model to perform acoustic feature extraction to obtain a first acoustic feature;
and the transfer adversarial training submodule is used for performing transfer adversarial training on the initial acoustic recognition model by adopting a second acoustic feature corresponding to the high-resource language corpus, the first acoustic feature and a gradient reversal layer to obtain the adaptive acoustic recognition model, wherein the gradient reversal layer comprises a DNN structural feature extractor based on a learned acoustic model, a multi-element phoneme classifier and a domain classifier.
12. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech recognition method according to any one of claims 1 to 10 when executing the computer program.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the speech recognition method according to any one of claims 1 to 10.
CN202210916222.2A 2022-08-01 2022-08-01 Voice recognition method, device, equipment and medium Active CN114999463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210916222.2A CN114999463B (en) 2022-08-01 2022-08-01 Voice recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210916222.2A CN114999463B (en) 2022-08-01 2022-08-01 Voice recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114999463A CN114999463A (en) 2022-09-02
CN114999463B true CN114999463B (en) 2022-11-15

Family

ID=83021277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210916222.2A Active CN114999463B (en) 2022-08-01 2022-08-01 Voice recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114999463B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233540B (en) * 2023-03-10 2024-04-02 北京富通亚讯网络信息技术有限公司 Parallel signal processing method and system based on video image recognition

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107481717A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of acoustic training model method and system
CN108682417A (en) * 2018-05-14 2018-10-19 中国科学院自动化研究所 Small data Speech acoustics modeling method in speech recognition
CN109003601A (en) * 2018-08-31 2018-12-14 北京工商大学 A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN109326282A (en) * 2018-10-10 2019-02-12 内蒙古工业大学 A kind of small-scale corpus DNN-HMM acoustics training structure
WO2021183229A1 (en) * 2020-03-13 2021-09-16 Microsoft Technology Licensing, Llc Cross-speaker style transfer speech synthesis
CN113574595A (en) * 2019-03-25 2021-10-29 三菱电机株式会社 System and method for end-to-end speech recognition with triggered attention

Also Published As

Publication number Publication date
CN114999463A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN107305768B (en) Error-prone character calibration method in voice interaction
US10074363B2 (en) Method and apparatus for keyword speech recognition
CN106683677B (en) Voice recognition method and device
CN110706690A (en) Speech recognition method and device
WO2017127296A1 (en) Analyzing textual data
CN110634469B (en) Speech signal processing method and device based on artificial intelligence and storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN110853628A (en) Model training method and device, electronic equipment and storage medium
JP2023545988A (en) Transformer transducer: One model that combines streaming and non-streaming speech recognition
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN112562640A (en) Multi-language speech recognition method, device, system and computer readable storage medium
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN112331229A (en) Voice detection method, device, medium and computing equipment
CN114999463B (en) Voice recognition method, device, equipment and medium
CN112331207A (en) Service content monitoring method and device, electronic equipment and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN114254587A (en) Topic paragraph dividing method and device, electronic equipment and storage medium
Chadha et al. Current Challenges and Application of Speech Recognition Process using Natural Language Processing: A Survey
CN116052655A (en) Audio processing method, device, electronic equipment and readable storage medium
CN112037772B (en) Response obligation detection method, system and device based on multiple modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant