CN114550702A - Voice recognition method and device - Google Patents

Voice recognition method and device

Info

Publication number
CN114550702A
Authority
CN
China
Prior art keywords
training
audio data
data sample
sample
network
Prior art date
Legal status
Pending
Application number
CN202210312961.0A
Other languages
Chinese (zh)
Inventor
雪巍
范璐
丁国宏
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210312961.0A
Publication of CN114550702A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/0631 Creating reference templates; Clustering

Abstract

The invention discloses a voice recognition method and device, and relates to the technical field of computers. One embodiment of the method comprises: extracting pre-training features corresponding to an unlabeled first audio data sample through a feature extraction network, and obtaining a normalized weight vector of the phonemes of the first audio data sample through a feature mapping network based on the pre-training features; and training a speech recognition model by taking the normalized weight vector as the training target corresponding to the first audio data sample and the label of a labeled second audio data sample as the training target corresponding to the second audio data sample, and performing speech recognition with the trained speech recognition model. This implementation can address the data-dependence and speech-characterization problems of speech recognition, effectively use the unlabeled audio data in speech recognition products to improve recognition performance, reduce manual labeling cost, and overcome the prior-art problems of ignoring speech phase information and of limited capability for modeling complex speech characteristics.

Description

Voice recognition method and device
Technical Field
The invention relates to the technical field of computers, in particular to a voice recognition method and a voice recognition device.
Background
Speech recognition technology aims to convert speech audio signals into text. Combining the recognition result with technologies such as natural language understanding and multi-modal fusion makes human-machine interaction possible. Current speech recognition systems usually adopt a supervised training scheme: the collected audio data are labeled manually, and a classifier for speech recognition is trained on the original audio data and features with the text labels as the final target. The commonly used speech recognition techniques fall into two broad categories. One is the hybrid framework based on hidden Markov models and deep neural networks (HMM-DNN), which is divided into an acoustic model and a language model; during recognition, a decoding algorithm performs a Viterbi search to obtain the optimal sequence as the decoding output. The other category is end-to-end neural network designs, in which the optimization target is defined by the CTC (Connectionist Temporal Classification) criterion so that the neural network outputs the recognized characters directly from the original audio features.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the cost of manual labeling is usually high, it is time-consuming, and the labeling quality must be checked, so it is not suitable for training ultra-large-scale speech recognition; the large amount of unlabeled audio data generated every day by existing speech recognition products cannot be used effectively; and traditional speech recognition systems based on features such as MFCC (Mel-frequency cepstral coefficients) and FBANK (filter-bank features) ignore speech phase information, and such features, extracted on the basis of a simplified filter theory, have a certain deficiency in the capability of modeling complex speech characteristics.
Disclosure of Invention
In view of this, embodiments of the present invention provide a speech recognition method and apparatus. They can address the problems of data dependence and speech characterization that speech recognition faces across service fields and application scenarios, effectively use the large amount of unlabeled audio data in existing speech recognition products to improve recognition performance, reduce manual labeling cost and labeling time while improving labeling accuracy, are suitable for training ultra-large-scale speech recognition, and overcome the prior-art problems of ignoring speech phase information and of limited capability for modeling complex speech characteristics.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a speech recognition method.
A speech recognition method, comprising: extracting pre-training features corresponding to a first audio data sample without labels through a feature extraction network, and obtaining a normalization weight vector of a phoneme of the first audio data sample through a feature mapping network based on the pre-training features corresponding to the first audio data sample, wherein the normalization weight vector represents the category of the phoneme of the first audio data sample; and taking the normalized weight vector as a training target corresponding to the first audio data sample, taking a label of a labeled second audio data sample as a training target corresponding to the second audio data sample, training a speech recognition model by using the first audio data sample and the second audio data sample, and performing speech recognition by using the trained speech recognition model, wherein the label of the second audio data sample represents the class of phonemes of the second audio data sample.
Optionally, before the extracting the pre-training feature corresponding to the unmarked first audio data sample by the feature extraction network, the method includes: extracting pre-training features corresponding to the labeled third audio data sample through the feature extraction network; and training the feature mapping network by taking a pre-training feature corresponding to the third audio data sample as input of the feature mapping network and taking a label of the third audio data sample as a training target, wherein the label of the third audio data sample represents the category of a phoneme of the third audio data sample.
Optionally, before the extracting, by the feature extraction network, the pre-training feature corresponding to the labeled third audio data sample, the method includes: constructing training samples of the feature extraction network by using fourth audio data samples without labels, wherein every several training samples are combined to obtain one training sample subset; inputting the training sample subset into the feature extraction network to obtain a network output result corresponding to each training sample in the training sample subset; clustering the network output results corresponding to the training samples to obtain a matching combination of the training samples and a clustering center, and updating the clustering center according to the matching combination; and updating the network parameters of the feature extraction network by back propagation by taking a clustering criterion function as a loss function during the feature extraction network training, wherein the clustering criterion function is constructed according to the network output result and the clustering center.
Optionally, the loss function is constructed by: taking the network output result of the i-th training sample as the i-th target sample y_i, and denoting by c_k the cluster center nearest to the target sample y_i; constructing a first relational expression: the square of the absolute value of the difference between the i-th target sample y_i and its nearest cluster center c_k, i.e. |y_i - c_k|^2; constructing a second relational expression: the sum of the first relational expression over all values of i in the range 1 to M, where M is the number of training samples in a single training sample subset; and taking the second relational expression as the loss function.
Optionally, the constructing a training sample of the feature extraction network by using the fourth audio data sample without label includes: performing time-frequency transformation on the fourth audio data sample to obtain frame-level original voice features of the fourth audio data sample, wherein the frame-level original voice features comprise amplitude spectral vectors and phase spectral vectors of each frame of the fourth audio data sample; based on the frame-level original speech features, fusing the context information of each frame of audio signal in the magnitude spectral vector and the phase spectral vector to construct a training sample of the feature extraction network.
Optionally, the fusing the context information of each frame of audio signal in the magnitude spectrum vector and the phase spectrum vector based on the frame-level original speech features to construct a training sample of the feature extraction network, including: and for any t-th frame voice original feature in the frame-level voice original features, splicing magnitude spectral vectors and phase spectral vectors of all frames from the t-D frame to the t + D frame according to a preset rule to obtain a training sample of the feature extraction network corresponding to the t-th frame, wherein D represents the window length of a preset context.
According to another aspect of the embodiments of the present invention, there is provided a speech recognition apparatus.
A speech recognition apparatus comprising: the normalization weight vector determination module is used for extracting pre-training features corresponding to a first audio data sample without labels through a feature extraction network, and obtaining a normalization weight vector of a phoneme of the first audio data sample through a feature mapping network based on the pre-training features corresponding to the first audio data sample, wherein the normalization weight vector represents the category of the phoneme of the first audio data sample; and the speech recognition model training module is used for taking the normalized weight vector as a training target corresponding to the first audio data sample, taking a label of a labeled second audio data sample as a training target corresponding to the second audio data sample, training a speech recognition model by using the first audio data sample and the second audio data sample, and performing speech recognition by using the trained speech recognition model, wherein the label of the second audio data sample represents the class of phonemes of the second audio data sample.
Optionally, the system further includes a feature mapping network training module, configured to: extracting pre-training features corresponding to the labeled third audio data sample through the feature extraction network; and training the feature mapping network by taking a pre-training feature corresponding to the third audio data sample as an input of the feature mapping network and taking a label of the third audio data sample as a training target, wherein the label of the third audio data sample represents a category of a phoneme of the third audio data sample.
Optionally, the system further comprises a feature extraction network training module, configured to: constructing training samples of the feature extraction network by using fourth audio data samples without labels, wherein each plurality of training samples are combined to obtain a training sample subset; inputting the training sample subset into the feature extraction network to obtain a network output result corresponding to each training sample in the training sample subset; clustering the network output results corresponding to the training samples to obtain a matching combination of the training samples and a clustering center, and updating the clustering center according to the matching combination; and updating the network parameters of the feature extraction network by back propagation by taking a clustering criterion function as a loss function during the feature extraction network training, wherein the clustering criterion function is constructed according to the network output result and the clustering center.
Optionally, the loss function is constructed by: taking the network output result of the i-th training sample as the i-th target sample y_i, and denoting by c_k the cluster center nearest to the target sample y_i; constructing a first relational expression: the square of the absolute value of the difference between the i-th target sample y_i and its nearest cluster center c_k, i.e. |y_i - c_k|^2; constructing a second relational expression: the sum of the first relational expression over all values of i in the range 1 to M, where M is the number of training samples in a single training sample subset; and taking the second relational expression as the loss function.
Optionally, the feature extraction network training module includes a training sample construction sub-module, configured to: performing time-frequency transformation on the fourth audio data sample to obtain frame-level original voice features of the fourth audio data sample, wherein the frame-level original voice features comprise amplitude spectral vectors and phase spectral vectors of each frame of the fourth audio data sample; based on the frame-level original speech features, fusing the context information of each frame of audio signal in the magnitude spectral vector and the phase spectral vector to construct a training sample of the feature extraction network.
Optionally, the training sample construction sub-module is further configured to: and for any t-th frame voice original feature in the frame-level voice original features, splicing magnitude spectrum vectors and phase spectrum vectors of all frames from the t-D frame to the t + D frame according to a preset rule to obtain a training sample of the feature extraction network corresponding to the t-th frame, wherein D represents the window length of a preset context.
According to yet another aspect of an embodiment of the present invention, an electronic device is provided.
An electronic device, comprising: one or more processors; a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speech recognition methods provided by embodiments of the present invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements a speech recognition method provided by an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: pre-training features corresponding to the unlabeled first audio data sample are extracted through a feature extraction network, and a normalized weight vector of the phonemes of the first audio data sample is obtained through a feature mapping network based on those pre-training features, the normalized weight vector representing the class of the phonemes of the first audio data sample; with the normalized weight vector as the training target corresponding to the first audio data sample and the label of the labeled second audio data sample as the training target corresponding to the second audio data sample, a speech recognition model is trained using the first and second audio data samples, and speech recognition is performed using the trained model, the label of the second audio data sample representing the class of the phonemes of the second audio data sample. This can address the data-dependence and speech-characterization problems of speech recognition in various service fields and application scenarios, effectively utilize the large amount of unlabeled audio data in existing speech recognition products to improve recognition performance, reduce manual labeling cost and labeling time, improve labeling accuracy, is suitable for training ultra-large-scale speech recognition, and overcomes the prior-art problems of ignoring speech phase information and of limited capability for modeling complex speech characteristics.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a speech recognition method according to one embodiment of the present invention;
FIG. 2 is a diagram of a speech feature pre-training framework employing clustering criteria according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of the main blocks of a speech recognition apparatus according to one embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram of the main steps of a speech recognition method according to one embodiment of the present invention; as shown in fig. 1, the speech recognition method according to an embodiment of the present invention mainly includes the following steps S101 to S102:
step S101: extracting pre-training features corresponding to the unmarked first audio data sample through a feature extraction network, and obtaining a normalization weight vector of a phoneme of the first audio data sample through a feature mapping network based on the pre-training features corresponding to the first audio data sample, wherein the normalization weight vector represents the category of the phoneme of the first audio data sample;
step S102: and taking the normalized weight vector of the phoneme of the first audio data sample as a training target corresponding to the first audio data sample, taking a label of the labeled second audio data sample as a training target corresponding to the second audio data sample, training a speech recognition model by using the first audio data sample and the second audio data sample, and performing speech recognition by using the trained speech recognition model, wherein the label of the second audio data sample represents the category of the phoneme of the second audio data sample.
The feature extraction network may specifically employ various neural networks capable of performing feature extraction on the audio data. The pre-training features are the output of a feature extraction network.
The feature mapping network may be a neural network classifier.
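As a rough illustration, the feature mapping network can be sketched as a small feed-forward softmax classifier; the patent only states that it may be a neural network classifier, so the architecture and the dimensions below (feature size, hidden size, number of phoneme classes) are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of a feature mapping network: a small feed-forward
# classifier that maps pre-training features to a normalized (softmax)
# weight vector over phoneme classes.  All sizes are illustrative.
class FeatureMappingNetwork(nn.Module):
    def __init__(self, feat_dim=512, num_phonemes=218, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_phonemes),
        )

    def forward(self, pretrain_feat):
        # Softmax turns the logits into the normalized weight vector that
        # represents the phoneme class posterior of the frame.
        return torch.softmax(self.net(pretrain_feat), dim=-1)

# Example: one batch of 32 frame-level pre-training features.
feats = torch.randn(32, 512)
weights = FeatureMappingNetwork()(feats)   # shape (32, 218), rows sum to 1
```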
Before extracting the pre-training features corresponding to the first audio data sample without labels through the feature extraction network, extracting the pre-training features corresponding to the third audio data sample with labels through the feature extraction network; and training the feature mapping network by taking the pre-training feature corresponding to the third audio data sample as the input of the feature mapping network and taking the label of the third audio data sample as a training target, wherein the label of the third audio data sample represents the class of the phoneme of the third audio data sample.
Before extracting the pre-training feature corresponding to the labeled third audio data sample through the feature extraction network, training samples of the feature extraction network may be constructed using the unlabeled fourth audio data samples, where every several training samples are combined into one training sample subset (batch), denoted {x_1, x_2, x_3, ..., x_M}, M being the batch size, a user-defined value; the training sample subset is input into the feature extraction network to obtain the network output result of each training sample in the subset, denoted {y_1, y_2, y_3, ..., y_M}; the network output results corresponding to the training samples are clustered to obtain a matching combination of the training samples and the clustering centers, and the clustering centers are updated according to the matching combination; and a clustering criterion function is taken as the loss function for feature extraction network training, and the network parameters of the feature extraction network are updated by back propagation, the clustering criterion function being constructed from the network output results and the clustering centers.
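A condensed sketch of one such pre-training iteration, assuming the feature extraction network is a torch module and the cluster centers are held in a plain tensor; the optimizer, distance computation and shapes are illustrative, and the cluster-center update itself is performed outside the gradient step (see the update sketch further below).

```python
import torch

def pretrain_step(feature_net, optimizer, batch, centers):
    """One pre-training iteration of the feature extraction network.

    batch   : tensor (M, input_dim), one subset {x_1, ..., x_M}
    centers : tensor (K, output_dim), current cluster centers (no grad)
    """
    outputs = feature_net(batch)                        # {y_1, ..., y_M}
    dists = torch.cdist(outputs, centers)               # (M, K) distances
    assign = dists.argmin(dim=1)                        # pairing with nearest center
    # Clustering criterion: sum of squared distances to the nearest center.
    loss = (dists.gather(1, assign.unsqueeze(1)) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), assign                          # assign is reused to update centers

# Example usage with a toy linear feature extraction network.
feature_net = torch.nn.Linear(5654, 512)
optimizer = torch.optim.SGD(feature_net.parameters(), lr=1e-3)
centers = torch.randn(100, 512)                         # K = 100 cluster centers
batch = torch.randn(32, 5654)                           # M = 32 spliced samples
loss_value, assign = pretrain_step(feature_net, optimizer, batch, centers)
```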
The loss function may specifically be constructed as follows: take the network output result of the i-th training sample as the i-th target sample y_i, and let c_k denote the cluster center nearest to the target sample y_i. Construct a first relational expression: the square of the absolute value of the difference between the i-th target sample y_i and its nearest cluster center c_k, i.e. |y_i - c_k|^2.
Construct a second relational expression: the sum of the first relational expression over all values of i in the range 1 to M, namely:

L = Σ_{i=1}^{M} |y_i - c_k|^2

where M is the number of training samples in a single training sample subset.
The second relational expression, i.e. the clustering criterion function, is taken as the loss function.
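The clustering criterion can be written out directly from the two relational expressions; a small numpy sketch follows (array shapes are illustrative):

```python
import numpy as np

def clustering_criterion(outputs, centers):
    """Loss = sum_i |y_i - c_k|^2, c_k being the center nearest to y_i.
    outputs: (M, d) network outputs, centers: (K, d) cluster centers."""
    diffs = outputs[:, None, :] - centers[None, :, :]    # (M, K, d)
    sq_dist = (diffs ** 2).sum(axis=-1)                   # |y_i - c_j|^2 for all j
    first = sq_dist.min(axis=1)                           # first relational expression
    return first.sum()                                    # second relational expression

# Example: 32 outputs of dimension 512 against 100 cluster centers.
loss = clustering_criterion(np.random.randn(32, 512), np.random.randn(100, 512))
```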
Constructing a training sample of the feature extraction network by using a fourth audio data sample without labels, wherein the method specifically comprises the following steps: performing time-frequency transformation on the fourth audio data sample to obtain frame-level voice original features of the fourth audio data sample, wherein the frame-level voice original features comprise amplitude spectral vectors and phase spectral vectors of all frames of the fourth audio data sample; based on the frame-level original speech features, the context information of each frame of audio signal in the magnitude spectral vector and the phase spectral vector is fused to construct a training sample of the feature extraction network.
The time-frequency transformation of the fourth audio data sample to obtain its frame-level original speech features specifically consists of performing a Fourier transform on each speech time-domain signal of the fourth audio data sample and extracting the two-dimensional magnitude spectrum and two-dimensional phase spectrum of the Fourier transform. The Fourier spectra of the different frequency bands are combined, the magnitude spectrum vector of the t-th frame is denoted A(t) and the phase spectrum vector P(t), and the frame-level original speech features are thus obtained.
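A minimal numpy sketch of this frame-level front end, assuming a 16 kHz signal and illustrative 25 ms frames with a 10 ms hop; the windowing and FFT size are assumptions, since the embodiment only specifies a per-frame Fourier transform with magnitude and phase spectra.

```python
import numpy as np

def frame_level_features(signal, frame_len=400, hop=160, n_fft=512):
    """Split a waveform into frames and take the FFT of each frame,
    returning the magnitude spectra A(t) and phase spectra P(t).
    Assumes len(signal) >= frame_len; sizes are illustrative."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    A, P = [], []
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        spec = np.fft.rfft(frame, n=n_fft)       # one-sided Fourier spectrum
        A.append(np.abs(spec))                   # magnitude spectrum vector A(t)
        P.append(np.angle(spec))                 # phase spectrum vector P(t)
    return np.stack(A), np.stack(P)              # each (n_frames, n_fft//2 + 1)

# Example: 1 s of 16 kHz audio (random stand-in signal).
A, P = frame_level_features(np.random.randn(16000))
```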
Based on the frame-level original speech features, the context information of each frame of audio signal in the amplitude spectrum vector and the phase spectrum vector is fused to construct a training sample of the feature extraction network, and the method specifically comprises the following steps: and for any t-th frame voice original feature in the frame-level voice original features, splicing the magnitude spectral vectors and the phase spectral vectors of all frames from the t-D frame to the t + D frame according to a preset rule to obtain a training sample of the feature extraction network corresponding to the t-th frame, wherein D represents the window length of a preset context. The training sample x (t) of the t-th frame corresponding to the feature extraction network is obtained by the following splicing formula:
x(t) = [A^T(t-D), A^T(t-D+1), ..., A^T(t+D), P^T(t-D), P^T(t-D+1), ..., P^T(t+D)]^T
where D denotes the window length of the context; that is, the above formula splices the magnitude spectra and phase spectra of all frames from time t-D to time t+D.
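A short numpy sketch of the splicing formula above; the handling of frames near the utterance boundaries (repeating the edge frame) is an assumption not specified in the embodiment.

```python
import numpy as np

def splice(A, P, t, D):
    """Build x(t) by concatenating the magnitude and phase spectrum vectors
    of frames t-D .. t+D.  A, P: (n_frames, dim) arrays as produced by the
    frame-level front end; edge frames are repeated (an assumption)."""
    idx = np.clip(np.arange(t - D, t + D + 1), 0, len(A) - 1)
    return np.concatenate([A[idx].ravel(), P[idx].ravel()])

# Example: with D = 5 and 257-dimensional spectra,
# x(t) has 2 * (2*5 + 1) * 257 = 5654 dimensions.
A = np.random.randn(98, 257)
P = np.random.randn(98, 257)
x_t = splice(A, P, t=50, D=5)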
In the above embodiments, the audio specifically refers to voice.
The embodiment of the invention provides a novel speech recognition framework based on semi-supervised learning. On one hand, a clustering criterion is adopted, and a pre-training network for extracting audio features is learned under an unsupervised learning framework, so that the same phoneme has the same clustering center. On the other hand, based on the existing voice recognition model, the voice feature pre-training network and the voice recognition model are further optimized by generating the training labels without labeled data. The speech recognition model may be based on a traditional Hybrid framework or on an end-to-end speech recognition framework.
Feature extraction is one of the key factors that determine speech recognition performance. Conventionally, a group of linear filters called the Mel filter bank is designed on the basis of linear filter theory, fusing characteristics of human auditory perception, so that the filter center frequencies are distributed more densely in the low-frequency region and have larger bandwidths in the high-frequency region. With this filter bank, the filter-bank features (FBANKs) of the speech can be extracted, and a further discrete cosine transform yields the Mel-frequency cepstral coefficients (MFCC). However, such features ignore speech phase information on the one hand, and on the other hand their ability to model complex speech characteristics remains limited. In recent years, neural-network-based speech characterization pre-training models such as WAV2VEC have been proposed, but they are usually based on modeling of the temporal characteristics of speech, and the output speech features have difficulty expressing explicitly the correspondence between speech signals and phonemes.
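For comparison, the conventional FBANK/MFCC front end mentioned above can be sketched with librosa, assuming it is available; the signal and parameter values are illustrative. Both features are built from the magnitude (mel) spectrum only, which is the phase-information loss the proposed features avoid.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr * 2).astype(np.float32)             # stand-in for a 2 s utterance
fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                        hop_length=160, n_mels=40)
log_fbank = np.log(fbank + 1e-10)                           # FBANK features
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160)      # MFCC features
# Both start from the magnitude (mel) spectrum, so the phase spectrum is discarded.
```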
The embodiment of the invention adopts large-scale unmarked voice audio data and provides a voice characteristic pre-training framework adopting a clustering criterion. Fig. 2 is a schematic diagram of a speech feature pre-training framework using a clustering criterion according to an embodiment of the present invention.
First, the clustering parameters in the clustering criterion function are initialized, including the number of cluster centers K and the cluster center vectors c_1, c_2, c_3, ..., c_K.
For time-frequency transformation of large-scale voice data (namely, audio signals, or called audio data), firstly, Fourier transformation is carried out on each voice time domain signal, and a two-dimensional magnitude spectrum and a phase spectrum of the Fourier transformation are extracted. Combining Fourier spectrums of different frequency bands, recording the amplitude spectrum vector of the t frame as A (t), and the phase spectrum vector as P (t), and obtaining the original characteristics of the frame-level voice.
Based on the frame-level original speech features, combining the context information of each frame of audio signal in the amplitude spectrum and the phase spectrum, and constructing a training sample of the feature extraction network, namely:
x(t) = [A^T(t-D), A^T(t-D+1), ..., A^T(t+D), P^T(t-D), P^T(t-D+1), ..., P^T(t+D)]^T
where D denotes the window length of the context; that is, the above formula splices the magnitude spectra and phase spectra of all frames from time t-D to time t+D.
The training samples are randomly combined into subsets (batches), denoted {x_1, x_2, x_3, ..., x_M}, where M is the batch size, a user-defined value. Each input to the feature extraction network is one such subset (batch). Each sample is fed into the current feature extraction network, and the network output result of each training sample is obtained, denoted {y_1, y_2, y_3, ..., y_M}; each y is taken as a target sample.
Calculating a clustering criterion function of the current batch output result:
L = Σ_{i=1}^{M} |y_i - c_k|^2

where c_k is the cluster center nearest to the current target sample y_i, namely c_k = argmin_{c_j} |y_i - c_j|^2. Based on this nearest-distance criterion, the pairing combination of each target sample y with a cluster center can be obtained.
The clustering criterion function is used as the loss function during training of the feature extraction network; according to it, the neural network (namely the feature extraction network) is back-propagated and its network parameters are updated. Meanwhile, based on the obtained pairing combination of each target sample y and its cluster center, each cluster center is updated:
c_k ← (N · c_k + Σ y_i) / (N + P), where the sum runs over the target samples y_i of the current batch paired with c_k

N ← N + P

where N is the number of target samples currently contained in the cluster center c_k, and P is the number of samples of the current batch attributed to c_k.
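A numpy sketch of this incremental center update; since the update formula itself is shown only as a figure in the original, the running-mean form below is reconstructed from the surrounding description (N samples already absorbed, P new samples from the current batch) and should be read as an assumption.

```python
import numpy as np

def update_centers(outputs, assign, centers, counts):
    """Incremental update  c_k <- (N*c_k + sum of assigned y_i) / (N + P),
    N <- N + P,  applied per cluster after each batch.
    outputs: (M, d), assign: (M,) nearest-center indices,
    centers: (K, d), counts: (K,) samples already absorbed by each center."""
    for k in range(len(centers)):
        members = outputs[assign == k]
        P = len(members)                    # samples of this batch paired with c_k
        if P == 0:
            continue
        N = counts[k]
        centers[k] = (N * centers[k] + members.sum(axis=0)) / (N + P)
        counts[k] = N + P
    return centers, counts

# Example with random stand-in data (32 outputs, 100 centers).
outputs = np.random.randn(32, 512)
assign = np.random.randint(0, 100, size=32)
centers, counts = update_centers(outputs, assign,
                                 np.random.randn(100, 512), np.zeros(100))
```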
Because context information (namely the spliced magnitude and phase spectra of all frames from time t-D to time t+D) is taken into account, this feature pre-training method can effectively model the co-articulation of phonemes and corresponds to the triphone model used in speech recognition: syllables in the speech context influence the articulation of the current syllable, for example through liaison or weakened pronunciation, and this co-articulation can be modeled from the context information.
The embodiment of the invention is based on a feature extraction network and utilizes label-free data to perform semi-supervised speech recognition training. The semi-supervised speech recognition training process comprises the following steps:
step one, aiming at the current audio data with labels, each frame of phoneme label can be obtained through recognition according to the existing speech recognition model. Meanwhile, based on the feature extraction network, the pre-training features of each frame of audio data can be obtained at the same time, and the pre-training features are the network output results of the feature extraction network.
And step two, further training a neural network classifier as the feature mapping network, learning the mapping from the pre-training features to the frame-level phoneme labels, and simultaneously fine-tuning the feature extraction network weights.
And step three, after the iteration is completed, for the current audio data without labels, pre-training features are extracted by the feature extraction network, and the softmax weight vectors (normalized weight vectors) of the phonemes are obtained by the feature mapping network to serve as the training targets of the unlabeled data. The softmax weight vector of a phoneme is the output of the feature mapping network and represents the class of the phoneme.
And step four, optimizing the existing speech recognition model according to the training targets of the existing labeled data (namely, the labels of the labeled data) and the generated training targets of the unlabeled data (namely, the softmax weight vectors of the phonemes).
And step five, repeating steps one to four once with the updated speech recognition model, so that the unlabeled data can be trained with a better training target. A condensed sketch of one such round is given below.
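A condensed sketch of one round of steps one to five, assuming for simplicity that the speech recognition model is a frame-level phoneme classifier implemented as a torch module (the embodiment equally allows HMM-DNN hybrid or end-to-end CTC models); all module and variable names are illustrative, and the optimizer is assumed to cover the parameters of all three networks.

```python
import torch
import torch.nn.functional as F

def semi_supervised_round(asr_model, feature_net, mapping_net,
                          labeled_batches, unlabeled_batches, optimizer):
    """One round of steps one to five (illustrative sketch)."""
    # Steps one and two: on labeled data, learn the mapping from
    # pre-training features to frame-level phoneme labels and
    # fine-tune the feature extraction network.
    for feats, frame_labels in labeled_batches:
        log_post = torch.log(mapping_net(feature_net(feats)) + 1e-10)
        loss = F.nll_loss(log_post.view(-1, log_post.size(-1)),
                          frame_labels.view(-1))
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Step three: generate softmax weight vectors (soft targets) for the
    # unlabeled data.  Step four (unlabeled part): train the recognition
    # model against these soft targets.
    for feats in unlabeled_batches:
        with torch.no_grad():
            soft_target = mapping_net(feature_net(feats))
        log_out = torch.log_softmax(asr_model(feats), dim=-1)
        loss = -(soft_target * log_out).sum(dim=-1).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Step four (labeled part): keep training the recognition model on the
    # hard labels of the labeled data.
    for feats, frame_labels in labeled_batches:
        log_out = torch.log_softmax(asr_model(feats), dim=-1)
        loss = F.nll_loss(log_out.view(-1, log_out.size(-1)),
                          frame_labels.view(-1))
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    # Step five: the caller calls this function again with the updated models.
```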
The feature mapping network in step two plays, to a certain extent, the role of an acoustic model for speech recognition; by combining it with the speech recognition model and alternately training and updating the two, a model fusion effect can be achieved.
The novel speech feature pre-training network based on the clustering criterion provided by the embodiment of the invention realizes a feature extraction network that can be built without labels and fused with the speech recognition target; the semi-supervised speech recognition method based on the speech feature pre-training network can effectively utilize the unlabeled data.
Fig. 3 is a schematic diagram of main blocks of a speech recognition apparatus according to an embodiment of the present invention.
As shown in fig. 3, a speech recognition apparatus 300 according to an embodiment of the present invention mainly includes: a normalized weight vector determination module 301 and a speech recognition model training module 302.
The normalization weight vector determination module 301 is configured to extract a pre-training feature corresponding to the unlabeled first audio data sample through a feature extraction network, and obtain a normalization weight vector of a phoneme of the first audio data sample through a feature mapping network based on the pre-training feature corresponding to the first audio data sample, where the normalization weight vector represents a category of the phoneme of the first audio data sample;
the speech recognition model training module 302 is configured to train a speech recognition model using the first audio data sample and the second audio data sample with a normalized weight vector as a training target corresponding to the first audio data sample and a label of the labeled second audio data sample as a training target corresponding to the second audio data sample, and perform speech recognition using the trained speech recognition model, where the label of the second audio data sample represents a phoneme category of the second audio data sample.
The speech recognition device 300 may further include a feature mapping network training module to: extracting pre-training characteristics corresponding to the labeled third audio data sample through a characteristic extraction network; and training the feature mapping network by taking the pre-training feature corresponding to the third audio data sample as the input of the feature mapping network and taking the label of the third audio data sample as a training target, wherein the label of the third audio data sample represents the class of the phoneme of the third audio data sample.
The speech recognition apparatus 300 may further include a feature extraction network training module for: constructing training samples of a feature extraction network by using fourth audio data samples without labels, wherein each training sample is combined to obtain a training sample subset; inputting the training sample subset into a feature extraction network to obtain a network output result corresponding to each training sample in the training sample subset; clustering the network output results corresponding to the training samples to obtain a matching combination of the training samples and a clustering center, and updating the clustering center according to the matching combination; and taking a clustering criterion function as a loss function during the training of the feature extraction network, updating the network parameters of the feature extraction network through back propagation, wherein the clustering criterion function is constructed according to the network output result and the clustering center.
The loss function can be constructed as follows: taking the network output result of the i-th training sample as the i-th target sample y_i, and denoting by c_k the cluster center nearest to the target sample y_i; constructing a first relational expression: the square of the absolute value of the difference between the i-th target sample y_i and its nearest cluster center c_k, i.e. |y_i - c_k|^2; constructing a second relational expression: the sum of the first relational expression over all values of i in the range 1 to M, where M is the number of training samples in a single training sample subset; and taking the second relational expression as the loss function.
The feature extraction network training module may include a training sample construction sub-module to: performing time-frequency transformation on the fourth audio data sample to obtain frame-level voice original features of the fourth audio data sample, wherein the frame-level voice original features comprise amplitude spectral vectors and phase spectral vectors of all frames of the fourth audio data sample; based on the frame-level original speech features, the context information of each frame of audio signal in the magnitude spectral vector and the phase spectral vector is fused to construct a training sample of the feature extraction network.
The training sample construction submodule is further configured to: and for any t-th frame voice original feature in the frame-level voice original features, splicing the magnitude spectral vectors and the phase spectral vectors of all frames from the t-D frame to the t + D frame according to a preset rule to obtain a training sample of the feature extraction network corresponding to the t-th frame, wherein D represents the window length of a preset context.
In addition, the specific implementation of the speech recognition device in the embodiment of the present invention has been described in detail in the above speech recognition method, and therefore, the repeated content will not be described again.
Fig. 4 shows an exemplary system architecture 400 to which the speech recognition method or the speech recognition apparatus of an embodiment of the present invention can be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 401, 402, 403. The background management server may analyze and otherwise process the received data, such as the audio information, and feed back a processing result (e.g., a speech recognition result, which is merely an example) to the terminal device.
It should be noted that the voice recognition method provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the voice recognition apparatus is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing a terminal device or server of an embodiment of the present application is shown. The terminal device or the server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the main step schematic may be implemented as computer software programs. For example, the disclosed embodiments of the invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the main step diagram. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The principal step diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the main step diagrams or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or block diagrams, and combinations of blocks in the block diagrams or block diagrams, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a normalized weight vector determination module and a speech recognition model training module. For example, the normalized weight vector determination module may also be described as "a module for extracting a pre-training feature corresponding to the unlabeled first audio data sample through a feature extraction network, and obtaining a normalized weight vector of a phoneme of the first audio data sample through a feature mapping network based on the pre-training feature corresponding to the first audio data sample".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: extracting pre-training features corresponding to a first audio data sample without labels through a feature extraction network, and obtaining a normalization weight vector of a phoneme of the first audio data sample through a feature mapping network based on the pre-training features corresponding to the first audio data sample, wherein the normalization weight vector represents the category of the phoneme of the first audio data sample; and taking the normalized weight vector as a training target corresponding to the first audio data sample, taking a label of a labeled second audio data sample as a training target corresponding to the second audio data sample, training a speech recognition model by using the first audio data sample and the second audio data sample, and performing speech recognition by using the trained speech recognition model, wherein the label of the second audio data sample represents the category of phonemes of the second audio data sample.
According to the technical scheme of the embodiment of the invention, the problems of data dependence and voice characterization of voice recognition in various service fields and application scenes can be solved, a large amount of non-labeled audio data generated every day in the existing voice recognition product can be effectively utilized to improve the performance of the existing voice recognition, the manual labeling cost is reduced, the labeling time consumption is reduced, the labeling accuracy is improved, the voice recognition method and the voice recognition system are suitable for training of ultra-large-scale voice recognition, and the problems that voice phase information is ignored and the capability of modeling complex voice characteristics has defects in the prior art are solved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (14)

1. A speech recognition method, comprising:
extracting pre-training features corresponding to a first audio data sample without labels through a feature extraction network, and obtaining a normalization weight vector of a phoneme of the first audio data sample through a feature mapping network based on the pre-training features corresponding to the first audio data sample, wherein the normalization weight vector represents the category of the phoneme of the first audio data sample;
and taking the normalized weight vector as a training target corresponding to the first audio data sample, taking a label of a labeled second audio data sample as a training target corresponding to the second audio data sample, training a speech recognition model by using the first audio data sample and the second audio data sample, and performing speech recognition by using the trained speech recognition model, wherein the label of the second audio data sample represents the class of phonemes of the second audio data sample.
2. The method of claim 1, wherein before extracting the pre-training features corresponding to the unlabeled first audio data sample through the feature extraction network, the method comprises:
extracting pre-training features corresponding to the labeled third audio data sample through the feature extraction network;
and training the feature mapping network by taking a pre-training feature corresponding to the third audio data sample as an input of the feature mapping network and taking a label of the third audio data sample as a training target, wherein the label of the third audio data sample represents a category of a phoneme of the third audio data sample.
3. The method of claim 2, wherein before extracting the pre-training feature corresponding to the labeled third audio data sample by the feature extraction network, the method comprises:
constructing training samples of the feature extraction network by using fourth audio data samples without labels, wherein each plurality of training samples are combined to obtain a training sample subset;
inputting the training sample subset into the feature extraction network to obtain a network output result corresponding to each training sample in the training sample subset;
clustering the network output results corresponding to the training samples to obtain a matching combination of the training samples and a clustering center, and updating the clustering center according to the matching combination;
and updating the network parameters of the feature extraction network by back propagation by taking a clustering criterion function as a loss function during the feature extraction network training, wherein the clustering criterion function is constructed according to the network output result and the clustering center.
4. The method of claim 3, wherein the loss function is constructed by:
taking the network output result of the i-th training sample as the i-th target sample y_i, and denoting by c_k the cluster center nearest to the target sample y_i; constructing a first relational expression: the square of the absolute value of the difference between the i-th target sample y_i and its nearest cluster center c_k, i.e. |y_i - c_k|^2; constructing a second relational expression: the sum of the first relational expression over all values of i in the range 1 to M, where M is the number of training samples in a single training sample subset; and taking the second relational expression as the loss function.
5. The method of claim 3, wherein constructing the training samples of the feature extraction network using fourth unlabeled audio data samples comprises:
performing time-frequency transformation on the fourth audio data sample to obtain frame-level original voice features of the fourth audio data sample, wherein the frame-level original voice features comprise amplitude spectral vectors and phase spectral vectors of each frame of the fourth audio data sample;
based on the frame-level original voice features, fusing the context information of each frame of audio signal in the magnitude spectrum vector and the phase spectrum vector to construct a training sample of the feature extraction network.
6. The method according to claim 5, wherein fusing, based on the frame-level original speech features, the context information of the magnitude spectral vector and the phase spectral vector of each frame of the audio signal to construct the training samples of the feature extraction network comprises:
for any t-th frame of original speech features among the frame-level original speech features, splicing the magnitude spectral vectors and the phase spectral vectors of all frames from the (t-D)-th frame to the (t+D)-th frame according to a preset rule to obtain the training sample of the feature extraction network corresponding to the t-th frame, wherein D represents the window length of the preset context.
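One possible realization of the splicing in claim 6 is sketched below; the concrete "preset rule" (magnitude followed by phase, with frames clamped at the utterance edges) is an assumption, as the claim only fixes the context window length D.

```python
# One possible splicing rule for claim 6: concatenate magnitude and phase vectors
# of frames t-D .. t+D. Clamping at the utterance boundaries is an assumption.
import numpy as np

def splice_context(frames, D=5):
    """frames: list of (magnitude, phase) pairs, e.g. from frame_level_features()."""
    samples = []
    T = len(frames)
    for t in range(T):
        pieces = []
        for tau in range(t - D, t + D + 1):
            tau = min(max(tau, 0), T - 1)       # clamp indices at the edges
            mag, phase = frames[tau]
            pieces.extend([mag, phase])         # "preset rule": magnitude then phase
        samples.append(np.concatenate(pieces))  # training sample for frame t
    return samples
```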
7. A speech recognition apparatus, comprising:
a normalized weight vector determination module, configured to extract pre-training features corresponding to an unlabeled first audio data sample through a feature extraction network, and to obtain a normalized weight vector of a phoneme of the first audio data sample through a feature mapping network based on the pre-training features corresponding to the first audio data sample, wherein the normalized weight vector represents the phoneme class of the first audio data sample;
and a speech recognition model training module, configured to take the normalized weight vector as the training target corresponding to the first audio data sample, take the label of a labeled second audio data sample as the training target corresponding to the second audio data sample, train a speech recognition model with the first audio data sample and the second audio data sample, and perform speech recognition with the trained speech recognition model, wherein the label of the second audio data sample represents the phoneme class of the second audio data sample.
8. The apparatus of claim 7, further comprising a feature mapping network training module to:
extracting pre-training features corresponding to the labeled third audio data sample through the feature extraction network;
and training the feature mapping network by taking the pre-training features corresponding to the third audio data sample as the input of the feature mapping network and taking the label of the third audio data sample as the training target, wherein the label of the third audio data sample represents the phoneme class of the third audio data sample.
9. The apparatus of claim 8, further comprising a feature extraction network training module to:
constructing training samples of the feature extraction network by using unlabeled fourth audio data samples, wherein every group of several training samples is combined into a training sample subset;
inputting the training sample subset into the feature extraction network to obtain a network output result corresponding to each training sample in the training sample subset;
clustering the network output results corresponding to the training samples to obtain a matching between the training samples and the clustering centers, and updating the clustering centers according to the matching;
and updating the network parameters of the feature extraction network by back propagation, taking a clustering criterion function as the loss function during the training of the feature extraction network, wherein the clustering criterion function is constructed from the network output results and the clustering centers.
10. The apparatus of claim 9, wherein the loss function is constructed by:
taking the network output result of the i-th training sample as the i-th target sample y_i, and letting c_k denote the clustering center nearest to the target sample y_i; constructing a first relation as the square of the absolute value of the difference between the i-th target sample y_i and its nearest clustering center c_k; constructing a second relation as the sum of the first relation over every value of i in the range from 1 to M, wherein M is the number of training samples in a single training sample subset; and taking the second relation as the loss function.
11. The apparatus of claim 9, wherein the feature extraction network training module comprises a training sample construction sub-module configured to:
performing time-frequency transformation on the fourth audio data sample to obtain frame-level original speech features of the fourth audio data sample, wherein the frame-level original speech features comprise a magnitude spectral vector and a phase spectral vector of each frame of the fourth audio data sample;
based on the frame-level original speech features, fusing the context information of the magnitude spectral vector and the phase spectral vector of each frame of the audio signal to construct the training samples of the feature extraction network.
12. The apparatus of claim 11, wherein the training sample construction sub-module is further configured to:
for any t-th frame of original speech features among the frame-level original speech features, splicing the magnitude spectral vectors and the phase spectral vectors of all frames from the (t-D)-th frame to the (t+D)-th frame according to a preset rule to obtain the training sample of the feature extraction network corresponding to the t-th frame, wherein D represents the window length of the preset context.
13. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-6.
14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202210312961.0A 2022-03-28 2022-03-28 Voice recognition method and device Pending CN114550702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210312961.0A CN114550702A (en) 2022-03-28 2022-03-28 Voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210312961.0A CN114550702A (en) 2022-03-28 2022-03-28 Voice recognition method and device

Publications (1)

Publication Number Publication Date
CN114550702A true CN114550702A (en) 2022-05-27

Family

ID=81666010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210312961.0A Pending CN114550702A (en) 2022-03-28 2022-03-28 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN114550702A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115881131A (en) * 2022-11-17 2023-03-31 广州市保伦电子有限公司 Voice transcription method under multiple voices
CN115881131B (en) * 2022-11-17 2023-10-13 广东保伦电子股份有限公司 Voice transcription method under multiple voices
CN115993503A (en) * 2023-03-22 2023-04-21 广东电网有限责任公司东莞供电局 Operation detection method, device and equipment of transformer and storage medium

Similar Documents

Publication Publication Date Title
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN107945786B (en) Speech synthesis method and device
CN110827805B (en) Speech recognition model training method, speech recognition method and device
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
WO2019174450A1 (en) Dialogue generation method and apparatus
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN114550702A (en) Voice recognition method and device
CN112259089A (en) Voice recognition method and device
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN109697978B (en) Method and apparatus for generating a model
CN113450759A (en) Voice generation method, device, electronic equipment and storage medium
WO2020220824A1 (en) Voice recognition method and device
CN111625649A (en) Text processing method and device, electronic equipment and medium
CN113257283A (en) Audio signal processing method and device, electronic equipment and storage medium
CN110930975B (en) Method and device for outputting information
CN113689868B (en) Training method and device of voice conversion model, electronic equipment and medium
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN110675865B (en) Method and apparatus for training hybrid language recognition models
US11036996B2 (en) Method and apparatus for determining (raw) video materials for news
CN112017690A (en) Audio processing method, device, equipment and medium
CN114783409A (en) Training method of speech synthesis model, speech synthesis method and device
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium
CN114512121A (en) Speech synthesis method, model training method and device
CN113689866A (en) Training method and device of voice conversion model, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination