CN113920987A - Voice recognition method, device, equipment and storage medium - Google Patents

Voice recognition method, device, equipment and storage medium

Info

Publication number
CN113920987A
Authority
CN
China
Prior art keywords
domain
feature extraction
voice data
phonetic feature
source domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111296840.3A
Other languages
Chinese (zh)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111296840.3A
Publication of CN113920987A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/33 Speech or voice analysis techniques characterised by the analysis technique using fuzzy logic
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The present disclosure provides a speech recognition method, apparatus, device, and storage medium, relating to the field of artificial intelligence and in particular to the field of speech technology. The implementation scheme is as follows: first, phonetic features of target domain voice data are extracted using a domain-adversarial phonetic feature extraction model, which comprises a phonetic feature extraction part and a domain classification part; the domain classification part introduces adversarial training through a gradient reversal layer to blur the domain characteristics of the phonetic features. Then, speech recognition target features of the target domain voice data are extracted. Finally, the speech recognition result of the target domain voice data is determined jointly from the phonetic features and the speech recognition target features. In this way, the influence of the speaker's own characteristics, such as age, gender, and timbre, is better reduced, making speech recognition more accurate and making it easier to migrate the method from one domain to another.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the field of speech technology, and further relates to a method, an apparatus, a device, and a storage medium for speech recognition.
Background
Speech recognition models are prone to errors due to environmental noise, accent differences, speaking-style differences, topic shifts, speaker variation, and other factors.
In addition, migrating a speech recognition model from one domain to another often requires a large amount of labeled data in the target domain. For domains without labeled data, considerable manpower and time are needed to label the data.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for speech recognition.
According to an aspect of the present disclosure, there is provided a speech recognition method, including: acquiring target domain voice data; extracting phonetic features of the target domain voice data using a domain-adversarial phonetic feature extraction model, wherein the domain-adversarial phonetic feature extraction model includes a phonetic feature extraction part and a domain classification part, and the domain classification part introduces adversarial training through a gradient reversal layer to blur the domain characteristics of the phonetic features; extracting speech recognition target features of the target domain voice data; and determining a speech recognition result of the target domain voice data according to the phonetic features and the speech recognition target features.
According to another aspect of the present disclosure, there is provided a speech recognition apparatus, including: a target domain voice data acquisition module for acquiring target domain voice data; a phonetic feature extraction module for extracting phonetic features of the target domain voice data using a domain-adversarial phonetic feature extraction model, the model including a phonetic feature extraction part and a domain classification part, where the domain classification part introduces adversarial training through a gradient reversal layer to blur the domain characteristics of the phonetic features; a speech recognition target feature extraction module for extracting speech recognition target features of the target domain voice data; and a speech recognition result determination module for determining a speech recognition result of the target domain voice data according to the phonetic features and the speech recognition target features.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the speech recognition methods described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform any of the above-described methods of speech recognition.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements any of the speech recognition methods described above.
The present disclosure provides a speech recognition method, apparatus, device, and storage medium. By extracting phonetic features, the influence of the speaker's own characteristics, such as age, gender, and timbre, can be reduced, making speech recognition more accurate. In addition, the phonetic feature extraction model used in the method includes a domain classification part that introduces adversarial training through a gradient reversal layer, which blurs the domain characteristics of the phonetic features and makes it easier to migrate the speech recognition method from one domain to another.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flowchart illustrating a method for implementing speech recognition according to a first embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of training a phonetic feature extraction model according to a first embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a model framework for a second embodiment of the phoneme recognition application of the present disclosure;
FIG. 4 is a flowchart illustrating a method for implementing speech recognition according to a second embodiment of the present disclosure;
FIG. 5 is a schematic network structure diagram of a phonetic feature extraction model according to a second embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of training a phonetic feature extraction model according to a second embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a comparison between pronunciation phonetic labels and some phonemes according to the second embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a speech recognition apparatus according to a first embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device for implementing the speech recognition method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments are included to assist understanding, and which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A model-based speech recognition approach depends largely on the model's algorithm and on the data used to train the model. Therefore, when a model trained on speech data from an existing domain is applied to an unseen domain, prediction accuracy often drops significantly.
The domains mentioned in this disclosure are broad and may refer to speech sources, application scenarios, or professional fields. Voice data includes audio or feature data extracted from audio, and the method is particularly suited to audio containing live conversations.
The speech recognition method provided by the present disclosure is particularly suitable for migrating a speech recognition method from a domain in which it has already been successfully applied to a domain in which it has not yet been applied. The domain in which the method has been successfully applied is the source domain; the domain in which the method has not been applied is the target domain.
Fig. 1 shows a schematic implementation flow of the first embodiment of the speech recognition method of the present disclosure. As shown in fig. 1, the method includes: operation S110, acquiring target domain voice data; operation S120, extracting phonetic features of the target domain voice data using a domain-adversarial phonetic feature extraction model, where the model includes a phonetic feature extraction part and a domain classification part, and the domain classification part introduces adversarial training through a gradient reversal layer to blur the domain characteristics of the phonetic features; operation S130, extracting speech recognition target features of the target domain voice data; and operation S140, determining a speech recognition result of the target domain voice data according to the phonetic features and the speech recognition target features.
In operation S110, the target domain voice data is consistent with the source domain voice data in data format or form, but because the application domain differs it often has different characteristics. For example, the speech source may change from robot speech to human speech; the application scenario may migrate from product recommendation to knowledge question answering; or the specialized vocabulary may shift from the electronics domain to the mechanical domain.
In operation S120, the phonetic features mainly refer to articulatory (pronunciation) linguistic features, that is, the patterns by which the articulatory organs coordinate to produce speech, such as tongue height and backness, lip rounding and spreading, and jaw raising and lowering, which distinguish different utterances. These features are largely unaffected by personal characteristics (e.g., age, gender, timbre).
The domain-adversarial phonetic feature extraction model is trained with data from different domains, and during training adversarial training between domains is used to blur the differences between them. The training samples are perturbed by mixing in adversarial samples (samples with only small changes that are nevertheless likely to be misclassified), and the neural network is then made to adapt to these changes so that it becomes robust to adversarial samples. In this way, when phonetic features are extracted, features with as little association with the domain as possible are extracted.
In this embodiment, a domain classification part is introduced into the domain-adversarial phonetic feature extraction model, and the adversarial signal is introduced through a gradient reversal layer in the domain classification part, so the adversarial training is more efficient and converges more quickly.
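Purely as an illustration of how such a gradient reversal layer is commonly realized (not code from the patent), the following sketch assumes PyTorch: the layer is an identity in the forward pass and flips (and optionally scales) the gradient in the backward pass. All names are illustrative.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature
        # extractor, so it learns features the domain classifier cannot use.
        return -ctx.lambd * grad_output, None


def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)
```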
In operation S130, the speech recognition target features are the features that need to be extracted for the given speech recognition purpose. If the purpose is phoneme recognition, the target features may be spectral features associated with phonemes; if the purpose is intent recognition, the target features may be tone features, semantic features, and the like associated with the intent.
In operation S140, the phonetic features extracted in operation S120 are introduced into the speech recognition process, and the speech recognition result is determined from the combination of the speech recognition target features and the phonetic features. This greatly reduces the loss of applicability and recognition accuracy that an original speech recognition method may suffer when applied to a target domain with different speaker characteristics, sound sources, application scenarios, or professional fields.
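A minimal sketch of the overall inference flow of operations S120–S140, under the assumption that the models are pre-trained PyTorch modules; the function and argument names (`phonetic_extractor`, `recognizer`, `audio_features`) are hypothetical placeholders, not identifiers from the patent.

```python
import torch

def recognize(audio_features: torch.Tensor,
              phonetic_extractor: torch.nn.Module,
              recognizer: torch.nn.Module) -> torch.Tensor:
    """Combine the speech recognition target features (here, the spectral
    features themselves) with the extracted phonetic features and decode."""
    with torch.no_grad():
        phonetic_feats = phonetic_extractor(audio_features)             # operation S120
        combined = torch.cat([audio_features, phonetic_feats], dim=-1)  # operation S140: joint input
        logits = recognizer(combined)
    return logits.argmax(dim=-1)
```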
It should be noted that the phonetic feature extraction model used in operation S120 is generally a model that has been trained and can be applied in practice.
Before the method is applied to a target domain, the preset domain-adversarial phonetic feature extraction model needs to be trained until it reaches a certain model precision and recognition accuracy; only then is a practically applicable domain-adversarial phonetic feature extraction model obtained.
The preset domain-adversarial phonetic feature extraction model refers to the domain-adversarial phonetic feature extraction model to be trained, whose model algorithm, initial network parameter values, and so on are preset.
The training process mainly includes the following steps: input the training data into the preset domain-adversarial phonetic feature extraction model to obtain the corresponding output; compute a loss function from that output and the labels of the training data; and update the network parameters of the preset model according to the loss function. Repeat this iteration until the model converges and reaches the preset model precision and accuracy, which yields a domain-adversarial phonetic feature extraction model that can be applied in practice.
Fig. 2 shows, for this embodiment, the main operations of training the preset domain-adversarial phonetic feature extraction model, which include: operation S210, acquiring first source domain voice data, where the first source domain voice data is labeled with phonetic feature labels; operation S220, acquiring first non-source domain voice data; and operation S230, performing adversarial training on the preset domain-adversarial phonetic feature extraction model using the first source domain voice data and the first non-source domain voice data to obtain the domain-adversarial phonetic feature extraction model. When the speech recognition method is applied in the source domain, the source domain data is usually already labeled, and corresponding phonetic features are usually annotated; accordingly, in operation S210, existing training data can be used as the source domain voice data. To distinguish it from source domain voice data labeled with other labels for training other models, it is referred to here as the first source domain voice data.
In operation S220, non-source domain voice data refers to voice data belonging to a domain different from the source domain, so that domain adversarial training can achieve the purpose of blurring domain characteristics.
Generally, a preset phonetic feature extraction model has been trained only with source domain voice data, not with non-source domain voice data; therefore, most of the non-source domain data in this embodiment is unlabeled voice data. To distinguish it from non-source domain voice data used to train other models, it is referred to here as the first non-source domain voice data.
To further blur the domain boundaries, the domain of the first non-source domain voice data may even differ from the target domain.
However, when the proportion of voice data from domains other than the target domain is too large, it may hinder improving the applicability and accuracy of the model in the target domain.
It is therefore recommended that the non-source domain voice data consist mainly of target domain voice data, so that the model better adapts to the characteristics of the target domain and gives more accurate recognition results when the method is applied there.
If some non-source domain voice data from domains other than the target domain must be used to further blur the domain characteristics, its proportion should be kept relatively low, and domains close to the target domain should be chosen where possible.
In operation S230, domain adversarial training is performed on the domain classification part using the target domain voice data and the source domain voice data, so that the discriminator in the adversarial network ultimately cannot recognize which domain the voice data belongs to. The domain classification produced by the domain classification part then no longer truly reflects the domain of the voice data, which blurs its domain characteristics.
After the training process is completed, the model may also be validated using test data.
It should be noted that the embodiments shown in fig. 1 and fig. 2 are only basic embodiments of the speech recognition method of the present disclosure; an implementer may further refine and expand them and apply them in various application scenarios.
Figs. 3 to 7 show another embodiment, in which the speech recognition method of the present disclosure is applied to phoneme recognition. This embodiment can be used to identify the phonemes in input audio and thereby further determine which words the audio contains.
In this embodiment, phonemes are recognized by a phoneme recognition model. When the speech recognition result of the target domain voice data is determined according to the phonetic features and the speech recognition target features, the model framework shown in fig. 3 is adopted; the specific flow, shown in fig. 4, includes:
Step S4010, receiving audio 30;
the audio 30 is typically pre-processed to remove noise (including ambient noise, busy tones, ringing tones, etc.) and data enhancement (including changing speech rates, mixing echoes, etc.) to simplify subsequent processing and reduce the impact of noisy data on subsequent data processing, thereby allowing more accurate and representative features to be extracted therefrom.
Step S4020, inputting the received audio 30 into a sound spectrum feature extraction model 31 to extract spectral features;
Step S4030, inputting the extracted spectral features into the phonetic feature extraction model 32 to obtain the phonetic features corresponding to those spectral features;
the phonetic feature extraction model 32 is a phonetic feature extraction model of a domain confrontation that has undergone domain confrontation training, and has a structure as shown in fig. 5.
The phonetic feature extraction model 32 includes a shared input layer 321 and a common feature extraction layer 322, which then feed an upper-branch phonetic feature extraction layer 323 and a lower-branch domain classification layer 324, respectively.
The shared input layer 321 and the common feature extraction layer 322 are each a multi-layer time-delay neural network (TDNN) and include a sigmoid activation layer.
The input of the upper-branch phonetic feature extraction layer 323 is the spectral feature of each audio frame, such as MFCC, PLP, or Fbank features; the dimension may be 20, each frame may be 25 ms long with a 10 ms frame shift, and a certain number of preceding and following frames (for example, 2 frames each) are usually provided as context during computation. The output is the probability of the phonetic features corresponding to the input frame (typically over multiple labels).
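For example, frame-level MFCC features with the parameters just mentioned (20 coefficients, 25 ms window, 10 ms shift) could be computed as follows; this sketch assumes the librosa library and a 16 kHz sampling rate, both of which are illustrative assumptions rather than details from the patent.

```python
import librosa
import numpy as np

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    """Return a (num_frames, n_mfcc) matrix of frame-level MFCC features."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame shift
    )
    return mfcc.T
```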
The input of the lower-branch domain classification layer 324 is the same as that of the phonetic feature extraction layer 323, and its output is the domain probability for the input frame (for example, the source domain may be labeled 0 and the target domain 1).
Here, when training the preset phonetic feature extraction model to obtain the phonetic feature extraction model 32, the loss function used is cross entropy (CE).
In addition, the lower branch introduces domain adaptation through the gradient reversal layer (the part indicated by the dashed arrow in fig. 5) to blur the boundaries between domains, leaving the network without the ability to distinguish them.
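To make the two-branch structure of fig. 5 concrete, here is a minimal PyTorch sketch: a shared trunk of fully-connected layers with sigmoid activations over context-spliced frames stands in for the TDNN layers, a phonetic-feature head forms the upper branch, and a domain head behind the gradient reversal layer sketched earlier forms the lower branch. The layer sizes, number of labels, and class names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DomainAdversarialPhoneticModel(nn.Module):
    def __init__(self, feat_dim: int = 20, context: int = 2, hidden: int = 256,
                 n_phonetic_labels: int = 64, n_domains: int = 2):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1)   # e.g. 20-dim MFCC with +/-2 context frames
        # Shared input layer 321 and common feature extraction layer 322.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        # Upper branch 323: phonetic feature probabilities.
        self.phonetic_head = nn.Linear(hidden, n_phonetic_labels)
        # Lower branch 324: domain probabilities (source = 0, target = 1).
        self.domain_head = nn.Linear(hidden, n_domains)

    def forward(self, x: torch.Tensor, grl_lambda: float = 1.0):
        h = self.shared(x)
        phonetic_logits = self.phonetic_head(h)
        # The gradient reversal layer sits between the trunk and the domain head.
        domain_logits = self.domain_head(grad_reverse(h, grl_lambda))
        return phonetic_logits, domain_logits
```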
It should be noted that the phonetic feature extraction model 32 is obtained by pre-training a preset phonetic feature extraction model and then serves as the phonetic feature extractor of the phoneme recognition system. During the unsupervised or semi-supervised training that yields the phonetic feature extraction model 32, the source domain data and the small amount of target domain data labeled with phonetic features are fed to both the upper-branch phonetic feature extraction layer and the lower-branch domain classification layer, while target domain data without phonetic feature labels is fed only to the lower-branch domain classification layer.
In this way, without compromising the accuracy of the phonetic features, the phonetic feature extraction model 32 no longer extracts domain-specific features, which reduces interference with phonetic feature extraction and improves the accuracy of subsequent phoneme recognition.
Step S4040, splicing the spectral features extracted by the sound spectrum feature extraction model 31 with the corresponding phonetic features output by the phonetic feature extraction model 32 to obtain a model input vector;
Step S4050, inputting the model input vector into the phoneme recognition model 33 and performing phoneme recognition to obtain the phonemes and their probabilities 34.
The phoneme recognition model 33 is a 2-layer deep neural network (DNN); its input is the vector obtained by splicing the above spectral feature with its corresponding phonetic feature, and its output is each phoneme and its probability.
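A minimal sketch of steps S4040–S4050 under the same PyTorch assumption: the spectral and phonetic features are concatenated and passed through a small two-layer network with a softmax over phonemes. All dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    """2-layer DNN over the spliced [spectral ; phonetic] input vector."""

    def __init__(self, spectral_dim: int = 100, phonetic_dim: int = 64,
                 hidden: int = 256, n_phonemes: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spectral_dim + phonetic_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_phonemes),
        )

    def forward(self, spectral: torch.Tensor, phonetic: torch.Tensor) -> torch.Tensor:
        x = torch.cat([spectral, phonetic], dim=-1)   # step S4040: splice the features
        return self.net(x).softmax(dim=-1)            # step S4050: phoneme probabilities
```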
In general, a phoneme recognition model 33 that can be applied directly to the target domain can be obtained by training the preset phoneme recognition model only with labeled source domain audio data. If a small amount of labeled target domain data is available, the preset phoneme recognition model can additionally be tuned with that data to obtain a phoneme recognition model 33 with better performance.
In training a preset phoneme recognition model to obtain the phoneme recognition model 33, the cross entropy may be used as a loss function.
Step S4060, determining the phoneme recognition result from the phonemes and their probabilities 34; the phoneme with the highest probability is generally taken.
In this embodiment, since both the phoneme recognition model 33 and the phonetic feature extraction model 32 take spectral features as their input, they can share the sound spectrum feature extraction model 31, which simplifies the implementation.
Moreover, the recognition result is determined jointly from the input vector spliced from the phonetic features and the spectral features, so deviations caused by differences in speaker characteristics or domains can be largely ignored during phoneme recognition, and the phoneme recognition system shown in fig. 3 can be migrated more easily to different domains and application scenarios. Thanks to its high accuracy, the phoneme recognition method is especially advantageous in application scenarios that require real-time recognition, such as converting speech into text in real time as a conversation record, or converting one language into another in real time.
As mentioned above, the effectiveness of a model in application is inseparable from its prior training. For the phoneme recognition application of fig. 3, the training process performed before application, shown in fig. 6, mainly includes:
Step S6010: collecting a certain amount (e.g., 100,000+) of source domain audio and its corresponding annotations (phonemes and phonetic labels);
Step S6020, pre-processing the audio, including noise removal (ambient noise, busy tones, ring-back tones, etc.) and data augmentation (changing the speech rate, mixing in echo, etc.);
Step S6030, collecting a certain amount (e.g., 100,000+) of target domain audio, which may be unlabeled or only lightly labeled;
in this embodiment, the real data of the target domain is used for training. Thus, the trained model can be better suitable for the target domain, and the voice recognition result is more accurate. Step S6040, training a preset phonetic feature extraction model to obtain a phonetic feature extraction model 32;
the preset acoustic feature extraction model is an acoustic feature extraction model to be trained, a model algorithm and initial values of network parameters are preset, and the acoustic feature extraction model 32 which can be practically applied and is shown in fig. 3 can be obtained after training.
The source domain audio processed in step S6020 is input into the sound spectrum feature extraction model 31 to extract spectral features; the extracted spectral features are then input into the preset phonetic feature extraction model, with the domain label of the source domain audio set to "0", and fed to both branches: the phonetic feature extraction layer and the domain classification layer.
The domain label of the target domain audio is set to 1 and the audio is fed to the lower-branch domain classification layer; if a phonetic feature label exists, it can also be fed to the upper-branch phonetic feature extraction layer so that this layer can be tuned for the target domain.
Then the loss of the preset phonetic feature extraction model's network is computed according to the loss function, and the network parameters are updated by back-propagation with stochastic gradient descent; this is iterated repeatedly until the network converges, yielding the domain-adversarial phonetic feature extractor.
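A compact sketch of one such training iteration, assuming PyTorch, the two-headed model sketched above, cross-entropy losses on both branches, and domain labels source = 0, target = 1 as described; the gradient reversal layer turns minimization of the domain loss into adversarial training of the shared trunk. Batch construction and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, src_x, src_phonetic_y, tgt_x, grl_lambda=1.0):
    """One domain-adversarial update: the source batch carries phonetic labels,
    the target batch contributes only to the domain classification loss."""
    optimizer.zero_grad()

    src_phonetic, src_domain = model(src_x, grl_lambda)
    _, tgt_domain = model(tgt_x, grl_lambda)

    # Cross entropy on the phonetic labels (labeled data only).
    loss_phonetic = F.cross_entropy(src_phonetic, src_phonetic_y)

    # Cross entropy on the domain labels: source = 0, target = 1.
    domain_logits = torch.cat([src_domain, tgt_domain], dim=0)
    domain_labels = torch.cat([
        torch.zeros(src_domain.size(0), dtype=torch.long, device=src_domain.device),
        torch.ones(tgt_domain.size(0), dtype=torch.long, device=tgt_domain.device),
    ])
    loss_domain = F.cross_entropy(domain_logits, domain_labels)

    loss = loss_phonetic + loss_domain
    loss.backward()    # the gradient reversal layer flips the domain gradient for the trunk
    optimizer.step()
    return loss.item()
```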
For target domain audio without phonetic feature labels, there is often only a "domain" label and no other content label. Whether to pass such audio to the upper-branch phonetic feature extraction layer can therefore be decided by checking whether the input audio carries a phonetic feature label.
In this way, training data containing a large amount of source domain audio and a large amount of target domain audio can be used for domain adversarial training of the lower-branch domain classification layer. Ultimately the discriminator in the adversarial network cannot distinguish which domain the input audio belongs to, so the extracted features no longer carry real domain characteristics; the domain information is blurred.
Step S6050: training a preset phoneme recognition model to obtain a phoneme recognition model 33;
The preset phoneme recognition model is the phoneme recognition model to be trained, with its model algorithm and initial network parameter values preset; after training, the practically applicable phoneme recognition model 33 shown in fig. 3 is obtained. In the unsupervised case, the source domain audio processed in step S6020 is input into the sound spectrum feature extraction model 31 to extract spectral features, the phonetic features are obtained through the phonetic feature extraction model 32, the two are spliced and input into the phoneme recognition model 33, and the phoneme recognition model 33 is trained, iterating many times until convergence.
In the semi-supervised case, the preset phoneme recognition model can first be trained with the source domain audio and then tuned with the target domain audio. The phoneme recognition model 33 obtained after such training has better applicability in the target domain and produces more accurate recognition results.
When the phoneme recognition model 33 obtained with this training data and training process is used for phoneme recognition, the influence of differing personal characteristics and differing domains is reduced, and the model is easier to migrate from the source domain to the target domain.
If the pronunciation system may also change when migrating from the source domain to the target domain, the source domain audio can first be used to train the preset phoneme recognition model to obtain a basic phoneme recognition model.
Then the source-domain comparison table of pronunciation phonetic labels and phonemes (a portion of which is shown in fig. 7) is replaced by the corresponding table of the target domain, and training continues with the target domain audio.
The pronunciation system mainly refers to the language system, such as Mandarin, Cantonese, Japanese, Korean, or English. Different languages have different pronunciation rules, so their comparison tables of pronunciation phonetic labels and phonemes differ.
The phoneme recognition model 33 trained in this way can therefore be applied not only to domains with the same pronunciation system, but also to a target domain whose pronunciation system differs from that of the source domain.
It should be noted that the embodiment shown in fig. 3 is applied in a speech recognition scenario whose purpose is phoneme recognition. Since a phoneme recognition model generally takes spectral features as input, in this embodiment the phonetic feature extraction model 32 and the phoneme recognition model can share the sound spectrum feature extraction model 31 as their input stage, which simplifies the processing and saves computation.
In other application scenarios or other embodiments, for example when the speech recognition method of the present disclosure is applied in a speech recognition scenario whose purpose is intent recognition, the spectral feature extraction module may instead be embedded in the phonetic feature extraction model, and the input voice data may be split into two paths fed to the phonetic feature extraction module and the intent recognition module, respectively.
In addition, in the present embodiment, when the phonetic feature extraction model 32 and the phoneme recognition model 33 are trained, the source domain voice data and the small amount of target domain voice data are labeled with both the phonetic features and the phoneme labels (the speech recognition target labels), so that the two models fit together better; for this reason, no distinction is made between first and second source domain voice data, or between first and second target domain voice data.
In practice, however, different training data sets, each labeled with phonetic and spectral features, may be used, particularly if the existing data does not meet the above conditions.
For the target domain voice data, it is preferable to use real data from the target domain; if little real target domain data is available, simulated training data imitating it can be generated according to the characteristics of the target domain, although accuracy and applicability may then suffer.
Further, the present disclosure also provides a speech recognition apparatus. As shown in fig. 8, the apparatus 80 includes: a target domain voice data acquisition module 801 for acquiring target domain voice data; a phonetic feature extraction module 802 for extracting phonetic features of the target domain voice data using a domain-adversarial phonetic feature extraction model, where the model includes a phonetic feature extraction part and a domain classification part, and the domain classification part introduces adversarial training through a gradient reversal layer to blur the domain characteristics of the phonetic features; a speech recognition target feature extraction module 803 for extracting speech recognition target features of the target domain voice data; and a speech recognition result determination module 804 for determining a speech recognition result of the target domain voice data according to the phonetic features and the speech recognition target features.
According to another embodiment of the present disclosure, the apparatus 80 further includes: a first source domain voice data acquisition module for acquiring first source domain voice data labeled with phonetic feature labels; a first non-source domain voice data acquisition module for acquiring first non-source domain voice data; and a phonetic feature extraction model training module for training a preset domain-adversarial phonetic feature extraction model using the first source domain voice data and the first non-source domain voice data to obtain the domain-adversarial phonetic feature extraction model.
According to another embodiment of the present disclosure, the phonetic feature extraction model training module is further configured to: set the domain label of the first source domain voice data to a first domain value representing the source domain and input it to the phonetic feature extraction part and the domain classification part of the preset domain-adversarial phonetic feature extraction model; set the domain label of the first non-source domain voice data to a corresponding domain value representing a non-source domain and input it to the domain classification part of the preset domain-adversarial phonetic feature extraction model; and update the network parameters of the preset domain-adversarial phonetic feature extraction model according to its output result.
According to another embodiment of the present disclosure, the first non-source domain voice data further includes non-source domain voice data labeled with phonetic labels; accordingly, the phonetic feature extraction model training module is further configured to: input the first non-source domain voice data labeled with phonetic labels into the phonetic feature extraction part of the preset domain-adversarial phonetic feature extraction model, and update the network parameters of the domain-adversarial phonetic feature extraction model based on the output result of the phonetic feature extraction part.
According to another embodiment of the present disclosure, the speech recognition result determination module 804 includes: a model input vector synthesis module for synthesizing a model input vector from the phonetic features and the speech recognition target features; a speech recognition module for inputting the model input vector into a speech recognition model to obtain a model output result; and a voice recognition result determination module for determining the speech recognition result of the target domain voice data according to the model output result.
According to another embodiment of the present disclosure, the apparatus 80 further includes: a second source domain voice data acquisition module for acquiring second source domain voice data labeled with a first speech recognition target label; and a speech recognition model training module for training a preset speech recognition model using the second source domain voice data to obtain the speech recognition model.
According to another embodiment of the present disclosure, the apparatus 80 further includes: a second non-source domain voice data acquisition module for acquiring second non-source domain voice data labeled with a second speech recognition target label; and the speech recognition model training module is further used for training the preset speech recognition model using the second non-source domain voice data and the second source domain voice data.
In the technical solution of the present disclosure, the acquisition, storage, and use of the personal information of the users involved comply with relevant laws and regulations and do not violate public order and good custom.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 901 performs the methods and processes described above, such as the speech recognition method. For example, in some embodiments, the speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the speech recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the speech recognition method.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (21)

1. A method of speech recognition, comprising:
acquiring target domain voice data;
extracting phonetic features of the target domain voice data using a domain-adversarial phonetic feature extraction model,
wherein the domain-adversarial phonetic feature extraction model comprises a phonetic feature extraction part and a domain classification part, and the domain classification part introduces adversarial training through a gradient reversal layer to blur the domain characteristics of the phonetic features;
extracting speech recognition target features of the target domain voice data;
and determining a speech recognition result of the target domain voice data according to the phonetic features and the speech recognition target features.
2. The method of claim 1, further comprising:
acquiring first source domain voice data, wherein the first source domain voice data is marked with a phonetic feature label;
acquiring first non-source domain voice data;
and performing domain adversarial training on a preset domain-adversarial phonetic feature extraction model using the first source domain voice data and the first non-source domain voice data to obtain the domain-adversarial phonetic feature extraction model.
3. The method of claim 2, wherein the domain adversarial training of the preset domain-adversarial phonetic feature extraction model using the first source domain voice data and the first non-source domain voice data comprises:
setting a domain label of the first source domain voice data to a first domain value representing the source domain, and inputting the first source domain voice data into a phonetic feature extraction part and a domain classification part of the preset domain-adversarial phonetic feature extraction model;
setting a domain label of the first non-source domain voice data to a second domain value representing a non-source domain, and inputting the first non-source domain voice data into the domain classification part of the preset domain-adversarial phonetic feature extraction model;
and updating the network parameters of the preset domain-adversarial phonetic feature extraction model according to the output result of the preset domain-adversarial phonetic feature extraction model.
4. The method of claim 3, wherein the first non-source domain voice data further comprises non-source domain voice data labeled with a phonetic label, and accordingly the method further comprises:
inputting the first non-source domain voice data labeled with the phonetic label into the phonetic feature extraction part of the preset domain-adversarial phonetic feature extraction model;
and updating the network parameters of the preset domain-adversarial phonetic feature extraction model based on the output result of the phonetic feature extraction part.
5. The method of claim 1, wherein determining the speech recognition result of the target domain voice data according to the phonetic features and the speech recognition target features comprises:
synthesizing a model input vector according to the phonetic features and the speech recognition target features;
inputting the model input vector into a voice recognition model to obtain a model output result;
and determining a voice recognition result of the target voice data according to the model output result.
6. The method of claim 5, further comprising:
acquiring second source domain voice data, wherein the second source domain voice data is marked with a first voice recognition target label;
and training a preset voice recognition model by using the second source domain voice data to obtain the voice recognition model.
7. The method of claim 6, further comprising:
acquiring second non-source domain voice data, wherein the second non-source domain voice data is marked with a second voice recognition target label;
correspondingly, the training of the preset speech recognition model by using the second source domain speech data includes:
training the preset speech recognition model using the second non-source domain speech data and the second source domain speech data.
8. The method of any of claims 2-4, 7, wherein the non-source domain voice data is target domain voice data.
9. The method of any one of claims 1, 5, 8, wherein the target domain comprises a target domain having a pronunciation system different from that of the source domain.
10. The method according to any of claims 1, 5-7, wherein the speech recognition target features are features for phoneme recognition.
11. The method of claim 10, wherein the speech recognition target feature is a sound spectrum feature.
12. An apparatus for speech recognition, comprising:
the target domain voice data acquisition module is used for acquiring target domain voice data;
a phonetic feature extraction module for extracting phonetic features of the target domain voice data using a domain-adversarial phonetic feature extraction model, the domain-adversarial phonetic feature extraction model comprising a phonetic feature extraction part and a domain classification part, the domain classification part introducing adversarial training through a gradient reversal layer to blur the domain characteristics of the phonetic features;
the voice recognition target feature extraction module is used for extracting voice recognition target features of the target domain voice data;
and the voice recognition result determining module is used for determining the voice recognition result of the target voice data according to the phonetic features and the voice recognition target features.
13. The apparatus of claim 12, further comprising:
a first source domain voice data acquisition module for acquiring first source domain voice data, wherein the first source domain voice data is labeled with a phonetic feature label;
the first non-source domain voice data acquisition module is used for acquiring first non-source domain voice data;
and the phonetic feature extraction model training module is used for training a preset phonetic feature extraction model of the field confrontation by using the first source domain voice data and the first non-source domain voice data to obtain the phonetic feature extraction model of the field confrontation.
14. The apparatus of claim 13, wherein the phonetic feature extraction model training module is further configured to:
set a domain label of the first source domain voice data to a first domain value representing the source domain, and input the first source domain voice data into the phonetic feature extraction part and the domain classification part of the preset domain confrontation phonetic feature extraction model;
set a domain label of the first non-source domain voice data to a second domain value representing a non-source domain, and input the first non-source domain voice data into the domain classification part of the preset domain confrontation phonetic feature extraction model;
and update the network parameters of the preset domain confrontation phonetic feature extraction model according to the output result of the preset domain confrontation phonetic feature extraction model.
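Claim 14's training step can be read as one joint update: the first source domain voice data (domain value 1, carrying phonetic feature labels) contributes both a phonetic loss and a domain loss, while the first non-source domain voice data (domain value 0) contributes only a domain loss, and the adversarial effect comes from reversing the gradient of the domain branch before it reaches the extractor. A minimal sketch under those assumptions; the stand-in networks, the 0/1 domain encoding, the equal loss weighting, and the optimizer settings are all hypothetical.

import torch
from torch import nn

extractor = nn.Sequential(nn.Linear(80, 256), nn.ReLU())   # phonetic feature extraction part
phone_head = nn.Linear(256, 100)                            # phonetic-label classifier (source labels only)
domain_head = nn.Linear(256, 2)                             # domain classification part
params = [*extractor.parameters(), *phone_head.parameters(), *domain_head.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)
ce = nn.CrossEntropyLoss()
grl_lambda = 1.0

def train_step(src_x, src_phone_y, nonsrc_x):
    # Domain labels: 1 = source domain ("first domain value"), 0 = non-source ("second domain value").
    src_dom_y = torch.ones(src_x.size(0), dtype=torch.long)
    nonsrc_dom_y = torch.zeros(nonsrc_x.size(0), dtype=torch.long)

    src_feats = extractor(src_x)
    nonsrc_feats = extractor(nonsrc_x)

    # Phonetic loss is computed on source domain data only.
    phone_loss = ce(phone_head(src_feats), src_phone_y)

    # Domain loss on both domains, behind a gradient reversal implemented as a backward hook.
    all_feats = torch.cat([src_feats, nonsrc_feats]).clone()
    all_feats.register_hook(lambda g: -grl_lambda * g)  # reverse the gradient flowing into the extractor
    domain_loss = ce(domain_head(all_feats), torch.cat([src_dom_y, nonsrc_dom_y]))

    loss = phone_loss + domain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on random stand-in batches (80-dim frame features, 100 phonetic classes).
train_step(torch.randn(8, 80), torch.randint(0, 100, (8,)), torch.randn(8, 80))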
15. The apparatus of claim 14, wherein the first non-source domain voice data further comprises non-source domain voice data labeled with a phonetic feature label, and accordingly the phonetic feature extraction model training module is further configured to:
input the first non-source domain voice data labeled with the phonetic feature label into the phonetic feature extraction part of the preset domain confrontation phonetic feature extraction model;
and update the network parameters of the preset domain confrontation phonetic feature extraction model based on the output result of the phonetic feature extraction part.
16. The apparatus of claim 12, wherein the speech recognition result determination module comprises:
a model input vector synthesis module configured to synthesize a model input vector according to the phonetic features and the speech recognition purpose features;
a speech recognition module configured to input the model input vector into a speech recognition model to obtain a model output result;
and a speech recognition result determination submodule configured to determine the speech recognition result of the target voice data according to the model output result.
17. The apparatus of claim 16, further comprising:
a second source domain voice data acquisition module configured to acquire second source domain voice data, wherein the second source domain voice data is labeled with a first speech recognition purpose label;
and a speech recognition model training module configured to train a preset speech recognition model by using the second source domain voice data to obtain the speech recognition model.
18. The apparatus of claim 17, further comprising:
a second non-source domain voice data acquisition module configured to acquire second non-source domain voice data, wherein the second non-source domain voice data is labeled with a second speech recognition purpose label;
wherein the speech recognition model training module is further configured to train the preset speech recognition model by using the second non-source domain voice data and the second source domain voice data.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-11.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-11.
CN202111296840.3A 2021-11-03 2021-11-03 Voice recognition method, device, equipment and storage medium Pending CN113920987A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111296840.3A CN113920987A (en) 2021-11-03 2021-11-03 Voice recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111296840.3A CN113920987A (en) 2021-11-03 2021-11-03 Voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113920987A (en) 2022-01-11

Family

ID=79245100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111296840.3A Pending CN113920987A (en) 2021-11-03 2021-11-03 Voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113920987A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116884392A (en) * 2023-09-04 2023-10-13 浙江鑫淼通讯有限责任公司 Voice emotion recognition method based on data analysis
CN116884392B (en) * 2023-09-04 2023-11-21 浙江鑫淼通讯有限责任公司 Voice emotion recognition method based on data analysis

Similar Documents

Publication Publication Date Title
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
US10210861B1 (en) Conversational agent pipeline trained on synthetic data
JP7266683B2 (en) Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction
CN112771607A (en) Electronic device and control method thereof
CN114416934B (en) Multi-modal dialog generation model training method and device and electronic equipment
CN112397056B (en) Voice evaluation method and computer storage medium
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN109697978B (en) Method and apparatus for generating a model
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
CN113450759A (en) Voice generation method, device, electronic equipment and storage medium
CN114360557A (en) Voice tone conversion method, model training method, device, equipment and medium
CN114171002A (en) Voice recognition method and device, electronic equipment and storage medium
CN113160820A (en) Speech recognition method, and training method, device and equipment of speech recognition model
CN113920987A (en) Voice recognition method, device, equipment and storage medium
JP7314450B2 (en) Speech synthesis method, device, equipment, and computer storage medium
CN114999463B (en) Voice recognition method, device, equipment and medium
EP4024393A2 (en) Training a speech recognition model
CN114898734A (en) Pre-training method and device based on speech synthesis model and electronic equipment
JP2020134719A (en) Translation device, translation method, and translation program
US20230081543A1 (en) Method for synthetizing speech and electronic device
CN117275458B (en) Speech generation method, device and equipment for intelligent customer service and storage medium
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium
CN113327596B (en) Training method of voice recognition model, voice recognition method and device
CN114078475B (en) Speech recognition and updating method, device, equipment and storage medium
CN113793598B (en) Training method of voice processing model, data enhancement method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination