CN114171002A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN114171002A
Authority
CN
China
Prior art keywords
language
voice
speech
features
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111550980.9A
Other languages
Chinese (zh)
Inventor
祁鹏 (Qi Peng)
许丽 (Xu Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority: CN202111550980.9A
Publication: CN114171002A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L17/24 Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase

Abstract

The invention provides a speech recognition method and apparatus, an electronic device, and a storage medium. The method comprises: performing language identification on speech to be recognized to obtain a language feature of the speech to be recognized; and performing speech decoding on the coding features of the speech to be recognized based on the language feature, to obtain recognition texts of the speech in both the speech language and a preset language, where the speech language is the language indicated by the language feature. Because decoding in the speech language and decoding in the preset language run in parallel, no translation step is needed on top of the speech-language recognition text; this effectively improves the accuracy of the preset-language recognition text and shortens the response time of speech recognition. Decoding in the two languages shares the coding features of the speech to be recognized, i.e., a unified modeling approach, which makes deployment more flexible and effectively reduces deployment and maintenance costs.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
Speech recognition, as one of the important interfaces for human-computer interaction, brings a more convenient experience to users and lowers the threshold of interaction between humans and machines. However, the complexity of languages and differences in accents still reduce recognition accuracy, which degrades the actual user experience.
To address this, current practice provides an independent speech recognition system for each language, but the user must actively select the system for the corresponding language. In particular, when the user unconsciously mixes several languages during interaction, per-language systems cannot recognize the speech accurately. Even when the user cooperates and the correct text in the corresponding language is recognized, people who do not know that language still cannot directly understand the meaning of the speech and must rely on a translation system, which greatly reduces interaction efficiency.
Disclosure of Invention
The invention provides a speech recognition method and apparatus, an electronic device, and a storage medium, to solve the problems that speech recognition requires the language to be selected manually in advance and that interaction efficiency suffers because the recognized text must be translated.
The invention provides a voice recognition method, which comprises the following steps:
performing language identification on a voice to be recognized to obtain language characteristics of the voice to be recognized;
and performing voice decoding on the coding features of the voice to be recognized based on the language features to obtain recognition texts of the voice to be recognized respectively in a voice language and a preset language, wherein the voice language is the language indicated by the language features.
According to a speech recognition method provided by the present invention, the performing speech decoding on the coding feature of the speech to be recognized based on the language feature to obtain recognition texts of the speech to be recognized in a speech language and a preset language respectively includes:
based on the language features, performing voice decoding on the coding features in the voice language to obtain decoding features and a recognition text of the voice to be recognized in the voice language;
and based on the decoding characteristics and the language characteristics, or based on the language characteristics, performing voice decoding on the coding characteristics in the preset language to obtain a recognition text of the voice to be recognized in the preset language.
According to a speech recognition method provided by the present invention, the language recognition of the speech to be recognized to obtain the language features of the speech to be recognized includes:
extracting acoustic features of the voice to be recognized to obtain the acoustic features of the voice to be recognized;
based on the acoustic features, performing language recognition on the voice to be recognized to obtain language features of the voice to be recognized;
and carrying out voice recognition coding on the voice to be recognized based on the acoustic features to obtain the coding features of the voice to be recognized.
According to a speech recognition method provided by the present invention, the performing speech recognition coding on the speech to be recognized based on the acoustic feature to obtain the coding feature of the speech to be recognized includes:
and carrying out voice recognition coding on the voice to be recognized based on the acoustic features and the language features to obtain the coding features of the voice to be recognized.
According to a speech recognition method provided by the present invention, the performing language recognition on the speech to be recognized based on the acoustic feature to obtain the language feature of the speech to be recognized includes:
inputting the acoustic features into a language identification model, performing language feature extraction by the language identification model based on the acoustic features, and performing language identification based on the extracted language features to obtain the language of the voice to be identified output by the language identification model;
the language identification model is obtained based on acoustic feature training of sample voice under each sample language.
According to a speech recognition method provided by the present invention, the performing speech decoding on the coding feature of the speech to be recognized based on the language feature to obtain recognition texts of the speech to be recognized in a speech language and a preset language respectively includes:
inputting the acoustic features and language features of the voice to be recognized into a voice recognition model, performing voice recognition coding on the voice to be recognized by the voice recognition model based on the acoustic features and the language features or based on the acoustic features, and performing voice decoding based on the language features and coding features obtained by coding to obtain recognition texts of the voice to be recognized output by the voice recognition model under the voice language and a preset language respectively;
the speech recognition model is obtained by training the difference between a sample text and a predicted text of sample speech in a speech language and a preset language, and the predicted text is determined by the trained speech recognition model based on the acoustic features and the language features of the sample speech.
According to the speech recognition method provided by the invention, the loss function of the speech recognition model is a weighted sum of a speech-language loss and a preset-language loss;
the speech-language loss is determined from the difference between the sample text and the predicted text of the sample speech in the speech language, and the preset-language loss is determined from the difference between the sample text and the predicted text of the sample speech in the preset language.
The present invention also provides a speech recognition apparatus comprising:
the language identification unit is used for carrying out language identification on the voice to be identified to obtain the language characteristics of the voice to be identified;
and the voice decoding unit is used for carrying out voice decoding on the coding characteristics of the voice to be recognized based on the language characteristics to obtain recognition texts of the voice to be recognized respectively under the voice language and a preset language, wherein the voice language is the language indicated by the language characteristics.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of any one of the voice recognition methods.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the speech recognition method as described in any of the above.
The speech recognition method and apparatus, electronic device, and storage medium provided by the invention perform speech decoding of the coding features of the speech to be recognized in both the speech language and the preset language based on the language feature, thereby obtaining bilingual recognition output. Because decoding in the two languages runs in parallel, no translation step is needed on top of the speech-language recognition text, which effectively improves the accuracy of the preset-language recognition text and shortens the response time of speech recognition. Decoding in the speech language and in the preset language shares the coding features of the speech to be recognized, i.e., a unified modeling approach, which makes deployment more flexible and effectively reduces deployment and maintenance cost.
Drawings
To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating a speech recognition method provided by the present invention;
FIG. 2 is a flow chart illustrating step 120 of the speech recognition method provided by the present invention;
FIG. 3 is a flow chart illustrating step 110 of the speech recognition method provided by the present invention;
FIG. 4 is a flow chart of a speech recognition method provided by the present invention;
FIG. 5 is a schematic diagram of a voice recognition apparatus according to the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the invention.
In the related art, multi-language speech recognition proceeds as follows: speech data is received and fed into the speech recognition system for the language type specified by the user; if the recognized text is not in a language the user understands, it is further translated into one that is. For example, if the speech to be recognized is Cantonese and the user selected Cantonese before inputting the speech, the recognized text is Cantonese text; but if the user cannot read Cantonese, the recognized text still cannot be read directly, and the system must also translate the Cantonese text into Mandarin text. This scheme has the following problems:
1) Cascading speech recognition and text translation causes errors from speech recognition to accumulate further in translation, degrading the translation result. For example, if the user specifies Cantonese, the Cantonese speech recognition model has a character accuracy of 90%, and the Cantonese-to-Mandarin translation has an accuracy of 90%, then the end-to-end accuracy of recognizing Mandarin text from Cantonese audio is only 90% × 90% = 81%.
2) Because speech recognition and text translation are cascaded, the overall response time is the recognition response time plus the translation response time; this combination undoubtedly increases the user's waiting time and greatly degrades the user experience.
3) Cascading speech recognition and text translation occupies more hardware resources, which greatly increases deployment cost and later maintenance cost.
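The error-accumulation arithmetic in problem 1) can be checked directly (the 90% figures are the illustrative values from the example above, not measured accuracies):

```python
# Cascaded pipeline: recognition errors compound with translation errors.
asr_accuracy = 0.90          # Cantonese character accuracy of the ASR model
translation_accuracy = 0.90  # Cantonese-to-Mandarin translation accuracy

cascaded_accuracy = asr_accuracy * translation_accuracy
print(f"{cascaded_accuracy:.2f}")  # 0.81, i.e. end-to-end Mandarin accuracy
```

By contrast, the parallel decoding proposed below produces the preset-language text directly, so its accuracy is not multiplied by a separate translation stage.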
In view of the above problems, an embodiment of the present invention provides a speech recognition method. Fig. 1 is a flow diagram of the speech recognition method provided by the invention. As shown in Fig. 1, the method includes:
Step 110: perform language identification on the speech to be recognized to obtain the language feature of the speech to be recognized.
Specifically, the speech to be recognized is the speech data on which recognition is performed. It may be captured by a sound pickup device such as a smartphone, a tablet computer, or a smart appliance such as a speaker, a television, or an air conditioner. After the speech is picked up by a microphone array, the device may amplify it and suppress noise. In addition, the speech to be recognized may be a complete speech segment formed after pickup ends, or a speech stream during real-time pickup.
To spare the user the operation of specifying the language of the speech, and thereby avoid cumbersome operation and the confusion that arises when the user forgets to select a language or selects the wrong one, the embodiment of the invention performs language identification on the speech to be recognized, obtaining a language feature that reflects its language information. The language feature may be a coded representation of the language type of the speech to be recognized, or a vector representation reflecting how the speech expresses that language type; the embodiment of the invention does not specifically limit this.
Language identification may be performed using the acoustic features of the speech to be recognized, for example by determining the corresponding language with a pre-trained language identification model; the language feature of the speech to be recognized may then be the output of any layer of that model during identification.
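As a minimal sketch of this step: a toy classifier maps acoustic features to a language label while an intermediate activation serves as the reusable "language feature". The layer sizes, the candidate-language list, and the choice of the hidden layer as the language feature are illustrative assumptions, not details fixed by the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy language-identification model: one hidden layer over pooled acoustic
# features, followed by a softmax over candidate languages.
LANGS = ["cantonese", "hokkien", "shanghainese", "mandarin"]
W1 = rng.standard_normal((40, 16))   # 40-dim acoustic frame -> 16-dim hidden
W2 = rng.standard_normal((16, len(LANGS)))

def identify_language(acoustic_frames):
    """acoustic_frames: (num_frames, 40) array, e.g. MFCC features."""
    pooled = acoustic_frames.mean(axis=0)   # utterance-level pooling
    hidden = np.tanh(pooled @ W1)           # intermediate representation
    logits = hidden @ W2
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # The hidden activation doubles as the "language feature" passed
    # downstream; the argmax is the identified speech language.
    return LANGS[int(probs.argmax())], hidden

lang, lang_feature = identify_language(rng.standard_normal((120, 40)))
```

Taking an internal layer rather than the one-hot output gives downstream decoding a richer signal than the bare language label.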
Step 120: perform speech decoding on the coding features of the speech to be recognized based on the language feature, to obtain recognition texts of the speech to be recognized in the speech language and in a preset language respectively, the speech language being the language indicated by the language feature.
Considering that the language of the speech to be recognized and the language the user understands may differ, completing recognition first and translating afterwards causes problems in accuracy, response latency, and maintenance cost. For this situation, the embodiment of the invention directly applies the coding features of the speech to be recognized to perform parallel speech decoding in the speech language and in the preset language, obtaining recognition texts of the speech in each.
The speech recognition process can generally be divided into an encoding stage and a decoding stage: encoding maps the speech to be recognized to its coding features, and decoding maps those coding features to recognition text. Both stages may be implemented by a Transformer or another model in encoder-decoder form. The resulting coding features reflect the semantics, i.e. the spoken content, of the speech to be recognized.
Specifically, during speech decoding, the coding features can be split into two branches for parallel decoding: one branch applies the coding features for decoding in the speech language, and the other applies them for decoding in the preset language. Here the speech language, the language indicated by the language feature, is the language of the speech to be recognized itself, while the preset language is a pre-configured language, generally one the user understands or needs. The two branches share the coding features from the encoding stage, reducing the computing resources consumed by bilingual output.
In both decoding branches, the language feature can be combined with the coding features. In the speech-language branch, the language feature steers decoding toward the expression style of the speech language; in the preset-language branch, it helps decoding migrate from the expression style of the speech language to that of the preset language, yielding more reliable and accurate bilingual output.
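The two-branch structure can be sketched as follows. The fusion by concatenation, the toy vocabularies, and the single-token greedy decoders are illustrative assumptions; the patent leaves the fusion method and decoder internals open:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB_NATIVE = ["<eos>", "佢", "喺", "度"]   # toy speech-language vocabulary
VOCAB_PRESET = ["<eos>", "他", "在", "这"]   # toy preset-language vocabulary

def fuse(lang_feature, coding_features):
    # One possible fusion: broadcast the language feature to every frame
    # and concatenate it with the coding features.
    tiled = np.tile(lang_feature, (coding_features.shape[0], 1))
    return np.concatenate([coding_features, tiled], axis=1)

def decode(features, vocab, W):
    # Greedy toy decoder: one output token from a pooled projection.
    logits = features.mean(axis=0) @ W
    return vocab[int(logits.argmax())]

coding = rng.standard_normal((120, 32))   # shared encoder output
lang_feat = rng.standard_normal(16)
fused = fuse(lang_feat, coding)           # shared by BOTH branches

W_native = rng.standard_normal((48, len(VOCAB_NATIVE)))
W_preset = rng.standard_normal((48, len(VOCAB_PRESET)))
text_native = decode(fused, VOCAB_NATIVE, W_native)  # speech-language branch
text_preset = decode(fused, VOCAB_PRESET, W_preset)  # preset-language branch
```

Both branches consume the same `fused` tensor, which is the point of the shared-encoding design: the expensive encoding work is done once.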
It should be noted that the speech language and the preset language may both belong to the same country or region. For example, speech recognition may support various Chinese dialects, such as Cantonese, Southern Min, Wu, and Shanghainese; all of these are then possible speech languages, with Mandarin as the preset language. The speech language and preset language may also span different countries and regions: for example, speech recognition may support Mandarin, Japanese, Russian, English, and so on as possible speech languages, with Mandarin, English, or another language designated by the user as the preset language, thereby ensuring the flexibility of speech recognition.
The method provided by the embodiment of the invention performs speech decoding of the coding features of the speech to be recognized in both the speech language and the preset language based on the language feature, thereby obtaining bilingual recognition output. Because decoding in the two languages runs in parallel, no translation step is needed on top of the speech-language recognition text, which effectively improves the accuracy of the preset-language recognition text and shortens the response time of speech recognition. Decoding in the speech language and in the preset language shares the coding features of the speech to be recognized, i.e., a unified modeling approach, which makes deployment more flexible and effectively reduces deployment and maintenance cost.
Based on the above embodiment, fig. 2 is a schematic flowchart of step 120 in the speech recognition method provided by the present invention, and as shown in fig. 2, step 120 includes:
step 121, based on the language features, performing speech decoding on the coding features in the speech language to obtain decoding features and a recognition text of the speech to be recognized in the speech language;
and step 122, based on the decoding characteristics and the language characteristics, or based on the language characteristics, performing speech decoding in the preset language on the coding characteristics to obtain a recognition text of the speech to be recognized in the preset language.
Specifically, in the bilingual decoding process that shares the coding features of the speech to be recognized, the speech decoding for the speech language and the speech decoding for the preset language may be divided to be described:
the speech decoding aiming at the speech language can be realized based on language characteristics and coding characteristics, wherein the language characteristics can guide the speech decoding aiming at the coding characteristics to be executed towards a direction more fitting the expression form of the speech language. Specifically, during decoding, language features and coding features may be fused first, and then the features obtained by fusion are applied to a speech decoding process of a speech language, in which decoding features are generated, and a final decoded text, that is, an identification text in the speech language, is output based on the decoding features. The decoding feature here may comprise an encoded representation of the output character at the respective decoding instant or a semantic representation of the output character at the respective decoding instant, i.e. the decoding feature may reflect the decoding result of the speech decoding in the speech language.
The speech decoding aiming at the preset language can be realized based on language features and coding features, wherein the language features can assist the speech decoding to migrate from the expression mode of the speech language to the expression mode of the preset language, specifically, during the decoding, the language features and the coding features can be firstly fused, then the features obtained through the fusion are applied to the speech decoding process of the preset language, and the recognition text under the preset language is obtained through the speech decoding. At this time, the speech decoding of the speech language and the preset language are independent and independent from each other.
In addition, for the decoding of the speech of the preset language, the decoding characteristics obtained by decoding the speech of the preset language can be further combined on the basis of the language characteristics and the coding characteristics, wherein the decoding characteristics for the speech language can provide reference for the decoding of the speech of the preset language, so that the decoding of the speech of the preset language can learn the decoding result of the speech language, and can better migrate from the expression mode of the speech language to the expression mode of the preset language with the aid of the language characteristics. Specifically, during decoding, the decoding feature, the language feature and the encoding feature may be fused, and then the feature obtained by fusion is applied to a speech decoding process of a preset language, so as to obtain an identification text in the preset language through speech decoding. At this time, the speech decoding of the preset language refers to the result of the speech decoding of the speech language, and compared with a completely independent decoding mode, the reliability of the text to be recognized obtained by decoding is further improved.
It should be noted that, the above-mentioned fusion manner of the language feature and the coding feature, or the fusion manner of the decoding feature, the language feature and the coding feature may be to directly splice the features, or to add or sum the features, and the like, and this is not limited in this embodiment of the present invention.
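The fusion options just listed can be shown concretely; the dimensions and the 0.7/0.3 weights are illustrative, and a real model would learn or tune them:

```python
import numpy as np

lang = np.ones(4)                 # language feature
enc = np.arange(4, dtype=float)   # one frame of coding features (same dim)

# Option 1: concatenation, preserves both features but doubles the dimension.
fused_concat = np.concatenate([enc, lang])

# Option 2: element-wise addition, keeps the dimension unchanged.
fused_add = enc + lang

# Option 3: weighted summation, i.e. addition with chosen or learned weights.
w_enc, w_lang = 0.7, 0.3
fused_weighted = w_enc * enc + w_lang * lang

print(fused_concat.shape, fused_add, fused_weighted)
```

Concatenation requires the downstream layer to accept a wider input, while addition and weighted summation require the two features to share a dimension, which is the practical trade-off between the options.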
Based on the above embodiment, fig. 3 is a schematic flowchart of step 110 in the speech recognition method provided by the present invention, and as shown in fig. 3, step 110 includes:
and step 111, extracting acoustic features of the voice to be recognized to obtain the acoustic features of the voice to be recognized.
Here, the extraction of the acoustic features may be performed by performing frame windowing on the speech to be recognized, and then performing pre-emphasis on each frame, and then extracting the acoustic features of each frame or each combination of multiple frames through fast fourier transform FFT, where the acoustic features may be Mel Frequency Cepstral Coefficient (MFCC) features or Perceptual Linear Prediction (PLP) features, and the like.
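The front of that pipeline (pre-emphasis, framing, windowing, FFT magnitude) can be sketched as follows. The 25 ms frame length, 10 ms hop at 16 kHz, Hamming window, and 0.97 pre-emphasis coefficient are common illustrative choices, and the mel filterbank and cepstral steps of full MFCC extraction are omitted:

```python
import numpy as np

def frame_spectra(signal, frame_len=400, hop=160):
    """Split a waveform into overlapping frames, window each frame,
    and return per-frame FFT magnitude spectra."""
    # Pre-emphasis: boost high frequencies before spectral analysis.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    spectra = []
    for i in range(num_frames):
        frame = emphasized[i * hop : i * hop + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)

# 1 second of 16 kHz audio -> (num_frames, frame_len // 2 + 1) spectra
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = frame_spectra(sig)
```

Applying a mel filterbank, a log, and a discrete cosine transform to each row of `spec` would complete MFCC extraction.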
Step 112: perform language identification on the speech to be recognized based on the acoustic features to obtain the language feature of the speech to be recognized.
Step 113: perform speech recognition encoding on the speech to be recognized based on the acoustic features to obtain the coding features of the speech to be recognized.
Specifically, the acoustic features obtained in step 111 are applied on the one hand to language identification of the speech to be recognized, and on the other to its speech recognition encoding. Language identification may be performed before the encoding, or in parallel with it; that is, steps 112 and 113 may run concurrently or sequentially.
Language identification based on the acoustic features may specifically input the acoustic features of the speech to be recognized into a pre-trained language identification model, taking as the language feature either an intermediate feature generated while the model maps acoustic features to a language, or a feature representing the final result. The language identification model may be trained on the acoustic features of sample speech in each candidate language.
Speech recognition encoding based on the acoustic features may specifically input the acoustic features of the speech to be recognized into the pre-trained encoder of a speech recognition model, which encodes them into coding features that reflect the spoken content of the speech to be recognized. The speech recognition model may be trained on sample speech in each language together with its corresponding sample recognition text.
In the method of this embodiment, obtaining the language feature and the coding features both starts from the same acoustic features of the speech to be recognized; deploying only one acoustic-feature extraction model therefore provides the input needed by both the language identification task and the speech recognition task, reducing the deployment and maintenance cost of speech recognition and improving its response efficiency.
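The extract-once, use-twice data flow of steps 111 through 113 can be sketched as follows; every function here is a trivial stand-in for the corresponding model, since only the sharing of the acoustic features is the point:

```python
import numpy as np

def extract_acoustic(speech):           # step 111: shared front end
    return speech.reshape(-1, 40)       # stand-in for MFCC extraction

def identify_language(acoustic):        # step 112: language identification
    return acoustic.mean()              # stand-in for a language-ID model

def encode(acoustic, lang_feature):     # step 113: language-aware encoding
    return acoustic + lang_feature      # stand-in for a recognition encoder

speech = np.ones(400)
acoustic = extract_acoustic(speech)          # computed ONCE...
lang_feature = identify_language(acoustic)   # ...consumed by language ID...
coding = encode(acoustic, lang_feature)      # ...and by recognition encoding
```

If each task ran its own front end, the acoustic features would be computed twice per utterance; sharing them is what cuts the deployment footprint.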
Based on any of the above embodiments, step 113 includes:
and carrying out voice recognition coding on the voice to be recognized based on the acoustic features and the language features to obtain the coding features of the voice to be recognized.
Specifically, when speech recognition coding is performed, the language features may be combined with the acoustic features. The language features provide a reference for the coding, so that the process can better account for how the acoustic features express speech content in the given language, thereby improving the reliability and accuracy with which the coding features represent the speech content. Specifically, during encoding, the acoustic features and the language features may be fused, and the fused features applied in the speech recognition coding process to obtain the coding features of the speech to be recognized.
It should be noted that the above-mentioned fusion of the language features and the acoustic features may be performed by directly splicing (concatenating) the features, by adding them, or by computing a weighted sum; this is not specifically limited in this embodiment of the present invention.
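The two fusion options just mentioned can be sketched as follows. This is a toy illustration: the frame counts, dimensions, the per-frame tiling of the utterance-level language feature, and the trivial "projection" in the weighted-sum variant are all assumptions, not details from the patent.

```python
import numpy as np

def fuse_concat(acoustic, language):
    """Direct splicing: tile the utterance-level language feature per frame
    and concatenate along the feature axis, so dimensions add up."""
    lang_tiled = np.tile(language, (acoustic.shape[0], 1))
    return np.concatenate([acoustic, lang_tiled], axis=1)

def fuse_weighted(acoustic, language, w=0.7):
    """Weighted sum: requires equal dimensions; here we assume the language
    feature already matches the acoustic dimension."""
    return w * acoustic + (1 - w) * language

frames = np.ones((5, 8))        # 5 frames of acoustic features, dim 8
lang = np.full(8, 2.0)          # language feature, dim 8
print(fuse_concat(frames, lang).shape)    # (5, 16)
print(fuse_weighted(frames, lang)[0, 0])  # 0.7*1 + 0.3*2 = 1.3
```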
The method provided by this embodiment of the present invention refers to the language features when performing speech recognition coding, thereby improving the reliability and accuracy with which the coding features represent the speech content, and in turn the reliability and accuracy of the speech recognition itself.
Based on any of the above embodiments, step 112 includes:
inputting the acoustic features into a language identification model, performing language feature extraction by the language identification model based on the acoustic features, and performing language identification based on the extracted language features to obtain the language of the voice to be identified output by the language identification model;
the language identification model is obtained based on acoustic feature training of sample voice under each sample language.
Specifically, the language features are obtained through a language recognition model: the trained model distinguishes the language of the input acoustic features and outputs the language category to which the speech to be recognized belongs. In this process, the language recognition model must abstract, from the input acoustic features, language features that reflect the speech to be recognized, and then determine the language of the speech based on those features. Here, the language feature of the speech to be recognized may be the output of any layer of a multi-layer language recognition model.
Before step 112 is executed, the training of the language identification model needs to be completed in advance, and the training method specifically includes the following steps:
First, sample data is collected. The sample data needs to include the acoustic features of sample speech in each sample language; equivalently, it includes the acoustic features of each sample speech together with its language category. The sample languages are the languages to be supported in speech recognition; the sample data should include sample speech in at least two sample languages, and the number and categories of sample languages actually included may be determined according to the requirements of the speech recognition task.
Then, model training is performed on the constructed initial model using the sample data, thereby obtaining the language recognition model. The initial model may take the form of a Deep Recurrent Neural Network (DRNN) or a Deep Convolutional Neural Network (DCNN), which is not limited in this embodiment of the present invention. In addition, the initial model may be trained by error Back Propagation (BP).
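A minimal training loop of this kind can be sketched on toy data. For brevity this uses a single linear layer trained by gradient descent on the cross-entropy loss (the degenerate case of back-propagation), rather than a DRNN or DCNN; the data distribution, dimensions, learning rate, and iteration count are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample data: acoustic feature vectors for two "sample languages",
# drawn from well-separated Gaussians so a linear model suffices.
X = np.vstack([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
y = np.array([0] * 50 + [1] * 50)   # language category of each sample

W = np.zeros((8, 2))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # softmax posteriors
    grad = X.T @ (p - np.eye(2)[y]) / len(y)   # cross-entropy gradient
    W -= 0.5 * grad                            # gradient-descent update

acc = ((X @ W).argmax(axis=1) == y).mean()
print(acc)   # training accuracy on the separable toy set
```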
Based on any of the above embodiments, step 120 includes:
inputting the acoustic features and language features of the voice to be recognized into a voice recognition model, performing voice recognition coding on the voice to be recognized by the voice recognition model based on the acoustic features and the language features or based on the acoustic features, and performing voice decoding based on the language features and coding features obtained by coding to obtain recognition texts of the voice to be recognized output by the voice recognition model under the voice language and a preset language respectively;
the speech recognition model is obtained by training on the differences between the sample texts and the predicted texts of sample speech in the speech language and the preset language, where the predicted texts are determined by the speech recognition model in training based on the acoustic features and the language features of the sample speech.
Specifically, the speech recognition process with bilingual output can be realized through a speech recognition model comprising both encoding and decoding functions. In the encoding stage, the encoder in the speech recognition model performs speech recognition coding on the speech to be recognized based on the input acoustic features and language features, or based on the acoustic features alone, thereby obtaining coding features that express the speech content. In the decoding stage, the decoder in the speech recognition model performs speech decoding under two branches, one for the speech language and one for the preset language, based on the input language features and the coding features output by the encoder, thereby producing recognition texts in both the speech language and the preset language.
Before step 120 is executed, training of the speech recognition model needs to be completed in advance, and the training method specifically includes the following steps:
First, sample data is collected. The sample data needs to include the acoustic features and language features of sample speech in each sample language, together with the sample texts of that speech in the sample language and in the preset language. The sample languages are the languages to be supported in speech recognition; the sample data should include sample speech in at least two sample languages, and the number and categories of sample languages actually included may be determined according to the requirements of the speech recognition task.
Then, model training is performed on the constructed initial model using the sample data, thereby obtaining the speech recognition model. The initial model can be built by modifying a general encoder-decoder architecture such as the Transformer; for example, two decoders can be attached after the encoder as two decoding branches, one for the speech language and one for the preset language. Specifically, during model training, the speech recognition model in training, i.e., the initial model, decodes in the speech language and the preset language based on the acoustic features and language features of the input sample speech, thereby outputting predicted texts in both languages. The loss value of the current initial model is then determined by comparing the differences between the sample texts and the predicted texts in the speech language and the preset language, and the parameters of the initial model are iteratively updated until the trained speech recognition model is obtained. Here, a softmax can be applied in both decoding branches of the initial model to normalize the output results. The initial model may be trained by Stochastic Gradient Descent (SGD), which both accelerates convergence and helps prevent the model from falling into a local optimum.
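The one-encoder, two-decoder topology described above can be sketched at the level of tensor shapes. This is a structural illustration only, not a Transformer: the linear layers stand in for the real encoder and decoders, the weights are random, and all dimensions (16-dim fused input, 32-dim encoding, vocabularies of 100 and 120) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# One shared encoder feeding two decoding branches, one per output language.
ENC = rng.standard_normal((16, 32))          # fused feature dim -> encoding dim
DEC_NATIVE = rng.standard_normal((32, 100))  # encoding -> speech-language vocab
DEC_PRESET = rng.standard_normal((32, 120))  # encoding -> preset-language vocab

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

fused = rng.standard_normal((50, 16))        # 50 frames of fused features
encoding = np.tanh(fused @ ENC)              # shared coding features
native_probs = softmax(encoding @ DEC_NATIVE)  # speech-language branch
preset_probs = softmax(encoding @ DEC_PRESET)  # preset-language branch
print(native_probs.shape, preset_probs.shape)  # (50, 100) (50, 120)
```

Both branches consume the same `encoding`, which is the shared-coding-feature property the text emphasizes.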
Further, in the training process of the initial model, the acoustic features and the language features of the sample speech may be subjected to feature processing and fusion to obtain fusion features of the sample speech, where the fusion features have stronger distinctiveness than the acoustic features. On the basis, the fusion features can be further processed and then input into the initial model to obtain the predicted text in the speech language and the preset language.
Based on any of the above embodiments, the loss function of the speech recognition model is obtained by weighted summation of speech language loss and preset language loss;
the language loss of the voice is determined based on the difference between the sample text and the predicted text of the sample voice in the language of the voice, and the preset language loss is determined based on the difference between the sample text and the predicted text of the sample voice in the preset language.
Specifically, considering that the speech recognition model is a dual-output model, its performance is reflected both in recognition under the speech language and in recognition under the preset language. Therefore, when training the speech recognition model, the loss function can be divided into a speech language loss and a preset language loss, and the weighted sum of the two can serve as the final loss value for updating and iterating the model parameters, thereby realizing multi-objective joint training.
The speech language loss is used for representing the loss of speech language recognition, and may be specifically determined based on a difference between a sample text and a predicted text of a sample speech in a speech language, where the greater the difference between the sample text and the predicted text in the speech language, the greater the speech language loss, and the smaller the difference between the sample text and the predicted text in the speech language, the smaller the speech language loss.
The preset language loss is used for representing the loss of the preset language identification, and may be specifically determined based on the difference between the sample text and the predicted text of the sample speech in the preset language, where the larger the difference between the sample text and the predicted text in the preset language is, the larger the preset language loss is, and the smaller the difference between the sample text and the predicted text in the preset language is, the smaller the preset language loss is.
The loss function L of the speech recognition model can thus be expressed as:

L = a·L1 + (1 - a)·L2

where L1 is the speech language loss, L2 is the preset language loss, and a is a weight satisfying 0 ≤ a ≤ 1.
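The weighted-sum loss is a one-liner; the example values of the two branch losses and the weight are arbitrary.

```python
def joint_loss(l1, l2, a=0.6):
    """L = a*L1 + (1-a)*L2, with weight 0 <= a <= 1."""
    assert 0.0 <= a <= 1.0
    return a * l1 + (1 - a) * l2

# Example: equal weighting of a speech-language loss of 2.0
# and a preset-language loss of 4.0.
print(joint_loss(2.0, 4.0, a=0.5))  # 3.0
```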
Based on any of the above embodiments, fig. 4 is a schematic flow chart of the speech recognition method provided by the present invention, and as shown in fig. 4, the speech recognition may be implemented by the following steps:
First, the speech to be recognized is determined and taken as input for acoustic feature extraction, thereby obtaining the acoustic features of the speech to be recognized. The acoustic feature extraction can be realized by a feature extraction module. Specifically, to improve the distinctiveness of the acoustic features, the extracted spectral features may be transformed; for example, each frame of speech data together with several frames before and after it may be used as the input of a feature extraction model, and the output of the feature extraction model used as the transformed acoustic features.
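The "each frame plus several frames before and after" step is the standard frame-splicing operation; a minimal sketch follows. The edge-padding strategy (repeating the boundary frame) and the context width are assumptions, as the text does not specify them.

```python
def splice_frames(frames, context=2):
    """Concatenate each frame with `context` frames before and after it,
    padding at the edges by repeating the boundary frame."""
    padded = [frames[0]] * context + list(frames) + [frames[-1]] * context
    return [sum((padded[i + j] for j in range(2 * context + 1)), [])
            for i in range(len(frames))]

feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 frames, feature dim 2
spliced = splice_frames(feats, context=1)
print(len(spliced), len(spliced[0]))            # 3 frames, dim 2*(2*1+1) = 6
```

The spliced vectors would then be fed to the feature extraction model as described above.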
After obtaining the acoustic features of the speech to be recognized, inputting the acoustic features into a language recognition model trained in advance, thereby obtaining the language features obtained by performing feature extraction on the acoustic features in the language recognition process.
Subsequently, the acoustic features and the language features of the speech to be recognized are applied to speech recognition. The dashed box in fig. 4 is the speech recognition model, which includes a coding layer and two parallel decoding layers, namely a speech language decoding layer and a preset language decoding layer. The coding layer performs speech recognition coding using the input acoustic features and language features to obtain the coding features. The speech language decoding layer performs speech decoding in the speech language using the input language features and the coding features produced by the coding layer, yielding the recognition text in the speech language together with decoding features that reflect the result of this decoding. The preset language decoding layer performs speech decoding in the preset language using the input language features, the coding features produced by the coding layer, and the decoding features produced by the speech language decoding layer, yielding the recognition text in the preset language.
For example, if the speech to be recognized is Cantonese speech, language features representing Cantonese are obtained through language recognition, and these language features together with the acoustic features of the speech to be recognized are input into the speech recognition model, thereby obtaining both the Cantonese orthographic text and the Mandarin text.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of a speech recognition apparatus provided by the present invention, as shown in fig. 5, the apparatus includes:
a language identification unit 510, configured to perform language identification on a speech to be identified, so as to obtain a language feature of the speech to be identified;
a speech decoding unit 520, configured to perform speech decoding on the coding feature of the speech to be recognized based on the language feature, so as to obtain recognition texts of the speech to be recognized in the speech language and the preset language respectively, where the speech language is the language indicated by the language feature.
The device provided by this embodiment of the present invention performs speech decoding in the speech language and the preset language based on the language features and the coding features of the speech to be recognized, thereby producing bilingual recognition output in both languages. The decoding for the speech language and for the preset language runs in parallel, so no translation on top of the speech-language recognition text is required, which effectively improves the accuracy of the preset-language recognition text and shortens the response time of speech recognition. Decoding in the speech language and in the preset language shares the coding features of the speech to be recognized, i.e., a unified modeling approach is provided, which makes deployment more flexible and effectively reduces deployment and maintenance costs.
Based on any of the above embodiments, the speech decoding unit is configured to:
based on the language features, performing voice decoding on the coding features in the voice language to obtain decoding features and a recognition text of the voice to be recognized in the voice language;
and based on the decoding characteristics and the language characteristics, or based on the language characteristics, performing voice decoding on the coding characteristics in the preset language to obtain a recognition text of the voice to be recognized in the preset language.
Based on any of the embodiments above, the language identification unit is configured to:
extracting acoustic features of the voice to be recognized to obtain the acoustic features of the voice to be recognized;
based on the acoustic features, performing language recognition on the voice to be recognized to obtain language features of the voice to be recognized;
and carrying out voice recognition coding on the voice to be recognized based on the acoustic features to obtain the coding features of the voice to be recognized.
Based on any of the embodiments above, the language identification unit is configured to:
and carrying out voice recognition coding on the voice to be recognized based on the acoustic features and the language features to obtain the coding features of the voice to be recognized.
Based on any of the embodiments above, the language identification unit is configured to:
inputting the acoustic features into a language identification model, performing language feature extraction by the language identification model based on the acoustic features, and performing language identification based on the extracted language features to obtain the language of the voice to be identified output by the language identification model;
the language identification model is obtained based on acoustic feature training of sample voice under each sample language.
Based on any of the above embodiments, the speech decoding unit is configured to:
inputting the acoustic features and language features of the voice to be recognized into a voice recognition model, performing voice recognition coding on the voice to be recognized by the voice recognition model based on the acoustic features and the language features or based on the acoustic features, and performing voice decoding based on the language features and coding features obtained by coding to obtain recognition texts of the voice to be recognized output by the voice recognition model under the voice language and a preset language respectively;
the speech recognition model is obtained by training the difference between a sample text and a predicted text of sample speech in a speech language and a preset language, and the predicted text is determined by the trained speech recognition model based on the acoustic features and the language features of the sample speech.
Based on any of the above embodiments, the loss function of the speech recognition model is obtained by weighted summation of speech language loss and preset language loss;
the language loss of the voice is determined based on the difference between the sample text and the predicted text of the sample voice in the language of the voice, and the preset language loss is determined based on the difference between the sample text and the predicted text of the sample voice in the preset language.
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a processor (processor)610, a communication Interface (Communications Interface)620, a memory (memory)630 and a communication bus 640, wherein the processor 610, the communication Interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a speech recognition method comprising:
performing language identification on a voice to be identified to obtain language characteristics of the voice to be identified;
and performing voice decoding on the coding features of the voice to be recognized based on the language features to obtain recognition texts of the voice to be recognized respectively in a voice language and a preset language, wherein the voice language is the language indicated by the language features.
In addition, the logic instructions in the memory 630 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a speech recognition method provided by the above methods, the method comprising:
performing language identification on a voice to be identified to obtain language characteristics of the voice to be identified;
and performing voice decoding on the coding features of the voice to be recognized based on the language features to obtain recognition texts of the voice to be recognized respectively in a voice language and a preset language, wherein the voice language is the language indicated by the language features.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the speech recognition methods provided above, the method comprising:
performing language identification on a voice to be identified to obtain language characteristics of the voice to be identified;
and performing voice decoding on the coding features of the voice to be recognized based on the language features to obtain recognition texts of the voice to be recognized respectively in a voice language and a preset language, wherein the voice language is the language indicated by the language features.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech recognition method, comprising:
performing language identification on a voice to be identified to obtain language characteristics of the voice to be identified;
and performing voice decoding on the coding features of the voice to be recognized based on the language features to obtain recognition texts of the voice to be recognized respectively in a voice language and a preset language, wherein the voice language is the language indicated by the language features.
2. The speech recognition method according to claim 1, wherein the speech decoding the coding feature of the speech to be recognized based on the language feature to obtain recognition texts of the speech to be recognized in a speech language and a preset language respectively comprises:
based on the language features, performing voice decoding on the coding features in the voice language to obtain decoding features and a recognition text of the voice to be recognized in the voice language;
and based on the decoding characteristics and the language characteristics, or based on the language characteristics, performing voice decoding on the coding characteristics in the preset language to obtain a recognition text of the voice to be recognized in the preset language.
3. The speech recognition method according to claim 1, wherein the performing language recognition on the speech to be recognized to obtain the language features of the speech to be recognized comprises:
extracting acoustic features of the voice to be recognized to obtain the acoustic features of the voice to be recognized;
based on the acoustic features, performing language recognition on the voice to be recognized to obtain language features of the voice to be recognized;
and carrying out voice recognition coding on the voice to be recognized based on the acoustic features to obtain the coding features of the voice to be recognized.
4. The speech recognition method according to claim 3, wherein the performing speech recognition coding on the speech to be recognized based on the acoustic feature to obtain the coding feature of the speech to be recognized comprises:
and carrying out voice recognition coding on the voice to be recognized based on the acoustic features and the language features to obtain the coding features of the voice to be recognized.
5. The speech recognition method according to claim 3, wherein the performing language recognition on the speech to be recognized based on the acoustic feature to obtain a language feature of the speech to be recognized comprises:
inputting the acoustic features into a language identification model, performing language feature extraction by the language identification model based on the acoustic features, and performing language identification based on the extracted language features to obtain the language of the voice to be identified output by the language identification model;
the language identification model is obtained based on acoustic feature training of sample voice under each sample language.
6. The speech recognition method according to any one of claims 1 to 5, wherein the performing speech decoding on the coding feature of the speech to be recognized based on the language feature to obtain recognition texts of the speech to be recognized in a speech language and a preset language respectively comprises:
inputting the acoustic features and language features of the voice to be recognized into a voice recognition model, performing voice recognition coding on the voice to be recognized by the voice recognition model based on the acoustic features and the language features or based on the acoustic features, and performing voice decoding based on the language features and coding features obtained by coding to obtain recognition texts of the voice to be recognized output by the voice recognition model under the voice language and a preset language respectively;
the speech recognition model is obtained by training the difference between a sample text and a predicted text of sample speech in a speech language and a preset language, and the predicted text is determined by the trained speech recognition model based on the acoustic features and the language features of the sample speech.
7. The speech recognition method according to claim 6, wherein the loss function of the speech recognition model is obtained by weighted summation of the speech language loss and the predetermined language loss;
the language loss of the voice is determined based on the difference between the sample text and the predicted text of the sample voice in the language of the voice, and the preset language loss is determined based on the difference between the sample text and the predicted text of the sample voice in the preset language.
8. A speech recognition apparatus, comprising:
the language identification unit is used for carrying out language identification on the voice to be identified to obtain the language characteristics of the voice to be identified;
and the voice decoding unit is used for carrying out voice decoding on the coding characteristics of the voice to be recognized based on the language characteristics to obtain recognition texts of the voice to be recognized respectively under the voice language and a preset language, wherein the voice language is the language indicated by the language characteristics.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech recognition method according to any of claims 1 to 7 are implemented when the processor executes the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 7.
CN202111550980.9A 2021-12-17 2021-12-17 Voice recognition method and device, electronic equipment and storage medium Pending CN114171002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111550980.9A CN114171002A (en) 2021-12-17 2021-12-17 Voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111550980.9A CN114171002A (en) 2021-12-17 2021-12-17 Voice recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114171002A true CN114171002A (en) 2022-03-11

Family

ID=80487202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111550980.9A Pending CN114171002A (en) 2021-12-17 2021-12-17 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114171002A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device
CN115831094A (en) * 2022-11-08 2023-03-21 北京数美时代科技有限公司 Multilingual voice recognition method, system, storage medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
US11875775B2 (en) Voice conversion system and training method therefor
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
CN112750419B (en) Speech synthesis method, device, electronic equipment and storage medium
CN111339278B (en) Method and device for generating training speech generating model and method and device for generating answer speech
CN112634856A (en) Speech synthesis model training method and speech synthesis method
CN110797016A (en) Voice recognition method and device, electronic equipment and storage medium
CN113539242A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN112771607A (en) Electronic device and control method thereof
CN114171002A (en) Voice recognition method and device, electronic equipment and storage medium
CN110853621B (en) Voice smoothing method and device, electronic equipment and computer storage medium
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN111710337A (en) Voice data processing method and device, computer readable medium and electronic equipment
CN112397056A (en) Voice evaluation method and computer storage medium
CN112837669A (en) Voice synthesis method and device and server
CN113658577A (en) Speech synthesis model training method, audio generation method, device and medium
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN113793599A (en) Training method of voice recognition model and voice recognition method and device
KR20210123545A (en) Method and apparatus for conversation service based on user feedback
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium
CN115861670A (en) Training method of feature extraction model and data processing method and device
CN114519358A (en) Translation quality evaluation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination