CN113077781A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN113077781A
Authority
CN
China
Prior art keywords
language
information
voice
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110621665.4A
Other languages
Chinese (zh)
Other versions
CN113077781B (en)
Inventor
李成飞
林连志
杨嵩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110621665.4A
Publication of CN113077781A
Application granted
Publication of CN113077781B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/18 - Speech classification or search using natural language modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech recognition method and apparatus, an electronic device, and a storage medium. The specific implementation scheme is as follows: performing language classification processing on speech information to obtain language information; analyzing sentence relationships of the speech information to obtain linguistic information describing the sentence relationships in the speech information; performing speech feature extraction on the speech information to obtain speech features, specifically by inputting the speech information into a speech coding model, extracting acoustic features in the speech coding model, and taking the obtained acoustic features as the speech features; and performing speech recognition processing according to the language information, the linguistic information and the speech features to obtain a speech recognition result. The method and apparatus can improve the accuracy of speech recognition.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
Electronic devices such as portable devices and mobile phone terminals have become increasingly intelligent, and their chips have ever stronger analysis capabilities, so that speech information, video information containing speech, and the like can be analyzed efficiently by means of artificial intelligence technology.
Taking speech information as an example, users from different countries and regions, and even users within the same region, have different pronunciation habits. With the development of globalization, mixed input of multiple languages (such as Chinese and English) has become the norm in users' daily communication, so an accurate speech recognition scheme is urgently needed.
Disclosure of Invention
The application provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a speech recognition method including:
performing language classification processing on speech information to obtain language information;
analyzing sentence relationships of the speech information to obtain linguistic information describing the sentence relationships in the speech information;
performing speech feature extraction on the speech information to obtain speech features, by inputting the speech information into a speech coding model, extracting acoustic features from the speech information in the speech coding model, and taking the obtained acoustic features as the speech features;
and performing speech recognition processing according to the language information, the linguistic information and the speech features to obtain a speech recognition result.
According to another aspect of the present application, there is provided a voice recognition apparatus including:
a classification module, configured to perform language classification processing on speech information to obtain language information;
an analysis module, configured to analyze sentence relationships of the speech information to obtain linguistic information describing the sentence relationships in the speech information;
an extraction module, configured to perform speech feature extraction on the speech information to obtain speech features, by inputting the speech information into a speech coding model, extracting acoustic features from the speech information in the speech coding model, and taking the obtained acoustic features as the speech features;
and a speech recognition module, configured to perform speech recognition processing according to the language information, the linguistic information and the speech features to obtain a speech recognition result.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as provided by any one of the embodiments of the present application.
According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
By adopting the method and the device, language classification processing can be performed on speech information to obtain language information; sentence relationships of the speech information can be analyzed to obtain linguistic information describing the sentence relationships in the speech information; and speech features can be obtained by performing speech feature extraction on the speech information, specifically by inputting the speech information into a speech coding model, extracting acoustic features from the speech information in the speech coding model, and taking the obtained acoustic features as the speech features. A comprehensive operation can then be performed on the language information, the linguistic information and the speech features, and speech recognition processing can be carried out through this comprehensive operation, so that a more accurate speech recognition result can be obtained. In other words, the accuracy of speech recognition is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating an end-to-end Chinese-English hybrid speech recognition framework with joint language information according to an application example of an embodiment of the present application;
FIG. 3 is a flowchart illustrating a process of a language tag stream according to an exemplary application of the present application;
FIG. 4 is a diagram illustrating a conversion of an exemplary Chinese and English text modeling unit according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a joint model in accordance with an application example of an embodiment of the present application;
FIG. 6 is a schematic diagram of a component structure of a speech recognition device according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device for implementing a speech recognition method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. The term "at least one" herein means any combination of at least two of any one or more of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C. The terms "first" and "second" used herein refer to and distinguish one from another in the similar art, without necessarily implying a sequence or order, or implying only two, such as first and second, to indicate that there are two types/two, first and second, and first and second may also be one or more.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
For a speech recognition scheme, taking mixed Chinese-English speech information as an example, a Chinese-English hybrid speech recognition task may be established, and either an end-to-end neural network model or a traditional speech recognition model may be adopted for this task. The end-to-end neural network model integrates acoustics, the pronunciation dictionary and the language model into a whole for joint optimization, and is superior to the traditional speech recognition model. However, although the end-to-end neural network model has made great progress in recognizing individual languages (for example, single languages such as Japanese, English or Korean), its recognition of multi-language input, i.e., the above-mentioned Chinese-English hybrid input, is still unsatisfactory because the speaker switches languages at will.
To address the low speech recognition accuracy in Chinese-English mixed input scenarios, the accuracy is improved by an end-to-end Chinese-English hybrid speech recognition method with joint language information. Specifically, a language classification model can be trained, the language information is obtained through the language classification model, and intermediate features capable of representing the language information (such as bottleneck features) are input into the joint model for the final decoding operation. In addition to the language information, the linguistic information and the speech features are also considered, and the final recognition result is output through a comprehensive operation on these three types of information: the language information, the linguistic information and the speech features.
According to an embodiment of the present application, a speech recognition method is provided. FIG. 1 is a flowchart of a speech recognition method according to an embodiment of the present application. The method may be applied to a speech recognition apparatus; for example, the apparatus may be deployed in a terminal, a server or other processing device to perform feature extraction, speech recognition and the like. The terminal may be User Equipment (UE), a mobile device, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, and so on. In some possible implementations, the method may also be implemented by a processor calling computer-readable instructions stored in a memory. As shown in FIG. 1, the method includes:
s101, language classification processing is carried out on the voice information to obtain language information.
In one example, the speech information may be input into a trained language classification model, and speech frame level language classification processing may be performed in the language classification model to obtain the language information.
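As an illustration of what frame-level language classification can look like in practice, the following is a minimal PyTorch sketch of a per-frame classifier head applied to encoder outputs. The dimensions, the module name and the three-way label set {blank, Chinese, English} are assumptions made for the sketch, not the exact architecture of this application.

```python
import torch
import torch.nn as nn

class FrameLanguageClassifier(nn.Module):
    """Per-frame language classifier head: maps encoded speech frames to
    language posteriors over an assumed label set {blank, Chinese, English}."""
    def __init__(self, encoder_dim=256, num_langs=3):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, num_langs)

    def forward(self, encoder_frames):          # (batch, time, encoder_dim)
        logits = self.proj(encoder_frames)      # (batch, time, num_langs)
        return logits.log_softmax(dim=-1)       # per-frame language log-posteriors

# Example: 2 utterances, 100 frames of 256-dimensional encoded speech
frames = torch.randn(2, 100, 256)
lang_posteriors = FrameLanguageClassifier()(frames)    # shape (2, 100, 3)
```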
S102, analyzing sentence relationships of the speech information to obtain linguistic information describing the sentence relationships in the speech information.
In one example, the speech information may be input into a language model, and the language model analyzes the sentence relationships of the text information corresponding to the speech information to obtain the linguistic information.
S103, performing speech feature extraction on the speech information to obtain speech features: inputting the speech information into a speech coding model, extracting acoustic features from the speech information in the speech coding model, and taking the obtained acoustic features as the speech features.
S104, performing speech recognition processing according to the language information, the linguistic information and the speech features to obtain a speech recognition result.
In one example, the language information, the linguistic information and the speech features may be input into a joint model, where the three are combined and a comprehensive operation is performed to obtain the recognition result.
By adopting the method and the device, language classification processing can be performed on the speech information to obtain language information, sentence relationships of the speech information can be analyzed to obtain linguistic information describing the sentence relationships in the speech information, and speech features can be obtained by performing speech feature extraction on the speech information. A comprehensive operation can then be performed on the language information, the linguistic information and the speech features, and speech recognition processing can be carried out through this comprehensive operation, so that a more accurate speech recognition result can be obtained. In other words, the accuracy of speech recognition is improved.
It should be noted that the above steps S101-S104 can be implemented by a speech recognition model for Chinese-English hybrid input, which may be composed of the above language classification model, the above speech coding model and the above joint model. The speech recognition model of the present application is not limited to Chinese-English hybrid input; it applies to any case of multi-language hybrid input. By performing speech recognition processing through the comprehensive operation, interference from multi-language hybrid input, from the pronunciation habits of different users, and the like can be eliminated, and the accuracy of speech recognition is greatly improved.
In one embodiment, performing language classification processing at the speech frame level in the language classification model to obtain the language information includes: extracting text information corresponding to the speech information in the language classification model; and obtaining the language information in the language classification model according to a language classification mapping relationship between the speech information and each character in the text information. With this embodiment, the speech information can be mapped to the language classification of the corresponding text information through the established mapping relationship, so that the language information to which the speech information belongs can be obtained, and the accuracy of subsequent speech recognition can be improved.
In one embodiment, analyzing the sentence relationships of the text information corresponding to the speech information in the language model to obtain the linguistic information includes: analyzing the sentence relationships in the language model according to the internal rules of the language to obtain the linguistic information. With this embodiment, the sentence relationships can be analyzed according to the internal rules of the language, for example by evaluating whether a sentence relationship is reasonable under those rules, so that more appropriate linguistic information can be obtained and the accuracy of subsequent speech recognition can be improved.
In one embodiment, performing speech recognition processing according to the language information, the linguistic information and the speech features to obtain a speech recognition result includes: inputting the language information, the linguistic information and the speech features into a joint model; in the joint model, keeping the language information consistent with the speech features in vector dimension, and splicing the resulting language information vector and speech feature vector to obtain a vector to be processed; and, in the joint model, encoding and decoding the vector to be processed based on a Recurrent Neural Network (RNN) and the linguistic information to obtain the speech recognition result. With this embodiment, the comprehensive operation is carried out in the joint model; for a better result, the data can first be aligned so that the vector dimensions are consistent, and then vector splicing and the like are performed. Finally, speech recognition processing is carried out through the comprehensive operation on the three types of information, namely the language information, the linguistic information and the speech features, which improves the accuracy of speech recognition.
In one embodiment, the method further comprises: acquiring first speech information from a corpus; performing text labeling processing on first text information corresponding to the first speech information to obtain first text labeling data; taking the data pair constructed from the first speech information and the first text labeling data as speech sample training data; and training the language classification model according to the speech sample training data to obtain a trained language classification model. With this embodiment, the first speech information can be extracted from the corpus in advance, so that the data pair constructed from the first speech information and the first text labeling data is used to train the language classification model. After speech information to be processed is received, language classification can be performed directly with the trained language classification model, giving higher operation efficiency and higher accuracy.
In one embodiment, the method further comprises: processing the Chinese text in the first text information into single Chinese characters; processing the English text in the first text information into subwords; and performing regularization processing on the character sequence to be processed, formed by the single Chinese characters and the subwords, to obtain the language label classification corresponding to each character in the first text information. With this embodiment, language label classification can be carried out on the first text labeling data, and the language classification model can be trained better according to the resulting language label classification.
In one embodiment, taking the data pair constructed from the first speech information and the first text labeling data as speech sample training data includes: adding the language label classification obtained for the first text information into the first text labeling data to obtain second text labeling data; and taking the data pair constructed from the first speech information and the second text labeling data as the speech sample training data. With this embodiment, the language label classification can also be used as part of the training data and added into the first text labeling data to obtain the second text labeling data, and the data pair constructed from the first speech information and the second text labeling data is used as speech sample training data to train the language classification model better.
In one embodiment, the method further comprises: performing speech feature extraction on the first speech information to obtain first speech features; inputting the first speech features into a Connectionist Temporal Classification (CTC) module for sequence classification; and mapping the first speech features to the corresponding language label classification in the CTC module and then performing length alignment processing. With this embodiment, through the alignment processing after the mapping in the CTC module, the language classification model can be trained based on the first speech features and the corresponding language label classification; the trained language classification model can ultimately assign a label of one language to each input speech frame, so that more accurate language information can be identified.
Application example:
For Chinese-English hybrid speech recognition, a traditional speech recognition algorithm constructs a pronunciation dictionary, for example by mapping English words to Chinese pronunciations and building a pronunciation vocabulary from Chinese pronunciation phonemes. In one example of mapping an English word to a Chinese pronunciation, the word "python" is mapped to a Chinese transliteration, which is then phonetically annotated according to a Chinese pronunciation dictionary. A traditional speech recognition algorithm can handle some cases of Chinese-English mixed recognition, and its performance mainly depends on the size of the vocabulary mapping English words to Chinese pronunciations. However, this process requires manual annotation, many English words have only approximate Chinese pronunciations, and some English words cannot be mapped to a Chinese pronunciation at all, so a traditional speech recognition model trained with this algorithm generalizes poorly.
The goal of a good deep learning model is to generalize well from the training data to any data in the problem domain, i.e., the model should be universal rather than a special-purpose model that only works on specific data and generalizes poorly. With the development of deep learning, neural network algorithms based on deep learning have been applied to speech recognition, replacing the separate traditional modules such as the acoustic model, the language model and the pronunciation dictionary with a neural network that performs their functions; such neural network algorithms are collectively called end-to-end speech recognition algorithms. An end-to-end speech recognition algorithm uses end-to-end modeling units: a Chinese modeling unit takes single Chinese characters (for example, the Chinese greeting corresponding to "HELLO" is split into its individual characters), and English uses subwords (for example, for the English word "HELLO", a part of the word such as "HEL" is taken as a subword). The "label data pairs" formed by single Chinese characters and the corresponding subwords are then used as training data to train the model. However, in the case of Chinese-English mixed input, because the speaker switches languages at will, mainstream end-to-end speech recognition algorithms cannot learn this random switching and therefore no longer suffice for training a Chinese-English mixed recognition model.
This application example is an end-to-end Chinese-English hybrid speech recognition method with joint language information. Considering that Chinese and English are different languages in a Chinese-English scenario, frame-level language classification can be performed on the input speech information, and the language information obtained through the language classification is then input into the joint model; in particular, intermediate features capable of representing the language information (bottleneck features) can be input into the joint model for the final decoding operation. With the speech recognition method of this application example, the language information, the linguistic information and the speech features are considered comprehensively, and the final speech recognition result is obtained by the end-to-end Chinese-English hybrid algorithm with joint language information, which greatly improves the accuracy of the speech recognition result. The bottleneck feature uses only the convolutional layers of the network model; the layers above the fully connected layer are discarded, and the resulting output is then computed on the training set and the test set.
Fig. 2 is a schematic diagram of an end-to-end Chinese-English hybrid speech recognition framework with joint language information according to an application example of the embodiment of the present application. As shown in fig. 2, it includes: a language classification model, a language model and a joint model. The language classification model is used to implement language classification and can be composed of a speech coding model and a CTC module; the CTC module is mainly used to handle the alignment of input and output labels in a sequence annotation problem, and the language label classification output by the CTC module can be denoted LID. The language model is used to describe the inherent regularities of natural language. The joint model is used to perform a comprehensive operation on the output of the language model (the linguistic information), the output of the CTC module (the language information including the language label classification) and the output of the speech coding model (the acoustic feature information); the operation result obtained by the joint model is output and normalized by a Softmax function to obtain the final recognition result.
The specific processing involved in implementing the application example is described below, including the following:
First, language classification
Language classification means judging the language information at the speech frame level according to the speech features of the input speech information (e.g., acoustic features extracted from the audio of the speech information). It can be implemented by, but is not limited to, the language classification model described above. For example, the language classification model may preprocess the originally input speech information and the text labeling data obtained by labeling the text corresponding to the speech information; the speech information and the text labeling data have a corresponding relationship, which may be recorded as "speech information <-> text labeling data" to form a "data pair". The text labeling data may further include a language label classification. The specific content of the language classification is as follows:
1) For the speech information, considering that the Fbank feature is already close to the response characteristics of the human ear and serves as an evaluation index for speech recognition, in practical applications the Filter Bank (Fbank) feature can be extracted directly from the speech information. Since the human ear's response to the sound spectrum is nonlinear, processing the audio with a front-end algorithm such as Fbank, in a manner similar to the human ear, can improve speech recognition performance.
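Purely as an illustration, frame-level Fbank features of this kind can be computed with common open-source tools; the file name, the 16 kHz mono recording, the 80 mel bins and the use of torchaudio below are assumptions made for the sketch, not requirements of this application.

```python
import torchaudio

# Load an utterance and compute log-mel filter-bank (Fbank) features.
waveform, sample_rate = torchaudio.load("utterance.wav")   # assumed 16 kHz mono file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,              # illustrative choice of feature dimension
    sample_frequency=sample_rate,
    frame_length=25.0,            # analysis window length in ms
    frame_shift=10.0,             # hop length in ms
)
# fbank: (num_frames, 80) frame-level acoustic features fed to the speech encoder
```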
2) For the above labeled data pair, the text label preprocessing can be performed on the text corresponding to the voice information, and the language label of the required language can be obtained. Fig. 3 is a schematic processing flow diagram of a language tag stream according to an application example of the embodiment of the present application, and as shown in fig. 3, the processing flow includes the following contents:
First, the text corresponding to the speech information is obtained, and text labeling preprocessing is performed on the text through a Byte Pair Encoding (BPE) model.
For example, for a mixed Chinese-English text such as "HELLO" followed by Chinese words and "CHECK", text labeling preprocessing is performed by the BPE model. Specifically, the original Chinese text is processed into single Chinese characters (for example, the two-character Chinese word for "we" is split into its two constituent characters), and the original English text is processed into subwords (for example, HELLO -> "HE" and "LLO"), so as to obtain the character sequence to be processed, composed of single Chinese characters and English subwords.
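A minimal sketch of this mixed-text preprocessing is shown below. A real system would use a trained BPE model for the English subwords; the simple stand-in splitter, the regular-expression test for Chinese characters and the sample sentence are all assumptions made for illustration.

```python
import re

CJK = re.compile(r'[\u4e00-\u9fff]')   # matches Chinese (CJK) characters

def split_mixed_text(text, bpe_split):
    """Split a mixed Chinese-English string into single Chinese characters
    and English subwords (the latter produced by a BPE segmenter)."""
    tokens = []
    for word in text.split():
        if CJK.search(word):
            tokens.extend(list(word))       # Chinese: one token per character
        else:
            tokens.extend(bpe_split(word))  # English: subword units from BPE
    return tokens

# Stand-in for a trained BPE segmenter -- illustrative only
fake_bpe = lambda w: [w[:2], w[2:]] if len(w) > 2 else [w]
print(split_mixed_text("HELLO 我们一起 CHECK", fake_bpe))
# ['HE', 'LLO', '我', '们', '一', '起', 'CH', 'ECK']
```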
Furthermore, the obtained single Chinese characters and subwords can be written into a Chinese-English modeling unit. To improve the efficiency and accuracy of speech recognition, more corpora can be accumulated and the original Chinese text processed into single Chinese characters in the same way (for example, a two-character Chinese word such as "little monkey" is split into its constituent characters), as shown in FIG. 4.
Second, language tags are mapped.
For example, the character sequence to be processed may be regularized (the regularization may be implemented by a regularization module) to implement the conversion of the language label mapping: a Chinese character is labeled CN, an English letter or subword is labeled EN, and a special separator symbol is inserted between subwords. The special symbol is added mainly because the time period between two Chinese characters or between subwords does not represent any language; a special symbol such as a blank may be used, and it is not limited to a blank, as any symbol that can separate Chinese characters or subwords falls within the protection scope of the present application.
In the language label stream "BL EN BL CN BL EN BL CN BL" shown in FIG. 3, which corresponds to the character sequence to be processed, BL indicates the position of a blank, EN indicates the position of an English subword in the character sequence to be processed, and CN indicates the position of a single Chinese character in the character sequence to be processed.
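The following sketch illustrates this regularization step: each token is mapped to a language tag and a blank (BL) separator is inserted around the tokens. The exact tag names and the rule of one separator between every pair of tokens are assumptions made for the example.

```python
def to_language_tags(tokens):
    """Map a token sequence to a language tag stream: CN for a single Chinese
    character, EN for an English subword, with BL (blank) separators between tokens."""
    tags = ["BL"]
    for tok in tokens:
        tags.append("CN" if '\u4e00' <= tok[0] <= '\u9fff' else "EN")
        tags.append("BL")
    return tags

print(to_language_tags(["HE", "LLO", "我", "们"]))
# ['BL', 'EN', 'BL', 'EN', 'BL', 'CN', 'BL', 'CN', 'BL']
```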
3) In the training of the language classification model, the extraction of acoustic features from the audio of the speech information can be implemented with a multi-head attention encoding method, so that each head captures information of a different aspect; multiple kinds of semantics are thereby extracted and speech recognition and classification become more accurate. The acoustic features can be encoded to obtain high-order features, which are input into the CTC module, so that the language information at the speech frame level can be judged from the speech features of the input speech information.
The structure of the speech coding model in fig. 2 is described in detail as follows:
The speech coding model may consist of N = 6 identical layers, each layer consisting of two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. Each sub-layer adds a residual connection and layer normalization, so the output of the speech coding model can be expressed as formulas (1) to (4):

SpeechEncoding(X) = LayerNorm(X + MultiHead(Q, K, V))    (1)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (2)

head_i = SelfAttention(Q, K, V), where Q = X W1, K = X W2, V = X W3    (3)

SelfAttention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (4)

In formulas (1) to (4), SpeechEncoding(X) represents the speech encoding; X represents the input; LayerNorm represents layer normalization; MultiHead represents the multi-head mechanism; Concat represents the splicing (concatenation) operation; head_i represents the computation of the i-th head; SelfAttention represents the self-attention mechanism; Q, K and V are vectors obtained by multiplying the input by different parameter matrices (Q, K and V are usually computed as parallel matrix operations); d_k is the dimension of the vector V; W^O is the output parameter matrix of the multi-head mechanism; and W1, W2 and W3 are the parameter matrices for Q, K and V, respectively.
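A hedged PyTorch sketch of one encoder layer with the structure of formulas (1) to (4) follows: multi-head self-attention and a feed-forward sub-layer, each wrapped in a residual connection and layer normalization. The model dimension, number of heads and use of nn.MultiheadAttention are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SpeechEncoderLayer(nn.Module):
    """One speech-encoder layer: multi-head self-attention + feed-forward,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x)     # Q = K = V = x (self-attention)
        x = self.norm1(x + attn_out)         # LayerNorm(x + MultiHead), formula (1)
        return self.norm2(x + self.ff(x))    # second sub-layer with the same pattern

# Stack N = 6 identical layers as the speech coding model
encoder = nn.Sequential(*[SpeechEncoderLayer() for _ in range(6)])
frames = torch.randn(2, 100, 256)            # e.g. Fbank features projected to d_model
encoded = encoder(frames)                    # (2, 100, 256) high-order features
```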
CTC is an algorithm suitable for situations in which it is not known whether the inputs and outputs are aligned. It can be defined as follows: for an input X = [x_1, x_2, ..., x_T], the corresponding output is Y = [y_1, y_2, ..., y_U], where X represents the output of the speech coding model and Y represents the corresponding language tag sequence. Because the lengths of X and Y are not equal (the number of speech frames is much larger than the number of language tags), an alignment process is required; to train on this type of data, a mapping relationship from X to Y must be found, which CTC solves. The CTC loss function can be defined as follows: for a given input X, training seeks to maximize the posterior probability P(Y | X), and P(Y | X) is differentiable, so a gradient descent algorithm can be used. For a pair of inputs and outputs (X, Y), the objective maximized by CTC is given by formula (5), where A refers to an alignment path between X and Y in CTC whose length is consistent with the input sequence X, and p_t(a_t | X) is the output at each time instant:

P(Y | X) = Σ_{A ∈ A(X, Y)} Π_{t=1}^{T} p_t(a_t | X)    (5)

Through this training, each input speech frame can be made to correspond to a label classification of one language.
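A hedged sketch of how the CTC criterion of formula (5) might be applied to align frame-level language posteriors with the much shorter language tag sequence, using PyTorch's built-in CTC loss. The tensor shapes, batch size and label indices (blank = 0, CN = 1, EN = 2) are assumptions.

```python
import torch
import torch.nn as nn

# Frame-level log-probabilities over {blank=0, CN=1, EN=2}, e.g. from the
# language classification head; CTCLoss expects (time, batch, num_classes).
log_probs = torch.randn(100, 2, 3, requires_grad=True).log_softmax(dim=-1)

# Target language tag sequences, much shorter than the number of frames.
targets = torch.tensor([1, 2, 1, 2, 2, 1])    # two sequences, concatenated
input_lengths = torch.tensor([100, 100])      # frames per utterance
target_lengths = torch.tensor([4, 2])         # tags per utterance

ctc = nn.CTCLoss(blank=0)                     # sums P(Y|X) over all alignment paths
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                               # trainable by gradient descent
```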
4) In the pre-training of the language label classification, monolingual data may be used, for example pure Chinese audio-text data and pure English audio-text data, to train the whole model. The language label classification pre-training yields a pre-trained language classification model; its parameters are then used as initialization parameters, and iterative parameter-update training continues until the required language classification model is obtained. Compared with the random initialization used in conventional model training, initializing with the parameters of the pre-trained model gives the language classification model higher recognition and classification accuracy.
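A minimal sketch of that initialization step is shown below; the stand-in architecture and checkpoint file name are assumptions. Parameters pre-trained on monolingual data are loaded as the starting point instead of random initialization, before training continues on mixed Chinese-English data.

```python
import torch
import torch.nn as nn

def make_language_classifier():
    # Stand-in architecture; the real model is the encoder + CTC classifier described above.
    return nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 3))

# Stage 1: pre-train on monolingual (pure Chinese / pure English) data, then save.
pretrained = make_language_classifier()
# ... monolingual training loop would run here ...
torch.save(pretrained.state_dict(), "monolingual_pretrain.pt")

# Stage 2: initialize from the pre-trained parameters instead of random initialization,
# then continue iterative parameter-update training on Chinese-English mixed data.
model = make_language_classifier()
model.load_state_dict(torch.load("monolingual_pretrain.pt"))
```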
Second, language model
The language model is a mathematical model that describes the inherent regularities of natural language and can be applied to various tasks that require probability evaluation of sentence sequences. A classical RNN language model can be used to predict the internal regularities of the language: the recognition result at the previous moment is taken as input, and the vector computed by the RNN encoding is output.
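As a rough illustration of such an RNN language model, the sketch below embeds previously recognized tokens and passes them through a recurrent network whose hidden vectors serve as the linguistic information fed to the joint model. The vocabulary size, embedding size and use of a GRU are assumptions.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Encodes previously recognized tokens into vectors carrying
    linguistic (sentence-relationship) information."""
    def __init__(self, vocab_size=5000, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, prev_tokens):               # (batch, seq_len) token ids
        outputs, _ = self.rnn(self.embed(prev_tokens))
        return outputs                             # (batch, seq_len, hidden_dim)

# Encode the tokens recognized so far into linguistic-information vectors.
lm_vectors = RNNLanguageModel()(torch.tensor([[11, 52, 7]]))   # (1, 3, 256)
```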
Third, the joint model. FIG. 5 is a schematic diagram of the joint model according to an application example of the embodiment of the present application; as shown in FIG. 5, it includes the following:
The joint model performs a comprehensive operation on the output of the language model, the output of the CTC module and the output of the speech coding model, and outputs the final result. Specifically, the output of the CTC module passes through a vector dimension adjustment layer (reshape layer) so that the vector dimension representing the language information is kept consistent with the output of the speech coding model; the two vectors are spliced in a concatenation layer, for example the frame-level acoustic information is spliced with the frame-level language information; the spliced vector, together with the encoded output of the previous moment from the RNN-based language model, is then input into the RNN network for decoding.
It should be noted that o1, o2 and o3 in FIG. 5 respectively denote the recognition result output at each moment. Because not every speech frame produces a recognition result, the loss function can be computed in the CTC manner, so that the multiple recognition results are aligned automatically.
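A hedged sketch of this fusion step follows: the CTC language-information vector is reshaped to the acoustic dimension, concatenated frame by frame with the encoder output, combined with the language-model vector, decoded by an RNN, and projected through a softmax to token posteriors. The layer sizes and the simple additive way of injecting the language-model state are assumptions; the application does not fix these details.

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Fuses language information (CTC branch), acoustic features (speech encoder)
    and linguistic information (RNN language model) into token posteriors."""
    def __init__(self, lid_dim=3, acoustic_dim=256, lm_dim=256, vocab_size=5000):
        super().__init__()
        self.reshape_layer = nn.Linear(lid_dim, acoustic_dim)    # align vector dimensions
        self.concat_proj = nn.Linear(2 * acoustic_dim, lm_dim)   # after frame-wise concat
        self.decoder = nn.GRU(lm_dim, lm_dim, batch_first=True)
        self.out = nn.Linear(lm_dim, vocab_size)

    def forward(self, lid_post, acoustic, lm_vectors):
        lid_vec = self.reshape_layer(lid_post)                   # (B, T, acoustic_dim)
        fused = torch.cat([acoustic, lid_vec], dim=-1)           # frame-level splicing
        fused = torch.tanh(self.concat_proj(fused))
        decoded, _ = self.decoder(fused + lm_vectors)            # inject linguistic info
        return self.out(decoded).log_softmax(dim=-1)             # per-frame token posteriors

joint = JointModel()
scores = joint(torch.randn(2, 100, 3),      # frame-level language posteriors (CTC branch)
               torch.randn(2, 100, 256),    # acoustic features from the speech encoder
               torch.randn(2, 100, 256))    # linguistic-information vectors from the LM
```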
The present application provides a speech recognition apparatus. FIG. 6 is a schematic diagram of the composition of the speech recognition apparatus according to an embodiment of the present application. As shown in FIG. 6, the apparatus includes: a classification module 41, configured to perform language classification processing on speech information to obtain language information; an analysis module 42, configured to analyze sentence relationships of the speech information to obtain linguistic information describing the sentence relationships in the speech information; an extraction module 43, configured to perform speech feature extraction on the speech information to obtain speech features, by inputting the speech information into a speech coding model, extracting acoustic features from the speech information in the speech coding model, and taking the obtained acoustic features as the speech features; and a speech recognition module 44, configured to perform speech recognition processing according to the language information, the linguistic information and the speech features to obtain a speech recognition result.
In one embodiment, the classification module is configured to input the speech information into a trained language classification model, and perform language classification processing at the speech frame level in the language classification model to obtain the language information.
In one embodiment, the classification module is configured to extract text information corresponding to the speech information in the language classification model, and obtain the language information in the language classification model according to a language classification mapping relationship between the speech information and each character in the text information.
In one embodiment, the analysis module is configured to input the speech information into a language model, and analyze sentence relationships of the text information corresponding to the speech information in the language model to obtain the linguistic information.
In one embodiment, the analysis module is configured to analyze the sentence relationships in the language model according to the internal rules of the language to obtain the linguistic information.
In one embodiment, the speech recognition module is configured to input the language information, the linguistic information and the speech features into a joint model; in the joint model, keep the language information consistent with the speech features in vector dimension and splice the resulting language information vector and speech feature vector to obtain a vector to be processed; and, in the joint model, encode and decode the vector to be processed based on a recurrent neural network and the linguistic information to obtain the speech recognition result.
In one embodiment, the apparatus further includes a training module, configured to acquire first speech information from a corpus; perform text labeling processing on first text information corresponding to the first speech information to obtain first text labeling data; take the data pair constructed from the first speech information and the first text labeling data as speech sample training data; and train the language classification model according to the speech sample training data to obtain a trained language classification model.
In one embodiment, the apparatus further includes a language label classification module, configured to process the Chinese text in the first text information into single Chinese characters; process the English text in the first text information into subwords; and perform regularization processing on the character sequence to be processed, formed by the single Chinese characters and the subwords, to obtain the language label classification corresponding to each character in the first text information.
In one embodiment, the training module is configured to add the language label classification obtained for the first text information into the first text labeling data to obtain second text labeling data, and take the data pair constructed from the first speech information and the second text labeling data as the speech sample training data.
In one embodiment, the apparatus further includes an alignment processing module, configured to perform speech feature extraction on the first speech information to obtain first speech features; input the first speech features into a CTC module; and map the first speech features to the corresponding language label classification in the CTC module and then perform length alignment processing, so as to train the language classification model based on the first speech features and the corresponding language label classification.
The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 7 is a block diagram of an electronic device for implementing the speech recognition method according to the embodiment of the present application. The electronic device may be the aforementioned deployment device or proxy device. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 7, the electronic apparatus includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 7 illustrates an example with one processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the speech recognition methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the speech recognition method provided by the present application.
The memory 802, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the speech recognition methods in the embodiments of the present application. The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the voice recognition method in the above-described method embodiments.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the speech recognition method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, as exemplified by the bus connection in fig. 7.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (22)

1. A method of speech recognition, the method comprising:
performing language classification processing on speech information to obtain language information;
analyzing sentence relationships of the speech information to obtain linguistic information describing the sentence relationships in the speech information;
performing speech feature extraction on the speech information to obtain speech features, by inputting the speech information into a speech coding model, extracting acoustic features from the speech information in the speech coding model, and taking the obtained acoustic features as the speech features;
and performing speech recognition processing according to the language information, the linguistic information and the speech features to obtain a speech recognition result.
2. The method according to claim 1, wherein said performing language classification processing on the speech information to obtain language information comprises:
inputting the voice information into a trained language classification model;
and performing language classification processing at a speech frame level in the language classification model to obtain the language information.
3. The method according to claim 2, wherein said performing speech frame level language classification processing in said language classification model to obtain said language information comprises:
extracting text information corresponding to the voice information from the language classification model;
and obtaining the language information in the language classification model according to the language classification mapping relation between the voice information and each character in the text information.
4. The method according to claim 1, wherein the performing sentence relationship analysis to obtain the linguistic information describing the sentence relationships in the speech information comprises:
inputting the speech information into a language model;
and performing sentence relationship analysis on the text information corresponding to the speech information in the language model to obtain the linguistic information.
5. The method according to claim 4, wherein the performing sentence relationship analysis on the text information corresponding to the speech information in the language model to obtain the linguistic information comprises:
analyzing the sentence relationships in the language model according to the internal rules of the language to obtain the linguistic information.
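One plausible reading of claims 4 and 5, sketched below, is to pass the text corresponding to the speech through a pretrained language model and use its contextual hidden states, which encode how the words of a sentence relate to one another, as the linguistic information. The choice of the Hugging Face transformers library and of a multilingual BERT checkpoint is an assumption of this sketch, not something fixed by the claims.

```python
from transformers import AutoModel, AutoTokenizer  # library and checkpoint are assumptions

def linguistic_information(text, model_name="bert-base-multilingual-cased"):
    """Sketch: use contextual hidden states of the transcript as linguistic information."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    # One vector per token, capturing sentence-internal relationships.
    return outputs.last_hidden_state
```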
6. The method according to any one of claims 1 to 5, wherein the performing speech recognition processing according to the language category information, the linguistic information and the speech features to obtain the speech recognition result comprises:
inputting the language category information, the linguistic information and the speech features into a joint model;
in the joint model, keeping the language category information consistent with the speech features in vector dimension, and concatenating the resulting language category information vector with the speech feature vector to obtain a vector to be processed;
and in the joint model, encoding and decoding the vector to be processed based on a recurrent neural network and the linguistic information to obtain the speech recognition result.
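Claim 6 describes projecting the language category information to the same vector dimension as the speech features, concatenating the two, and then encoding and decoding the result with a recurrent network conditioned on the linguistic information. The PyTorch sketch below follows that shape; the use of GRUs, the specific dimensions, and the way the linguistic information is concatenated into the decoder input are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Sketch of the joint model in claim 6 (dimensions and layer choices are assumptions)."""

    def __init__(self, lang_dim, feat_dim, ling_dim, hidden_dim, vocab_size):
        super().__init__()
        # Keep the language category information consistent with the speech
        # features in vector dimension before concatenation.
        self.lang_proj = nn.Linear(lang_dim, feat_dim)
        # Recurrent encoder over the concatenated "vector to be processed".
        self.encoder = nn.GRU(2 * feat_dim, hidden_dim, batch_first=True)
        # Decoder conditioned on encoder states and the linguistic information.
        self.decoder = nn.GRU(hidden_dim + ling_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, lang_info, ling_info, speech_feats):
        # lang_info: (B, T, lang_dim), ling_info: (B, T, ling_dim), speech_feats: (B, T, feat_dim)
        lang_vec = self.lang_proj(lang_info)                      # align vector dimension
        to_process = torch.cat([lang_vec, speech_feats], dim=-1)  # vector to be processed
        enc_out, _ = self.encoder(to_process)
        dec_in = torch.cat([enc_out, ling_info], dim=-1)
        dec_out, _ = self.decoder(dec_in)
        return self.output(dec_out)                               # per-step token scores
```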
7. The method according to any one of claims 1 to 5, further comprising:
acquiring first speech information from a corpus;
performing text labeling processing on first text information corresponding to the first speech information to obtain first text labeling data;
taking a data pair constructed from the first speech information and the first text labeling data as speech sample training data;
and training the language classification model according to the speech sample training data to obtain the trained language classification model.
8. The method of claim 7, further comprising:
processing Chinese text in the first text information into single Chinese characters;
processing English text in the first text information into subwords;
and performing regularization processing on the character sequence to be processed, which is formed from the single Chinese characters and the subwords, to obtain a language label classification corresponding to each character in the first text information.
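Claim 8 splits the annotation text by script: Chinese becomes single characters, English becomes subwords, and every resulting unit receives a language label. The sketch below shows one way to do this in Python; the regular expression, the 'zh'/'en' label names, and the optional subword tokenizer hook are assumptions, and English words are simply kept whole when no tokenizer is supplied.

```python
import re

def label_languages(text, english_subword_tokenizer=None):
    """Sketch of claim 8's labeling: single Chinese characters, English subwords,
    and a language label per unit. The tokenizer argument is a placeholder (e.g. BPE)."""
    tokens, labels = [], []
    for piece in re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text):
        if re.match(r"[\u4e00-\u9fff]", piece):
            # Single Chinese character
            tokens.append(piece)
            labels.append("zh")
        else:
            # English word -> subwords (kept whole if no tokenizer is given)
            subwords = (english_subword_tokenizer(piece)
                        if english_subword_tokenizer else [piece])
            tokens.extend(subwords)
            labels.extend(["en"] * len(subwords))
    return tokens, labels

# Example: label_languages("今天 learning 很 fun")
#   -> (['今', '天', 'learning', '很', 'fun'], ['zh', 'zh', 'en', 'zh', 'en'])
```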
9. The method of claim 8, wherein the taking the data pair constructed from the first speech information and the first text labeling data as speech sample training data comprises:
adding the language label classification, obtained by classifying the first text information, to the first text labeling data to obtain second text labeling data;
and taking a data pair constructed from the first speech information and the second text labeling data as the speech sample training data.
10. The method of claim 9, further comprising:
performing speech feature extraction on the first speech information to obtain first speech features;
inputting the first speech features into a CTC module;
and mapping the first speech features to the corresponding language label classification in the CTC module and then performing length alignment processing, so as to train the language classification model based on the first speech features and the corresponding language label classification.
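Claim 10 relies on a CTC module to reconcile the length mismatch between the frame-level first speech features and the much shorter language label sequence. A minimal PyTorch sketch of that alignment loss is given below; the blank index, tensor shapes, and the assumption that per-frame scores are already available are illustrative choices, not details fixed by the claim.

```python
import torch.nn.functional as F

def ctc_alignment_loss(frame_logits, label_ids, frame_lengths, label_lengths):
    """Sketch of the CTC-based length alignment in claim 10.

    frame_logits:  (T, B, num_labels) per-frame scores over language labels,
                   with a blank symbol at index 0 (an assumption of this sketch).
    label_ids:     (B, L) target language label sequence per utterance.
    CTC absorbs the mismatch between the number of speech frames and the
    number of labels, so no manual alignment is required.
    """
    log_probs = F.log_softmax(frame_logits, dim=-1)
    return F.ctc_loss(log_probs, label_ids, frame_lengths, label_lengths, blank=0)
```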
11. A speech recognition apparatus, characterized in that the apparatus comprises:
a classification module configured to perform language classification processing on speech information to obtain language category information;
an analysis module configured to perform sentence relationship analysis on the speech information to obtain linguistic information describing the sentence relationships in the speech information;
an extraction module configured to, in the process of performing speech feature extraction on the speech information to obtain speech features, input the speech information into a speech coding model, extract acoustic features of the speech information in the speech coding model, and take the obtained acoustic features as the speech features;
and a speech recognition module configured to perform speech recognition processing according to the language category information, the linguistic information and the speech features to obtain a speech recognition result.
12. The apparatus of claim 11, wherein the classification module is configured to:
input the speech information into a trained language classification model;
and perform language classification processing at the speech frame level in the language classification model to obtain the language category information.
13. The apparatus of claim 12, wherein the classification module is configured to:
extract, in the language classification model, text information corresponding to the speech information;
and obtain the language category information in the language classification model according to a language classification mapping relationship between the speech information and each character in the text information.
14. The apparatus of claim 11, wherein the analysis module is configured to:
input the speech information into a language model;
and perform sentence relationship analysis on the text information corresponding to the speech information in the language model to obtain the linguistic information.
15. The apparatus of claim 14, wherein the analysis module is configured to:
analyze the sentence relationships in the language model according to the internal rules of the language to obtain the linguistic information.
16. The apparatus according to any one of claims 11-15, wherein the speech recognition module is configured to:
input the language category information, the linguistic information and the speech features into a joint model;
in the joint model, keep the language category information consistent with the speech features in vector dimension, and concatenate the resulting language category information vector with the speech feature vector to obtain a vector to be processed;
and in the joint model, encode and decode the vector to be processed based on a recurrent neural network and the linguistic information to obtain the speech recognition result.
17. The apparatus of any one of claims 11-15, further comprising a training module configured to:
acquire first speech information from a corpus;
perform text labeling processing on first text information corresponding to the first speech information to obtain first text labeling data;
take a data pair constructed from the first speech information and the first text labeling data as speech sample training data;
and train the language classification model according to the speech sample training data to obtain the trained language classification model.
18. The apparatus of claim 17, further comprising a language label classification module configured to:
process Chinese text in the first text information into single Chinese characters;
process English text in the first text information into subwords;
and perform regularization processing on the character sequence to be processed, which is formed from the single Chinese characters and the subwords, to obtain a language label classification corresponding to each character in the first text information.
19. The apparatus of claim 18, wherein the training module is configured to:
add the language label classification, obtained by classifying the first text information, to the first text labeling data to obtain second text labeling data;
and take a data pair constructed from the first speech information and the second text labeling data as the speech sample training data.
20. The apparatus of claim 19, further comprising an alignment processing module configured to:
perform speech feature extraction on the first speech information to obtain first speech features;
input the first speech features into a CTC module;
and map the first speech features to the corresponding language label classification in the CTC module and then perform length alignment processing, so as to train the language classification model based on the first speech features and the corresponding language label classification.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-10.
CN202110621665.4A 2021-06-04 2021-06-04 Voice recognition method and device, electronic equipment and storage medium Active CN113077781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110621665.4A CN113077781B (en) 2021-06-04 2021-06-04 Voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110621665.4A CN113077781B (en) 2021-06-04 2021-06-04 Voice recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113077781A (en) 2021-07-06
CN113077781B (en) 2021-09-07

Family

ID=76617038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110621665.4A Active CN113077781B (en) 2021-06-04 2021-06-04 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113077781B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895932A (en) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multi-language voice recognition method based on language type and voice content collaborative classification
CN110970018A (en) * 2018-09-28 2020-04-07 珠海格力电器股份有限公司 Speech recognition method and device
US20200294488A1 (en) * 2019-06-03 2020-09-17 Beijing Dajia Internet Information Technology Co., Ltd. Method, device and storage medium for speech recognition
CN111833844A (en) * 2020-07-28 2020-10-27 苏州思必驰信息科技有限公司 Training method and system of mixed model for speech recognition and language classification
CN111916062A (en) * 2019-05-07 2020-11-10 阿里巴巴集团控股有限公司 Voice recognition method, device and system
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN112668704A (en) * 2021-03-16 2021-04-16 北京世纪好未来教育科技有限公司 Training method and device of audio recognition model and audio recognition method and device

Also Published As

Publication number Publication date
CN113077781B (en) 2021-09-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant