CN111369978B - Data processing method and device for data processing

Data processing method and device for data processing

Info

Publication number
CN111369978B
Authority
CN
China
Prior art keywords
language
decoding
voice frame
voice
result
Prior art date
Legal status
Active
Application number
CN201811603538.6A
Other languages
Chinese (zh)
Other versions
CN111369978A (en)
Inventor
周盼
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201811603538.6A
Publication of CN111369978A
Application granted
Publication of CN111369978B

Classifications

    • G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 15/063 — Speech recognition; creation of reference templates; training of speech recognition systems
    • G10L 15/142 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a data processing method, a data processing apparatus, and a device for data processing. The method specifically comprises the following steps: determining the language type of a voice frame in the voice information according to a multi-language acoustic model, the multi-language acoustic model being obtained through training according to acoustic data of at least two language types; decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame; and determining a recognition result corresponding to the voice information according to the first decoding result. The embodiment of the invention can improve the accuracy of voice recognition.

Description

Data processing method and device for data processing
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus, and a device for data processing.
Background
Speech recognition technology, also known as ASR (Automatic Speech Recognition), aims at converting the lexical content of speech into computer-readable input, such as keystrokes, binary codes or character sequences.
In daily language expression, mixed expressions of multiple languages may occur. Taking mixed Chinese-English expression as an example, a user may insert English words and sentences while expressing themselves in Chinese, for example, "I bought the latest iPhone" or "play Yesterday Once More first".
However, current speech recognition technology is more accurate for single-language speech; when the speech contains multiple languages, recognition accuracy drops significantly.
Disclosure of Invention
The embodiment of the invention provides a data processing method, a data processing apparatus, and a device for data processing, which can improve the accuracy of voice recognition in the case that the voice contains multiple languages.
In order to solve the above problems, an embodiment of the present invention discloses a data processing method, including:
Determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained through training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
And determining a recognition result corresponding to the voice information according to the first decoding result.
In another aspect, an embodiment of the present invention discloses a data processing apparatus, including:
the type determining module is used for determining the language type of the voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained through training according to acoustic data of at least two language types;
the first decoding module is used for decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
And the result determining module is used for determining a recognition result corresponding to the voice information according to the first decoding result.
In yet another aspect, an embodiment of the present invention discloses an apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
Determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained through training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
And determining a recognition result corresponding to the voice information according to the first decoding result.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
According to the embodiment of the invention, a multi-language acoustic model can be obtained by training on acoustic data of at least two language types, and the language type of each voice frame in the voice information can be determined through the multi-language acoustic model. Therefore, when the voice information contains multiple language types, voice frames of different language types can be accurately distinguished and decoded according to the decoding network of the corresponding language type, so that the first decoding result of each voice frame is obtained by decoding with the decoding network corresponding to the language type of that voice frame. This ensures the decoding accuracy and further improves the accuracy of voice recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of an embodiment of a data processing method of the present invention;
FIG. 2 is a block diagram of an embodiment of a data processing apparatus of the present invention;
FIG. 3 is a block diagram of an apparatus 800 for data processing in accordance with the present invention; and
Fig. 4 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a data processing method according to the present invention may specifically include the following steps:
step 101, determining the language type of a voice frame in voice information according to a multi-language acoustic model; the multi-language acoustic model is obtained through training according to acoustic data of at least two language types;
Step 102, decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
step 103, determining a recognition result corresponding to the voice information according to the first decoding result.
The data processing method of the embodiment of the invention can be used for recognizing voice information containing at least two language types, and can be applied to electronic devices, including but not limited to: servers, smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, car computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.
It can be understood that the manner of obtaining the voice information to be recognized is not limited in the embodiment of the present invention; for example, the electronic device may obtain the voice information to be recognized from a client or a network through a wired or wireless connection, may obtain it by recording in real time, or may obtain it from an instant messaging message received in an instant messaging application, etc.
In the embodiment of the invention, the voice information to be recognized can be segmented into a plurality of voice frames according to the preset window length and frame shift, wherein each voice frame can be a voice fragment, and then the voice information can be decoded frame by frame. If the voice information to be recognized is analog voice information (such as a recording of a user call), the analog voice information needs to be converted into digital voice information, and then the voice information is segmented.
Wherein the window length may be used to represent the duration of each frame of voice segment, and the frame shift may be used to represent the time difference between adjacent frames. For example, when the window length is 25 ms and the frame shift is 15 ms, the first voice segment covers 0-25 ms, the second covers 15-40 ms, and so on, thereby segmenting the voice information to be recognized. It will be appreciated that the specific window length and frame shift may be set according to actual requirements, and embodiments of the present invention are not limited in this respect.
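As an illustration of this framing step, the following minimal Python sketch splits a waveform into overlapping frames using a window length and frame shift; the 16 kHz sample rate, NumPy input and function name are assumptions for the example and are not part of this disclosure.

```python
import numpy as np

def split_into_frames(samples, sample_rate=16000, window_ms=25, shift_ms=15):
    """Split a mono waveform into overlapping voice frames (window length / frame shift)."""
    window = int(sample_rate * window_ms / 1000)  # 25 ms -> 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)    # 15 ms -> 240 samples at 16 kHz
    frames = [samples[start:start + window]
              for start in range(0, len(samples) - window + 1, shift)]
    return np.stack(frames) if frames else np.empty((0, window))

# Frame 0 covers 0-25 ms, frame 1 covers 15-40 ms, and so on.
frames = split_into_frames(np.zeros(16000))  # 1 s of audio -> 66 frames of 400 samples
```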
Optionally, before the voice information to be recognized is segmented, the electronic device may further perform noise reduction processing on the voice information to be recognized, so as to improve a subsequent processing capability on the voice information.
In the embodiment of the invention, the voice information can be input into a pre-trained multi-language acoustic model, and the voice recognition result can be obtained based on the output of the multi-language acoustic model. The multilingual acoustic model may be a classification model incorporating a plurality of neural networks. The neural networks include, but are not limited to, at least one, or a combination, superposition or nesting of at least two, of the following: a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, an RNN (Recurrent Neural Network), an attention neural network, and the like.
In order to improve accuracy of recognition of voice information containing multiple language types, the embodiment of the invention trains in advance according to acoustic data of at least two language types to obtain a multilingual acoustic model, and according to the multilingual acoustic model, the language type of a voice frame in voice information can be determined, so that the voice frame can be decoded according to a decoding network corresponding to the language type to obtain a first decoding result corresponding to the voice frame, and further, a recognition result corresponding to the voice information can be determined according to the first decoding result.
It will be appreciated that the embodiments of the present invention do not limit the number or the kinds of language types contained in the acoustic data used to train the multilingual acoustic model. For convenience of description, the embodiment of the present invention takes voice information containing two language types, Chinese and English, as an example; that is, the multilingual acoustic model may be trained according to collected Chinese acoustic data and English acoustic data. Of course, acoustic data of more than two language types, such as Chinese, English, Japanese and German, may also be collected to train a multilingual acoustic model. For application scenarios with more than two language types, the implementation process is similar to that of the two-language case, and the descriptions may refer to each other.
The decoding network of the embodiment of the invention may comprise decoding networks corresponding to at least two language types; for example, a Chinese decoding network and an English decoding network may be constructed respectively for the scenario of recognizing mixed Chinese-English voice information. Specifically, a Chinese language model can be trained on a collected Chinese text corpus, and a Chinese decoding network can be constructed according to knowledge sources such as the Chinese language model and a Chinese pronunciation dictionary; similarly, an English text corpus can be collected to train an English language model, and an English decoding network can be constructed according to knowledge sources such as the English language model and an English pronunciation dictionary.
In the process of decoding the voice information frame by frame, if the language type of the voice frame is determined to be Chinese according to the multi-language acoustic model, the voice frame can be decoded according to the Chinese decoding network, and if the language type of the voice frame is determined to be English according to the multi-language acoustic model, the voice frame can be decoded according to the English decoding network.
In one application example of the present invention, it is assumed that the voice information to be recognized is "I like apple". Specifically, the language type of the first voice frame in the voice information can first be determined according to the multi-language acoustic model; assuming the language type of the first voice frame is determined to be Chinese, the first voice frame can be decoded according to the Chinese decoding network to obtain a first decoding result of the first voice frame. Then the language type of the second voice frame is determined according to the multi-language acoustic model, and the second voice frame is input into the decoding network corresponding to that language type for decoding, so as to obtain a first decoding result of the second voice frame. Similarly, assuming that the language type of the m-th voice frame is determined to be English according to the multilingual acoustic model, the m-th voice frame is decoded according to the English decoding network to obtain a first decoding result of the m-th voice frame, until the decoding of the last voice frame is completed. Finally, a recognition result of the voice information may be obtained according to the first decoding results of the voice frames; for example, the recognition result may include the following text information: "I like apple".
It can be seen that, according to the embodiment of the invention, the language type of the voice frame in the voice information can be determined through the trained multi-language acoustic model, so that the voice frame can be decoded according to the decoding network of the corresponding language type, and a more accurate recognition result can be obtained.
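The frame-by-frame routing described above can be sketched as follows. This is only a schematic Python outline: the language detector and the per-language decoders are stand-in callables, not components defined by this disclosure.

```python
from typing import Callable, Dict, List, Sequence

def decode_utterance(frames: Sequence,
                     detect_language: Callable[[object], str],
                     decoders: Dict[str, Callable[[object], str]]) -> List[str]:
    """Send each voice frame to the decoding network of its detected language type."""
    return [decoders[detect_language(frame)](frame) for frame in frames]

# Toy usage with placeholder callables standing in for the acoustic model and decoders.
first_results = decode_utterance(
    frames=["frame1", "frame2", "frame3"],
    detect_language=lambda f: "en" if f == "frame3" else "zh",
    decoders={"zh": lambda f: "<zh-word>", "en": lambda f: "<en-word>"},
)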
In an optional embodiment of the present invention, the determining, according to the multi-language acoustic model, a language type of a speech frame in the speech information may specifically include:
Step S11, determining posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the states and the language types have corresponding relations;
step S12, determining probability ratio of the posterior probability of the voice frame to the state of each language type according to the posterior probability of the voice frame to each state and the language type corresponding to each state;
and step S13, determining the language type of the voice frame according to the probability ratio.
The multi-language acoustic model may convert the features of the input voice frame into posterior probabilities of states, and the states may specifically be HMM (Hidden Markov Model) states; specifically, multiple states may correspond to one phoneme, multiple phonemes may correspond to one word, and multiple words may form a sentence.
For example, it is assumed that the multi-language acoustic model may output a posterior probability corresponding to (m1+m2) states at an output layer, where the language type corresponding to M1 states may be chinese and the language type corresponding to M2 states may be english.
The voice frame is input into the multi-language acoustic model, which outputs the posterior probability of each state corresponding to the voice frame. According to the posterior probabilities of the states corresponding to the voice frame and the language type corresponding to each state, the probability ratio of the voice frame's posterior probability between the states of each language type can be determined, such as the ratio between the Chinese states and the English states, and the language type corresponding to the voice frame can then be determined according to the probability ratio.
For example, the probability value obtained by adding the posterior probabilities of the M1 Chinese states is p1, the probability value obtained by adding the posterior probabilities of the M2 English states is p2, and p1+p2=1. If p1 is greater than p2, the Chinese states account for the larger share of the voice frame's posterior probability, and the language type of the voice frame can be determined to be Chinese; otherwise, the language type of the voice frame can be determined to be English.
However, for the voice information mixed by Chinese and English, the posterior probability of English is usually smaller and rarely exceeds 0.5, so in order to reduce misjudgment, a preset threshold value can be set in the embodiment of the invention, and the language type of the voice frame is determined by comparing the probability ratio with the preset threshold value.
Taking Chinese and English mixing as an example, assuming that the probability ratio of the posterior probability of the voice frame to the English state and the Chinese state is p2/p1, if the p2/p1 exceeds a preset threshold (such as 0.25), determining that the language type of the voice frame is English; similarly, the probability ratio of the posterior probability of the speech frame to the Chinese state and the English state is p1/p2, and if the p1/p2 exceeds 4, the language type of the speech frame can be determined to be Chinese. The preset threshold value can be adjusted according to experiments, and it can be understood that the specific value of the preset threshold value is not limited in the embodiment of the invention.
Of course, since p1+p2=1, and p2/p1>0.25 is equivalent to p2>0.2, the judgment can also be made simply from the value of p1 or p2.
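A minimal sketch of steps S11-S13 for the Chinese/English case is given below. It assumes the output layer is laid out as the first M1 Chinese states followed by the M2 English states, and uses the example threshold of 0.25; the function and variable names are illustrative only.

```python
import numpy as np

def classify_frame_language(posteriors: np.ndarray, m1: int, threshold: float = 0.25) -> str:
    """Decide the language type of one voice frame from its state posterior probabilities.

    posteriors: one frame's output of the multilingual acoustic model,
                laid out as [M1 Chinese states | M2 English states] and summing to 1.
    """
    p1 = posteriors[:m1].sum()  # total posterior mass on Chinese states
    p2 = posteriors[m1:].sum()  # total posterior mass on English states
    # Since p1 + p2 = 1, p2 / p1 > 0.25 is equivalent to p2 > 0.2.
    return "en" if p2 / max(p1, 1e-10) > threshold else "zh"

# Example with M1 = 3 Chinese states and M2 = 2 English states:
frame_posteriors = np.array([0.5, 0.2, 0.08, 0.12, 0.1])  # p1 = 0.78, p2 = 0.22
print(classify_frame_language(frame_posteriors, m1=3))     # -> "en" (0.22 / 0.78 ≈ 0.28 > 0.25)
```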
In a specific application, if the user frequently switches language types or the voice information is short, judging the language type from a single voice frame may cause misjudgment.
In order to improve the accuracy of determining the language type of the voice frame, in an alternative embodiment of the present invention, the language type of a voice frame may be determined according to the average of the probability ratios, between the states of each language type, of the posterior probabilities of the consecutive voice frames within a preset window length containing that voice frame.
It will be appreciated that the specific value of the preset window length is not limited in the embodiment of the present invention; for example, the preset window length may be set to a duration covering 10 consecutive voice frames. Specifically, 10 consecutive voice frames including the current voice frame can be obtained, the probability ratio p2/p1 between the English states and the Chinese states can be computed from the posterior probability of each of the 10 voice frames, and the 10 ratios can then be averaged; if the average exceeds the preset threshold of 0.25, the language type of the voice frame can be determined to be English. This avoids the misjudgments that can result from single-frame decisions and improves the accuracy of determining the language type of the voice frame.
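The windowed decision can be sketched as follows; this is a hedged Python outline assuming the same [M1 Chinese | M2 English] layout as above, with the 10-frame window and 0.25 threshold taken from the example in the text.

```python
import numpy as np

def classify_with_window(posterior_seq: np.ndarray, m1: int,
                         win: int = 10, threshold: float = 0.25) -> list:
    """Label each voice frame using the mean p2/p1 ratio of the consecutive frames
    in the preset window (here: the up-to-`win` frames ending at that frame)."""
    p1 = posterior_seq[:, :m1].sum(axis=1)
    p2 = posterior_seq[:, m1:].sum(axis=1)
    ratios = p2 / np.maximum(p1, 1e-10)
    labels = []
    for i in range(len(ratios)):
        window_mean = ratios[max(0, i - win + 1): i + 1].mean()
        labels.append("en" if window_mean > threshold else "zh")
    return labels
```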
In an alternative embodiment of the present invention, before the determining the language type of the speech frame in the speech information according to the multi-language acoustic model, the method may further include:
S21, determining a target language type from the at least two language types;
Step S22, decoding each voice frame in the voice information according to the decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
after the determining the language type of the speech frame in the speech information according to the multi-language acoustic model, the method may further include:
Determining a target voice frame from voice frames of the voice information, and determining a second decoding result of the target voice frame; wherein the language type of the target voice frame is a non-target language type;
the decoding the voice frame according to the decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame may specifically include: decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
The determining, according to the first decoding result, a recognition result corresponding to the voice information may specifically include: replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and taking the replaced second decoding result as a recognition result corresponding to the voice information.
In a specific application, a user typically uses a mixture of two language types for expression, and most sentences use one of the two languages, with only a small portion of sentences interspersed with the other language. In addition, when the voice information is short, for example when it contains only one English word, the decoding result may not be accurate enough because a single word has no context information at the time of decoding.
Thus, the embodiment of the invention can determine a target language type from the at least two language types, where the target language type can be the main language used in the mixed-language expression; for example, the target language type may be determined to be Chinese. In the process of decoding the voice information, each voice frame in the voice information is decoded according to the Chinese decoding network to obtain a second decoding result (such as R1) corresponding to each voice frame, where R1 is a Chinese decoding result. Because the second decoding result is obtained by decoding a complete section of voice information, the context of each voice frame can be referred to during decoding, which improves the accuracy of the second decoding result.
After all voice frames in the voice information have been decoded by the decoding network corresponding to the target language type, the target voice frames are determined from the voice frames of the voice information; the language type of a target voice frame is a non-target language type. For example, for mixed Chinese-English voice information, if the target language type is determined to be Chinese, then English is the non-target language type; that is, the voice frames whose language type is English can be determined from the voice information as target voice frames, and a first, English decoding result (such as R2) of the target voice frames is determined, where R2 is obtained by decoding the target voice frames according to the English decoding network. Finally, the corresponding part of R1 is replaced with R2 to obtain the recognition result corresponding to the voice information.
In one application example of the present invention, it is assumed that the voice information to be recognized is "I like apple" and that the target language type is Chinese. Specifically, the voice information is first input into the multi-language acoustic model to obtain the state posterior probability sequence corresponding to each voice frame, and the posterior probabilities of the Chinese states of each voice frame are decoded according to the Chinese decoding network to obtain a second decoding result of each voice frame; assume that the second decoding result of the voice information is "I like love broken" (the English word "apple" mis-decoded as phonetically similar text by the Chinese decoding network). Then, according to the posterior probability of each state corresponding to each voice frame and the language type corresponding to each state, the language type of each voice frame is determined, and the voice frames whose language type is English are determined as target voice frames. The target voice frames are decoded according to the English decoding network to obtain a first, English decoding result corresponding to the target voice frames, which is assumed to be "apple". Finally, "love broken", the part of the second decoding result "I like love broken" corresponding to "apple", is replaced with "apple", and the following text is obtained as the replaced second decoding result: "I like apple".
It should be noted that, in the embodiment of the present invention, for a voice frame whose language type is the target language type, the first decoding result and the second decoding result are the same. For example, in the above example, the voice frames corresponding to "I like" are of the Chinese language type, and the target language type is also Chinese, so both the first decoding result and the second decoding result of the voice frames corresponding to "I like" are the text "I like".
In an alternative embodiment of the present invention, the first decoding result and the second decoding result may include: time boundary information corresponding to the voice frame;
the replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame may specifically include:
step S31, determining a replaced result from the second decoding result of the target voice frame; wherein the replaced result coincides with a time boundary of a first decoding result of the language type corresponding to the target voice frame;
and S32, replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
In order to ensure that the second decoding result of the target voice frame can be accurately replaced by the first decoding result of the language type corresponding to the target voice frame, the first decoding result and the second decoding result of the embodiment of the invention may include: time boundary information corresponding to the speech frame.
For example, in the above example, each word in the second decoding result "I like love broken" includes the time boundary information of the voice frames corresponding to that word. The replaced result can therefore be determined from the second decoding result according to the time boundary information, so that the replaced result coincides with the time boundary of the first decoding result of the language type corresponding to the target voice frames. Following the above example, the first decoding result of the language type corresponding to the target voice frames is "apple"; assuming that the part of the second decoding result "I like love broken" that coincides with the time boundary of "apple" is determined to be "love broken", then "love broken" in "I like love broken" can be replaced with "apple", and the replaced decoding result is "I like apple".
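A minimal sketch of this time-boundary replacement is shown below. Decoding results are represented as (start_ms, end_ms, word) tuples; this representation and the helper name are assumptions for the illustration, not a format specified by this disclosure.

```python
def splice_results(second_result, first_results_for_targets):
    """Replace the words of the second decoding result whose time boundaries coincide
    with a first decoding result obtained for the target (non-target-language) frames.

    second_result:             list of (start_ms, end_ms, word) from the target-language network
    first_results_for_targets: list of (start_ms, end_ms, word) from the other language's network
    """
    out = []
    for start, end, word in second_result:
        covering = next((w for s, e, w in first_results_for_targets
                         if s <= start and end <= e), None)
        if covering is None:
            out.append(word)                  # keep the original word
        elif not out or out[-1] != covering:  # insert the replacement once per covered span
            out.append(covering)
    return " ".join(out)

# "I like love broken" with the target span re-decoded in English as "apple":
second = [(0, 300, "I"), (300, 600, "like"), (600, 900, "love"), (900, 1200, "broken")]
first = [(600, 1200, "apple")]
print(splice_results(second, first))  # -> "I like apple"
```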
In an alternative embodiment of the present invention, the decoding network may specifically include: a general decoding network and a professional decoding network; wherein, the general decoding network may include: training the obtained language model according to the general text corpus; the professional decoding network may include: training the obtained language model according to the text corpus in the preset field;
The decoding the voice frame according to the decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame may specifically include:
step S41, decoding the voice frame according to the general decoding network and the professional decoding network respectively to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
And step S42, taking the decoding result with the high score in the first score and the second score as a first decoding result of the voice frame.
In a specific application, the general decoding network usually decodes everyday user speech well; however, speech from some professional fields, such as the medical field, usually contains many professional terms, such as "aspirin" and "Parkinson's disease", which affects the decoding effect.
To solve the above-described problem, the decoding network of the embodiment of the present invention may include a general decoding network and a professional decoding network. The general decoding network may be one used for users' everyday communication and may include a language model trained on a general text corpus; it can recognize the everyday speech of most users. The professional decoding network may be a decoding network customized for a professional domain and may include a language model trained on a text corpus of a preset field; the preset field can be any field such as the medical field, the legal field, the computer field, and the like.
For example, at a medical seminar, a presenter may use many sentences mixing Chinese and English, as well as a large number of medical terms.
Specifically, the presenter's speech can be decoded frame by frame according to the general decoding network and the professional decoding network respectively, so as to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network, and the decoding result with the higher of the first score and the second score is used as the first decoding result of the voice frame.
It may be understood that the decoding network of the embodiment of the present invention may include a plurality of decoding networks corresponding to different language types, and each decoding network of a language type may further include a general decoding network and a professional decoding network corresponding to the language type. Therefore, the embodiment of the invention can supplement or correct the decoding result of the general decoding network through the professional decoding network, and can improve the decoding accuracy under the condition that the voice information contains professional domain vocabulary.
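A hedged sketch of steps S41-S42 follows. Each decoder is assumed to return a (score, text) pair where a higher score means a better decoding path; the lambda decoders and their scores below are purely illustrative placeholders.

```python
def decode_with_domain_backoff(frame, general_decoder, professional_decoder):
    """Decode one voice frame with both networks and keep the higher-scoring result."""
    general_score, general_text = general_decoder(frame)
    professional_score, professional_text = professional_decoder(frame)
    if professional_score > general_score:
        return professional_score, professional_text  # professional network supplements/corrects
    return general_score, general_text

# Toy usage: the professional (e.g. medical) network wins on a domain-specific word.
general = lambda f: (-42.0, "a spear in")      # placeholder path score and text
professional = lambda f: (-35.5, "aspirin")
print(decode_with_domain_backoff(None, general, professional))  # -> (-35.5, 'aspirin')
```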
It will be appreciated that embodiments of the present invention are not limited in the manner in which the multilingual acoustic model is trained. In an alternative embodiment of the present invention, each piece of the acoustic data of the at least two language types corresponds to at least two language types.
Specifically, the embodiment of the invention can collect mixed acoustic data containing at least two language types to train the multilingual acoustic model, where mixed acoustic data means that each piece of data corresponds to at least two language types. For example, the speech corresponding to "I like apple" may be one piece of mixed acoustic data.
Training a multilingual acoustic model on mixed acoustic data requires merging similar pronunciation units of the different languages to generate a pronunciation dictionary adapted to the mixed language; however, certain errors may be introduced when merging pronunciation units. In addition, mixed acoustic data containing at least two language types is often rare and difficult to collect, which will affect the recognition accuracy of the multilingual acoustic model.
To solve the above-mentioned problem, in an alternative embodiment of the present invention, each of the acoustic data of the at least two language types corresponds to one language type.
Specifically, the embodiment of the invention can collect single-language acoustic data corresponding to each of the at least two language types, and train the multilingual acoustic model on the training data set formed by the single-language data of each language type. For example, the speech corresponding to "weather today" may be one piece of single-language acoustic data, and the speech corresponding to "What's the weather like today" may be another piece of single-language acoustic data.
In an alternative embodiment of the present invention, the training step of the multilingual acoustic model may specifically include:
Step S51, training a single-language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
Step S52, respectively carrying out state labeling on the acoustic data of the at least two language types according to the single-language acoustic model, wherein the states and the language types have corresponding relations;
And step S53, training a multilingual acoustic model according to a dataset formed by the acoustic data of at least two marked language types.
Specifically, a single-language acoustic model NN1 corresponding to Chinese may be trained according to the collected Chinese acoustic data L1, where the language type corresponding to each piece of data in L1 is Chinese. The number of tied HMM states of Chinese speech can be set as the number of nodes of the NN1 output layer, for example M1. The output of the single-language acoustic model may include the state probabilities corresponding to the language type; that is, the state probabilities of the M1 output-layer nodes all correspond to the Chinese language type.
Likewise, a single-language acoustic model NN2 corresponding to English may be trained according to the collected English acoustic data L2, where the language type corresponding to each piece of data in L2 is English. The number of tied HMM states of English speech can be set as the number of nodes of the NN2 output layer, for example M2, and the state probabilities of these M2 nodes correspond to the English language type.
Then, forced alignment is performed on the Chinese acoustic data L1 and the English acoustic data L2 according to the trained NN1 and NN2 respectively, so as to label the states of L1 and L2. Specifically, the state corresponding to each voice frame of each piece of data in L1 may be determined by NN1, and the state corresponding to each voice frame of each piece of data in L2 may be determined by NN2.
Finally, the labeled L1 and L2 are mixed together to obtain a labeled dataset (L1+L2) with which the multilingual acoustic model NN3 is trained. The output of the multilingual acoustic model may include the state probabilities corresponding to the at least two language types. For example, the number of nodes of the NN3 output layer may be M1+M2, where the first M1 nodes may correspond to the Chinese HMM states and the last M2 nodes may correspond to the English HMM states.
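The data preparation in steps S51-S53 can be sketched as follows. The alignment callables stand in for forced alignment with NN1/NN2, the `.features` attribute is an assumed field on each utterance object, and only the pooling of state labels into a single M1+M2 label space is shown.

```python
def build_multilingual_training_set(chinese_data, english_data, align_zh, align_en, m1):
    """Pool state-labelled single-language data into one training set for NN3.

    align_zh / align_en: forced-alignment functions of the single-language models NN1 / NN2,
                         mapping an utterance to its per-frame HMM state index sequence.
    m1:                  number of Chinese tied states; English state indices are offset by m1
                         so the pooled label space covers M1 + M2 states.
    """
    dataset = []
    for utt in chinese_data:
        dataset.append((utt.features, list(align_zh(utt))))              # labels in [0, M1)
    for utt in english_data:
        dataset.append((utt.features, [s + m1 for s in align_en(utt)]))  # labels in [M1, M1+M2)
    return dataset  # NN3 is then trained on this set with an (M1 + M2)-node output layer
```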
In the process of training the multi-language acoustic model, the embodiment of the invention can use the single-language acoustic data corresponding to each language type to reserve the pronunciation characteristics of each language type, so that the method has certain differentiation on different language types in the acoustic layer. In addition, in the process of collecting training data, acoustic data of each language type are collected respectively, so that the problem of insufficient data caused by collecting mixed acoustic data of multiple language types can be avoided, and the accuracy of recognition of the multilingual acoustic model can be improved.
In summary, the embodiment of the invention can train a multi-language acoustic model on acoustic data of at least two language types and determine the language type of each voice frame in the voice information through the multi-language acoustic model. Therefore, when the voice information contains multiple language types, voice frames of different language types can be accurately distinguished, and each voice frame can be decoded according to the decoding network of the corresponding language type to obtain its first decoding result; since the first decoding result is obtained by decoding with the decoding network corresponding to the language type of the voice frame, the decoding accuracy is ensured and the accuracy of voice recognition is further improved.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Device embodiment
With reference to FIG. 2, there is shown a block diagram of an embodiment of a data processing apparatus of the present invention, which may include in particular:
a type determining module 201, configured to determine a language type of a speech frame in the speech information according to the multilingual acoustic model; the multi-language acoustic model is obtained through training according to acoustic data of at least two language types;
a first decoding module 202, configured to decode the speech frame according to a decoding network corresponding to a language type of the speech frame, so as to obtain a first decoding result of the speech frame;
And the result determining module 203 is configured to determine a recognition result corresponding to the voice information according to the first decoding result.
Optionally, the type determining module may specifically include:
The probability determination submodule is used for determining posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the states and the language types have corresponding relations;
The ratio determining submodule is used for determining the probability ratio of the posterior probability of the voice frame to the state of each language type according to the posterior probability of the voice frame to each state and the language type corresponding to each state;
and the type determining submodule is used for determining the language type of the voice frame according to the probability ratio.
Optionally, the apparatus may further include:
The target language determining module is used for determining a target language type from the at least two language types;
the second decoding module is used for decoding each voice frame in the voice information according to the decoding network corresponding to the target language type so as to obtain a second decoding result of each voice frame;
The apparatus may further include:
a target frame determining module, configured to determine a target speech frame from speech frames of the speech information, and determine a second decoding result of the target speech frame; wherein the language type of the target voice frame is a non-target language type;
the first decoding module may specifically include:
the first decoding submodule is used for decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
The result determining module may specifically include:
the first result determining sub-module is used for replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and using the replaced second decoding result as a recognition result corresponding to the voice information.
Optionally, the first decoding result and the second decoding result include: time boundary information corresponding to the voice frame;
the first result determining submodule may specifically include:
A result determination unit configured to determine a replaced result from a second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of the language type corresponding to the target voice frame;
And the replacing unit is used for replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
Optionally, the decoding network may specifically include: a general decoding network and a professional decoding network; wherein, the general decoding network comprises: training the obtained language model according to the general text corpus; the professional decoding network comprises the following steps: training the obtained language model according to the text corpus in the preset field;
the first decoding module may specifically include:
The score determining submodule is used for respectively decoding the voice frames according to the general decoding network and the professional decoding network to obtain a first score of the voice frames corresponding to the general decoding network and a second score of the voice frames corresponding to the professional decoding network;
and the second result determining submodule is used for taking the decoding result with the high score in the first score and the second score as the first decoding result of the voice frame.
Optionally, the apparatus may further include: the model training module is used for training the multilingual acoustic model; the model training module may specifically include:
the first training submodule is used for training a single-language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
the state labeling sub-module is used for respectively labeling states of the acoustic data of the at least two language types according to the single-language acoustic model, wherein the states and the language types have a corresponding relation;
And the second training sub-module is used for training the multilingual acoustic model according to the data set formed by the acoustic data of at least two marked language types.
Optionally, each of the acoustic data of the at least two language types corresponds to at least two language types; or each of the acoustic data of the at least two language types corresponds to one language type.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
An embodiment of the present invention provides an apparatus for data processing, including a memory, and one or more programs, wherein the one or more programs are stored in the memory, and configured to be executed by one or more processors, the one or more programs comprising instructions for: determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained through training according to acoustic data of at least two language types; decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame; and determining a recognition result corresponding to the voice information according to the first decoding result.
Fig. 3 is a block diagram illustrating an apparatus 800 for data processing according to an example embodiment. For example, apparatus 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 3, apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen between the device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect an on/off state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; the sensor assembly 814 may also detect a change in position of the apparatus 800 or of one of its components, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in the temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the apparatus 800 and other devices, either in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of apparatus 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. The memory 1932 and the storage media 1930 may be transient or persistent storage. The program stored in a storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processor 1922 may be configured to communicate with the storage medium 1930 to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the data processing method shown in Fig. 1.
A non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of an apparatus (server or terminal), cause the apparatus to perform a data processing method, the method comprising: determining the language type of a voice frame in voice information according to a multi-language acoustic model, wherein the multi-language acoustic model is obtained through training according to acoustic data of at least two language types; decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame; and determining a recognition result corresponding to the voice information according to the first decoding result.
Embodiments of the invention disclose A1, a data processing method, comprising the following steps: determining the language type of a voice frame in voice information according to a multi-language acoustic model, wherein the multi-language acoustic model is obtained through training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
And determining a recognition result corresponding to the voice information according to the first decoding result.
A2, the method of A1, wherein the determining the language type of the voice frame in the voice information comprises the following steps:
determining posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the states and the language types have corresponding relations;
Determining the probability ratio of the posterior probability of the voice frame to the state of each language type according to the posterior probability of the voice frame to each state and the language type corresponding to each state;
and determining the language type of the voice frame according to the probability ratio.
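By way of a non-limiting illustration of A2, the frame-level language decision can be sketched in a few lines of Python with NumPy. The state-to-language mapping, the size of the posterior vector, and the rule of simply taking the language with the largest probability ratio are assumptions made for the example only and are not details fixed by the disclosure.

import numpy as np

def frame_language(posteriors, state_langs, languages=("zh", "en")):
    # posteriors : per-state posterior probabilities for one voice frame
    #              (output of the multi-language acoustic model).
    # state_langs: language type that each state belongs to (same length).
    totals = {lang: 0.0 for lang in languages}
    for p, lang in zip(posteriors, state_langs):
        totals[lang] += float(p)                 # probability mass on each language's states
    denom = sum(totals.values()) or 1.0
    ratios = {lang: totals[lang] / denom for lang in languages}
    return max(ratios, key=ratios.get), ratios   # language type with the largest ratio

# toy usage: 6 states, the first 4 belong to Chinese, the last 2 to English
post = np.array([0.05, 0.10, 0.05, 0.10, 0.40, 0.30])
langs = ["zh", "zh", "zh", "zh", "en", "en"]
best, ratios = frame_language(post, langs)
print(best, round(ratios["en"], 2))              # en 0.7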
A3, the method of A1, wherein before the determining the language type of the voice frame in the voice information according to the multi-language acoustic model, the method further comprises:
determining a target language type from the at least two language types;
decoding each voice frame in the voice information according to the decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
After the determining the language type of the speech frame in the speech information according to the multi-language acoustic model, the method further comprises:
Determining a target voice frame from voice frames of the voice information, and determining a second decoding result of the target voice frame; wherein the language type of the target voice frame is a non-target language type;
The decoding the voice frame according to a decoding network corresponding to the language type of the voice frame, so as to obtain a first decoding result of the voice frame, includes:
decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
the determining, according to the first decoding result, a recognition result corresponding to the voice information includes:
replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and taking the replaced second decoding result as a recognition result corresponding to the voice information.
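By way of a non-limiting illustration of A3, the following Python sketch groups consecutive voice frames whose detected language type differs from the target language type; each such run would then be decoded by the decoding network of its own language type to produce the first decoding result that later replaces the corresponding part of the second decoding result. The list-of-strings frame representation is an assumption made for the example.

def find_nontarget_runs(frame_langs, target_lang):
    # Group consecutive frames whose detected language type is not the target language type.
    runs, start = [], None
    for i, lang in enumerate(list(frame_langs) + [target_lang]):  # sentinel flushes the last run
        if lang != target_lang and start is None:
            start = i
        elif lang == target_lang and start is not None:
            runs.append((start, i, frame_langs[start]))
            start = None
    return runs

# toy usage: the middle three frames of the utterance were detected as English
frame_langs = ["zh", "zh", "en", "en", "en", "zh"]
print(find_nontarget_runs(frame_langs, "zh"))    # [(2, 5, 'en')]

Each (start, end, language) run corresponds to target voice frames that would be passed to the decoding network of that language type to obtain their first decoding result.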
A4, the method according to A3, wherein the first decoding result and the second decoding result include: time boundary information corresponding to the speech frame;
the replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame includes:
determining a replaced result from the second decoding result of the target voice frame; wherein the replaced result coincides with a time boundary of a first decoding result of the language type corresponding to the target voice frame;
And replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
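By way of a non-limiting illustration of A4, the replacement by time boundary can be sketched as follows; the (word, start_time, end_time) tuple representation of a decoding result is an assumption made for the example.

def splice_by_time(second_result, first_result):
    # Replace the part of the second decoding result whose time boundary coincides
    # with the time boundary of the first decoding result.
    # Each result is a time-ordered list of (word, start_time, end_time) tuples.
    lo = first_result[0][1]                       # start of the first decoding result
    hi = first_result[-1][2]                      # end of the first decoding result
    kept_before = [seg for seg in second_result if seg[2] <= lo]
    kept_after = [seg for seg in second_result if seg[1] >= hi]
    return kept_before + first_result + kept_after

second = [("你好", 0.0, 0.4), ("SOFA", 0.4, 0.9), ("WARE", 0.9, 1.3), ("很好", 1.3, 1.8)]
first = [("software", 0.4, 1.3)]                  # English-network result for the same span
print(splice_by_time(second, first))
# [('你好', 0.0, 0.4), ('software', 0.4, 1.3), ('很好', 1.3, 1.8)]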
A5, the method of A1, the decoding network comprising: a general decoding network and a professional decoding network; wherein the general decoding network comprises: a language model obtained by training on a general text corpus; and the professional decoding network comprises: a language model obtained by training on a text corpus of a preset field;
the decoding the voice frame according to a decoding network corresponding to the language type of the voice frame, so as to obtain a first decoding result of the voice frame, includes:
decoding the voice frame according to the general decoding network and the professional decoding network respectively, so as to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
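By way of a non-limiting illustration of A5, the selection between the general decoding network and the professional decoding network can be sketched as follows; the decoder callables, the (hypothesis, score) return convention, and the assumption that a larger score indicates a better match are made for the example only.

def decode_with_best_network(frames, general_decode, professional_decode):
    # Decode with both networks and keep the result whose score is higher.
    # Each decoder is a callable returning (hypothesis, score).
    general_hyp, first_score = general_decode(frames)
    domain_hyp, second_score = professional_decode(frames)
    return general_hyp if first_score >= second_score else domain_hyp

# toy usage with stand-in decoders
frames = [[0.1, 0.2], [0.3, 0.4]]
general = lambda f: ("turn on the light", -12.7)
domain = lambda f: ("turn on the endoscope light source", -9.3)  # domain language model scores higher
print(decode_with_best_network(frames, general, domain))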
A6, the method according to A1, wherein the training step of the multilingual acoustic model comprises the following steps:
Respectively training a single-language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
Respectively carrying out state labeling on the acoustic data of the at least two language types according to the single-language acoustic model, wherein the states and the language types have corresponding relations;
Training a multilingual acoustic model according to a dataset consisting of the labeled acoustic data of the at least two language types.
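By way of a non-limiting illustration of A6, the following Python/NumPy sketch mimics the training flow: a stand-in single-language model is fitted per language type, it is used to attach frame-level state labels, the state inventories of the two languages are kept disjoint so that states and language types remain in correspondence, and the labeled data are merged into the dataset on which a multilingual acoustic model would be trained. The nearest-centroid stand-in, the feature dimensionality, and the state counts are assumptions made for the example and are not the actual acoustic models of the disclosure.

import numpy as np

def train_single_lang_states(feats, n_states, seed=0):
    # Stand-in "single-language acoustic model": centroids refined once from a
    # random initialisation; the centroids are later used to label frame states.
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), n_states, replace=False)]
    labels = np.argmin(((feats[:, None] - centroids) ** 2).sum(-1), axis=1)
    for k in range(n_states):                     # one refinement step is enough for a sketch
        if np.any(labels == k):
            centroids[k] = feats[labels == k].mean(axis=0)
    return centroids

def label_states(feats, centroids, state_offset):
    # State labeling: nearest centroid, shifted so each language owns a disjoint state range.
    return np.argmin(((feats[:, None] - centroids) ** 2).sum(-1), axis=1) + state_offset

# toy corpora: 2-dimensional features for two language types
zh_feats = np.random.default_rng(1).normal(0.0, 1.0, (200, 2))
en_feats = np.random.default_rng(2).normal(3.0, 1.0, (200, 2))

zh_centroids = train_single_lang_states(zh_feats, n_states=4)
en_centroids = train_single_lang_states(en_feats, n_states=4)

# States 0-3 belong to zh and states 4-7 to en, preserving the state/language correspondence.
zh_labels = label_states(zh_feats, zh_centroids, state_offset=0)
en_labels = label_states(en_feats, en_centroids, state_offset=4)

# The merged, labeled dataset on which a single multilingual acoustic model would be trained.
all_feats = np.concatenate([zh_feats, en_feats])
all_states = np.concatenate([zh_labels, en_labels])
print(all_feats.shape, np.unique(all_states))     # (400, 2) and the disjoint state inventory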
A7, the method of any one of A1 to A6, wherein each piece of the acoustic data of the at least two language types corresponds to at least two language types; or each piece of the acoustic data of the at least two language types corresponds to one language type.
Embodiments of the invention disclose B8, a data processing apparatus, comprising:
the type determining module is used for determining the language type of the voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained through training according to acoustic data of at least two language types;
the first decoding module is used for decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
And the result determining module is used for determining a recognition result corresponding to the voice information according to the first decoding result.
B9, the apparatus of B8, the type determining module comprising:
The probability determination submodule is used for determining posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the states and the language types have corresponding relations;
The ratio determining submodule is used for determining the probability ratio of the posterior probability of the voice frame to the state of each language type according to the posterior probability of the voice frame to each state and the language type corresponding to each state;
and the type determining submodule is used for determining the language type of the voice frame according to the probability ratio.
B10, the apparatus of B8, the apparatus further comprising:
The target language determining module is used for determining a target language type from the at least two language types;
the second decoding module is used for decoding each voice frame in the voice information according to the decoding network corresponding to the target language type so as to obtain a second decoding result of each voice frame;
The apparatus further comprises:
a target frame determining module, configured to determine a target speech frame from speech frames of the speech information, and determine a second decoding result of the target speech frame; wherein the language type of the target voice frame is a non-target language type;
the first decoding module includes:
the first decoding submodule is used for decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
The result determining module includes:
the first result determining sub-module is used for replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and using the replaced second decoding result as a recognition result corresponding to the voice information.
B11, the apparatus of B10, the first decoding result, and the second decoding result include: time boundary information corresponding to the speech frame;
The first result determination submodule includes:
A result determination unit configured to determine a replaced result from a second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of the language type corresponding to the target voice frame;
And the replacing unit is used for replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
B12, the apparatus of B8, the decoding network comprising: a general decoding network and a professional decoding network; wherein the general decoding network comprises: a language model obtained by training on a general text corpus; and the professional decoding network comprises: a language model obtained by training on a text corpus of a preset field;
the first decoding module includes:
The score determining submodule is used for respectively decoding the voice frames according to the general decoding network and the professional decoding network to obtain a first score of the voice frames corresponding to the general decoding network and a second score of the voice frames corresponding to the professional decoding network;
and the second result determining submodule is used for taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
B13, the apparatus of B8, the apparatus further comprising: the model training module is used for training the multilingual acoustic model; the model training module comprises:
the first training submodule is used for training a single-language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
the state labeling sub-module is used for respectively labeling states of the acoustic data of the at least two language types according to the single-language acoustic model, wherein the states and the language types have a corresponding relation;
And the second training sub-module is used for training the multilingual acoustic model according to the dataset formed by the labeled acoustic data of the at least two language types.
B14, the apparatus of any one of B8 to B13, wherein each piece of the acoustic data of the at least two language types corresponds to at least two language types; or each piece of the acoustic data of the at least two language types corresponds to one language type.
Embodiments of the invention disclose C15, a device for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs comprise instructions for:
Determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained through training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
And determining a recognition result corresponding to the voice information according to the first decoding result.
C16, the device of C15, wherein the determining the language type of the voice frame in the voice information according to the multi-language acoustic model includes:
determining posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the states and the language types have corresponding relations;
Determining the probability ratio of the posterior probability of the voice frame to the state of each language type according to the posterior probability of the voice frame to each state and the language type corresponding to each state;
and determining the language type of the voice frame according to the probability ratio.
C17, the device of C15, wherein the device is further configured to execute, by the one or more processors, the one or more programs comprising instructions for:
determining a target language type from the at least two language types;
decoding each voice frame in the voice information according to the decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
The device is also configured to execute, by the one or more processors, the one or more programs including instructions for:
Determining a target voice frame from voice frames of the voice information, and determining a second decoding result of the target voice frame; wherein the language type of the target voice frame is a non-target language type;
The decoding the voice frame according to a decoding network corresponding to the language type of the voice frame, so as to obtain a first decoding result of the voice frame, includes:
decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
the determining, according to the first decoding result, a recognition result corresponding to the voice information includes:
replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and taking the replaced second decoding result as a recognition result corresponding to the voice information.
C18, the apparatus of C17, the first decoding result, and the second decoding result comprising: time boundary information corresponding to the speech frame;
the replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame includes:
determining a replaced result from the second decoding result of the target voice frame; wherein the replaced result coincides with a time boundary of a first decoding result of the language type corresponding to the target voice frame;
And replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
C19, the apparatus of C15, the decoding network comprising: a general decoding network and a professional decoding network; wherein the general decoding network comprises: a language model obtained by training on a general text corpus; and the professional decoding network comprises: a language model obtained by training on a text corpus of a preset field;
the decoding the voice frame according to a decoding network corresponding to the language type of the voice frame, so as to obtain a first decoding result of the voice frame, includes:
decoding the voice frame according to the general decoding network and the professional decoding network respectively, so as to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
C20, the device according to C15, the training step of the multilingual acoustic model comprises:
Respectively training a single-language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
Respectively carrying out state labeling on the acoustic data of the at least two language types according to the single-language acoustic model, wherein the states and the language types have corresponding relations;
Training a multilingual acoustic model according to a dataset consisting of the labeled acoustic data of the at least two language types.
C21, the device of any one of C15 to C20, wherein each piece of the acoustic data of the at least two language types corresponds to at least two language types; or each piece of the acoustic data of the at least two language types corresponds to one language type.
Embodiments of the invention disclose D22, a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
The foregoing has described in detail a data processing method, a data processing apparatus, and a device for data processing. Specific examples are provided herein to illustrate the principles and embodiments of the present invention, and the above examples are intended only to assist in understanding the method and core idea of the present invention. Meanwhile, those skilled in the art may make variations to the specific embodiments and the application scope in accordance with the ideas of the present invention, and in view of the above, this description should not be construed as limiting the present invention.

Claims (19)

1. A method of data processing, the method comprising:
According to the multi-language acoustic model, determining the respective language type of each voice frame in the voice information to be recognized; wherein the multi-language acoustic model is obtained through training according to acoustic data of at least two language types, the voice frames are obtained by segmenting the voice information, and there are a plurality of voice frames;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
determining a recognition result corresponding to the voice information according to the first decoding result of each voice frame;
before the determining the language type of the speech frame in the speech information according to the multi-language acoustic model, the method further comprises:
determining a target language type from the at least two language types;
decoding each voice frame in the voice information according to the decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
After the determining the language type of the speech frame in the speech information according to the multi-language acoustic model, the method further comprises:
Determining a target voice frame from voice frames of the voice information, and determining a second decoding result of the target voice frame; wherein the language type of the target voice frame is a non-target language type;
The decoding the voice frame according to a decoding network corresponding to the language type of the voice frame, so as to obtain a first decoding result of the voice frame, includes:
decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
The determining, according to the first decoding result of each voice frame, a recognition result corresponding to the voice information includes:
replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and taking the replaced second decoding result as a recognition result corresponding to the voice information.
2. The method of claim 1, wherein the determining, based on the multilingual acoustic model, a respective language type of each speech frame in the speech information to be recognized, comprises:
determining posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the states and the language types have corresponding relations;
Determining the probability ratio of the posterior probability of the voice frame to the state of each language type according to the posterior probability of the voice frame to each state and the language type corresponding to each state;
and determining the language type of the voice frame according to the probability ratio.
3. The method of claim 1, wherein the first decoding result, and the second decoding result comprise: time boundary information corresponding to the speech frame;
the replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame includes:
determining a replaced result from the second decoding result of the target voice frame; wherein the replaced result coincides with a time boundary of a first decoding result of the language type corresponding to the target voice frame;
And replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
5. The method of claim 1, wherein the decoding network comprises: a general decoding network and a professional decoding network; wherein the general decoding network comprises: a language model obtained by training on a general text corpus; and the professional decoding network comprises: a language model obtained by training on a text corpus of a preset field;
the decoding the voice frame according to a decoding network corresponding to the language type of the voice frame, so as to obtain a first decoding result of the voice frame, includes:
decoding the voice frame according to the general decoding network and the professional decoding network respectively, so as to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
5. The method of claim 1, wherein the training step of the multilingual acoustic model comprises:
Respectively training a single-language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
Respectively carrying out state labeling on the acoustic data of the at least two language types according to the single-language acoustic model, wherein the states and the language types have corresponding relations;
Training a multilingual acoustic model according to a dataset consisting of the labeled acoustic data of the at least two language types.
6. The method of any one of claims 1 to 5, wherein each piece of the acoustic data of the at least two language types corresponds to at least two language types; or each piece of the acoustic data of the at least two language types corresponds to one language type.
7. A data processing apparatus, the apparatus comprising:
The type determining module is used for determining, according to the multi-language acoustic model, the respective language type of each voice frame in the voice information to be recognized; wherein the multi-language acoustic model is obtained through training according to acoustic data of at least two language types, the voice frames are obtained by segmenting the voice information, and there are a plurality of voice frames;
the first decoding module is used for decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
the result determining module is used for determining a recognition result corresponding to the voice information according to the first decoding result of each voice frame;
The apparatus further comprises:
The target language determining module is used for determining a target language type from the at least two language types;
the second decoding module is used for decoding each voice frame in the voice information according to the decoding network corresponding to the target language type so as to obtain a second decoding result of each voice frame;
The apparatus further comprises:
a target frame determining module, configured to determine a target speech frame from speech frames of the speech information, and determine a second decoding result of the target speech frame; wherein the language type of the target voice frame is a non-target language type;
the first decoding module includes:
the first decoding submodule is used for decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
The result determining module includes:
the first result determining sub-module is used for replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and using the replaced second decoding result as a recognition result corresponding to the voice information.
8. The apparatus of claim 7, wherein the type determination module comprises:
The probability determination submodule is used for determining posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the states and the language types have corresponding relations;
The ratio determining submodule is used for determining the probability ratio of the posterior probability of the voice frame to the state of each language type according to the posterior probability of the voice frame to each state and the language type corresponding to each state;
and the type determining submodule is used for determining the language type of the voice frame according to the probability ratio.
9. The apparatus of claim 7, wherein the first decoding result, and the second decoding result comprise: time boundary information corresponding to the speech frame;
The first result determination submodule includes:
A result determination unit configured to determine a replaced result from a second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of the language type corresponding to the target voice frame;
And the replacing unit is used for replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
10. The apparatus of claim 7, wherein the decoding network comprises: a general decoding network and a professional decoding network; wherein the general decoding network comprises: a language model obtained by training on a general text corpus; and the professional decoding network comprises: a language model obtained by training on a text corpus of a preset field;
the first decoding module includes:
the score determining submodule is used for respectively decoding the voice frame according to the general decoding network and the professional decoding network, so as to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and the second result determining submodule is used for taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
11. The apparatus of claim 7, wherein the apparatus further comprises: the model training module is used for training the multilingual acoustic model; the model training module comprises:
the first training submodule is used for training a single-language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
the state labeling sub-module is used for respectively labeling states of the acoustic data of the at least two language types according to the single-language acoustic model, wherein the states and the language types have a corresponding relation;
And the second training sub-module is used for training the multilingual acoustic model according to the dataset formed by the labeled acoustic data of the at least two language types.
12. The apparatus of any one of claims 7 to 11, wherein each piece of the acoustic data of the at least two language types corresponds to at least two language types; or each piece of the acoustic data of the at least two language types corresponds to one language type.
13. An apparatus for data processing comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
According to the multi-language acoustic model, determining the respective language type of each voice frame in the voice information to be recognized; wherein the multi-language acoustic model is obtained through training according to acoustic data of at least two language types, the voice frames are obtained by segmenting the voice information, and there are a plurality of voice frames;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame so as to obtain a first decoding result of the voice frame;
determining a recognition result corresponding to the voice information according to the first decoding result of each voice frame;
The device is also configured to execute, by one or more processors, the one or more programs including instructions for:
determining a target language type from the at least two language types;
decoding each voice frame in the voice information according to the decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
The device is also configured to execute, by one or more processors, the one or more programs including instructions for:
Determining a target voice frame from voice frames of the voice information, and determining a second decoding result of the target voice frame; wherein the language type of the target voice frame is a non-target language type;
The decoding the voice frame according to a decoding network corresponding to the language type of the voice frame, so as to obtain a first decoding result of the voice frame, includes:
decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
The determining, according to the first decoding result of each voice frame, a recognition result corresponding to the voice information includes:
replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and taking the replaced second decoding result as a recognition result corresponding to the voice information.
14. The apparatus of claim 13, wherein the determining, based on the multilingual acoustic model, a respective language type of each speech frame in the speech information to be recognized, comprises:
determining posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the states and the language types have corresponding relations;
Determining the probability ratio of the posterior probability of the voice frame to the state of each language type according to the posterior probability of the voice frame to each state and the language type corresponding to each state;
and determining the language type of the voice frame according to the probability ratio.
15. The apparatus of claim 13, wherein the first decoding result, and the second decoding result comprise: time boundary information corresponding to the speech frame;
the replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame includes:
determining a replaced result from the second decoding result of the target voice frame; wherein the replaced result coincides with a time boundary of a first decoding result of the language type corresponding to the target voice frame;
And replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
16. The apparatus of claim 13, wherein the decoding network comprises: a general decoding network and a professional decoding network; wherein the general decoding network comprises: a language model obtained by training on a general text corpus; and the professional decoding network comprises: a language model obtained by training on a text corpus of a preset field;
the decoding the voice frame according to a decoding network corresponding to the language type of the voice frame, so as to obtain a first decoding result of the voice frame, includes:
decoding the voice frame according to the general decoding network and the professional decoding network respectively, so as to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
17. The apparatus of claim 13, wherein the training of the multilingual acoustic model comprises:
Respectively training a single-language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
Respectively carrying out state labeling on the acoustic data of the at least two language types according to the single-language acoustic model, wherein the states and the language types have corresponding relations;
Training a multilingual acoustic model according to a dataset consisting of the labeled acoustic data of the at least two language types.
18. The apparatus of any one of claims 13 to 17, wherein each piece of the acoustic data of the at least two language types corresponds to at least two language types; or each piece of the acoustic data of the at least two language types corresponds to one language type.
19. A machine readable medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the data processing method of one or more of claims 1 to 6.
CN201811603538.6A 2018-12-26 2018-12-26 Data processing method and device for data processing Active CN111369978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811603538.6A CN111369978B (en) 2018-12-26 2018-12-26 Data processing method and device for data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811603538.6A CN111369978B (en) 2018-12-26 2018-12-26 Data processing method and device for data processing

Publications (2)

Publication Number Publication Date
CN111369978A CN111369978A (en) 2020-07-03
CN111369978B true CN111369978B (en) 2024-05-17

Family

ID=71208723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811603538.6A Active CN111369978B (en) 2018-12-26 2018-12-26 Data processing method and device for data processing

Country Status (1)

Country Link
CN (1) CN111369978B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883113B (en) * 2020-07-30 2024-01-30 云知声智能科技股份有限公司 Voice recognition method and device
CN112185348B (en) * 2020-10-19 2024-05-03 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment
CN112652311B (en) * 2020-12-01 2021-09-03 北京百度网讯科技有限公司 Chinese and English mixed speech recognition method and device, electronic equipment and storage medium
CN113593531B (en) * 2021-07-30 2024-05-03 思必驰科技股份有限公司 Voice recognition model training method and system

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004049308A1 (en) * 2002-11-22 2004-06-10 Koninklijke Philips Electronics N.V. Speech recognition device and method
CN101201892A (en) * 2005-12-20 2008-06-18 游旭 Speech coding talking book and pickup body
EP2058799A1 (en) * 2007-11-02 2009-05-13 Harman/Becker Automotive Systems GmbH Method for preparing data for speech recognition and speech recognition system
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person
CN104143329A (en) * 2013-08-19 2014-11-12 腾讯科技(深圳)有限公司 Method and device for conducting voice keyword search
CN104143328A (en) * 2013-08-15 2014-11-12 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN105513589A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
KR20170007107A (en) * 2015-07-10 2017-01-18 한국전자통신연구원 Speech Recognition System and Method
EP3133595A1 (en) * 2015-08-20 2017-02-22 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN107195295A (en) * 2017-05-04 2017-09-22 百度在线网络技术(北京)有限公司 Audio recognition method and device based on Chinese and English mixing dictionary
CN107408111A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 End-to-end speech recognition
CN107403620A (en) * 2017-08-16 2017-11-28 广东海翔教育科技有限公司 A kind of audio recognition method and device
CN107526826A (en) * 2017-08-31 2017-12-29 百度在线网络技术(北京)有限公司 Phonetic search processing method, device and server
WO2018153213A1 (en) * 2017-02-24 2018-08-30 芋头科技(杭州)有限公司 Multi-language hybrid speech recognition method
CN108630192A (en) * 2017-03-16 2018-10-09 清华大学 A kind of non-methods for mandarin speech recognition, system and its building method
CN108711420A (en) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 Multilingual hybrid model foundation, data capture method and device, electronic equipment
CN108986791A (en) * 2018-08-10 2018-12-11 南京航空航天大学 For the Chinese and English languages audio recognition method and system in civil aviaton's land sky call field

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070130112A1 (en) * 2005-06-30 2007-06-07 Intelligentek Corp. Multimedia conceptual search system and associated search method
US20170011735A1 (en) * 2015-07-10 2017-01-12 Electronics And Telecommunications Research Institute Speech recognition system and method
US10180935B2 (en) * 2016-12-30 2019-01-15 Facebook, Inc. Identifying multiple languages in a content item

Also Published As

Publication number Publication date
CN111369978A (en) 2020-07-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant