CN111369978A - Data processing method and device and data processing device

Info

Publication number
CN111369978A
Authority
CN
China
Prior art keywords
decoding
language
frame
voice
voice frame
Prior art date
Legal status
Granted
Application number
CN201811603538.6A
Other languages
Chinese (zh)
Other versions
CN111369978B (en)
Inventor
周盼
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201811603538.6A priority Critical patent/CN111369978B/en
Publication of CN111369978A publication Critical patent/CN111369978A/en
Application granted granted Critical
Publication of CN111369978B publication Critical patent/CN111369978B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 - Hidden Markov Models [HMMs]
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems

Abstract

The embodiment of the invention provides a data processing method and device and a device for data processing. The method specifically comprises the following steps: determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types; decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame; and determining the recognition result corresponding to the voice information according to the first decoding result. The embodiment of the invention can improve the accuracy of voice recognition.

Description

Data processing method and device and data processing device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus, and an apparatus for data processing.
Background
Speech recognition technology, also known as ASR (Automatic Speech Recognition), aims at converting the vocabulary content of speech into computer-readable input, such as keystrokes, binary codes, or character sequences.
In daily language expression, multiple languages may be mixed. Taking mixed Chinese and English expression as an example, a user may intersperse English words and sentences while expressing himself or herself in Chinese, for example, "I bought the latest iPhone" or "play Yesterday Once More".
However, the current speech recognition technology is more accurate for speech recognition of a single language, and the recognition accuracy is significantly reduced when the speech includes multiple languages.
Disclosure of Invention
Embodiments of the present invention provide a data processing method and apparatus, and an apparatus for data processing, which can improve accuracy of speech recognition when speech includes multiple languages.
In order to solve the above problem, an embodiment of the present invention discloses a data processing method, where the method includes:
determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and determining the recognition result corresponding to the voice information according to the first decoding result.
In another aspect, an embodiment of the present invention discloses a data processing apparatus, where the apparatus includes:
the type determining module is used for determining the language type of a voice frame in the voice information according to the multilingual acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
the first decoding module is used for decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and the result determining module is used for determining the recognition result corresponding to the voice information according to the first decoding result.
In yet another aspect, an embodiment of the present invention discloses an apparatus for data processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and determining the recognition result corresponding to the voice information according to the first decoding result.
In yet another aspect, an embodiment of the invention discloses a machine-readable medium having stored thereon instructions, which, when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
According to the embodiment of the invention, a multilingual acoustic model can be obtained by training on acoustic data of at least two language types, and the language type of each speech frame in the speech information can be determined through the multilingual acoustic model. Therefore, when the speech information includes multiple language types, the embodiment of the invention can accurately distinguish speech frames of different language types in the speech information and decode each speech frame according to the decoding network of the corresponding language type to obtain the first decoding result of the speech frame. Because the first decoding result is obtained by decoding with the decoding network corresponding to the language type of the speech frame, the decoding accuracy can be ensured, and the accuracy of speech recognition can be further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of one data processing method embodiment of the present invention;
FIG. 2 is a block diagram of an embodiment of a data processing apparatus according to the present invention;
FIG. 3 is a block diagram of an apparatus 800 for data processing of the present invention; and
fig. 4 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a data processing method according to the present invention is shown, which may specifically include the following steps:
step 101, determining the language type of a voice frame in voice information according to a multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
step 102, decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and 103, determining a recognition result corresponding to the voice information according to the first decoding result.
The data processing method of the embodiment of the invention can be used for recognizing the voice information containing at least two language types, and can be applied to electronic equipment, including but not limited to: a server, a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a car computer, a desktop computer, a set-top box, a smart tv, a wearable device, and so on.
It can be understood that the manner of obtaining the voice information to be recognized is not limited in the embodiment of the present invention. For example, the electronic device may obtain the voice information to be recognized from a client or from the network through a wired or wireless connection, may record the voice information to be recognized in real time, or may obtain the voice information to be recognized from an instant messaging message in an instant messaging application.
In the embodiment of the present invention, the speech information to be recognized may be segmented into a plurality of speech frames according to a preset window length and frame shift, where each speech frame may be a speech segment, and the speech information may be decoded frame by frame. If the voice information to be recognized is analog voice information (for example, a recording of a user call), the analog voice information needs to be converted into digital voice information, and then the voice information needs to be segmented.
Wherein the window length may be used to represent the duration of each frame of the speech segment, and the frame shift may be used to represent the time difference between adjacent frames. For example, when the window length is 25 ms and the frame shift is 15 ms, the first speech frame covers 0-25 ms, the second speech frame covers 15-40 ms, and so on, so that the segmentation of the voice information to be recognized can be realized. It is understood that the specific window length and frame shift can be set according to actual requirements, and the embodiment of the present invention is not limited thereto.
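As an illustration only, the following Python sketch shows how frame boundaries could be derived from a window length and frame shift; the function name, parameter names, and the 16 kHz sample rate are assumptions for this sketch and are not part of the embodiment.

    def split_into_frames(num_samples, sample_rate=16000, window_ms=25, shift_ms=15):
        # window_ms is the window length (duration of one frame) and shift_ms is the
        # frame shift (time difference between adjacent frames), as in the 25 ms / 15 ms example.
        window = int(sample_rate * window_ms / 1000)   # samples per frame
        shift = int(sample_rate * shift_ms / 1000)     # samples between frame starts
        frames = []
        start = 0
        while start + window <= num_samples:
            frames.append((start, start + window))
            start += shift
        return frames

    # With one second of 16 kHz audio, frame 0 covers 0-25 ms, frame 1 covers 15-40 ms, and so on.
    frames = split_into_frames(num_samples=16000)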
Optionally, before segmenting the speech information to be recognized, the electronic device may further perform noise reduction processing on the speech information to be recognized, so as to improve subsequent processing capability on the speech information.
In the embodiment of the invention, the voice information can be input into the pre-trained multilingual acoustic model, and the voice recognition result is obtained based on the output of the multilingual acoustic model. The multilingual acoustic model may be a classification model that merges multiple neural networks. The neural networks include, but are not limited to, at least one of, or a combination, superposition, or nesting of at least two of, the following: CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) network, RNN (Recurrent Neural Network), attention neural network, and the like.
In order to improve the accuracy of recognizing the speech information containing multiple language types, the embodiment of the invention trains and obtains a multilingual acoustic model according to the acoustic data of at least two language types in advance, and the language type of a speech frame in the speech information can be determined according to the multilingual acoustic model, so that the speech frame can be decoded according to a decoding network corresponding to the language type to obtain a first decoding result corresponding to the speech frame, and further, the recognition result corresponding to the speech information can be determined according to the first decoding result.
It is understood that the number of language types and the specific language types included in the acoustic data for training the multilingual acoustic model are not limited by the embodiments of the present invention. For convenience of description, the embodiments of the present invention take speech information including both Chinese and English language types as an example, that is, the multilingual acoustic model may be obtained by training on collected Chinese acoustic data and English acoustic data. Of course, acoustic data of more than two language types, such as Chinese, English, Japanese, and German, may also be collected to train the multilingual acoustic model. For application scenarios with more than two language types, the implementation process is similar to that for two language types, and the two cases may refer to each other.
The decoding network of the embodiment of the invention can comprise decoding networks corresponding to at least two language types, for example, a Chinese decoding network and an English decoding network can be respectively constructed in the scene of recognizing Chinese and English mixed voice information. Specifically, a Chinese language model can be trained by collecting Chinese text corpora, and a Chinese decoding network is constructed according to knowledge sources such as the Chinese language model and a Chinese pronunciation dictionary; similarly, an English language model can be trained by collecting English text corpora, and an English decoding network can be constructed according to knowledge sources such as the English language model and an English pronunciation dictionary.
In the process of decoding the voice information frame by frame, if the language type of the voice frame is determined to be Chinese according to the multilingual acoustic model, the voice frame can be decoded according to a Chinese decoding network, and if the language type of the voice frame is determined to be English according to the multilingual acoustic model, the voice frame can be decoded according to an English decoding network.
In one application example of the present invention, it is assumed that the speech information to be recognized is "I like apple". Specifically, the language type of the first speech frame in the speech information can be determined according to the multilingual acoustic model; if the language type of the first speech frame is determined to be Chinese, the first speech frame can be decoded according to the Chinese decoding network to obtain the first decoding result of the first speech frame. Then, the language type of the second speech frame is determined according to the multilingual acoustic model, and the second speech frame is input into the decoding network corresponding to its language type for decoding to obtain the first decoding result of the second speech frame. By analogy, if the language type of the m-th speech frame is determined to be English according to the multilingual acoustic model, the m-th speech frame can be decoded according to the English decoding network to obtain the first decoding result of the m-th speech frame, until the decoding of the last speech frame is completed. Finally, the recognition result of the speech information may be obtained according to the first decoding result of each speech frame; for example, the recognition result may include the following text information: "I like apple".
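The frame-by-frame dispatch described in this example can be pictured with the following Python sketch; the acoustic_model and decoders objects, their method names, and the language tags "zh" and "en" are hypothetical placeholders, not the patented implementation.

    def recognize(speech_frames, acoustic_model, decoders):
        # decoders is assumed to map a language tag (e.g. "zh", "en") to a decoding
        # network object exposing a decode(frame) method that returns text.
        first_results = []
        for frame in speech_frames:
            language = acoustic_model.detect_language(frame)   # language type of this frame
            first_results.append(decoders[language].decode(frame))
        # The recognition result is assembled from the per-frame first decoding results.
        return "".join(first_results)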
Therefore, the language type of the voice frame in the voice information can be determined through the trained multilingual acoustic model, so that the voice frame can be decoded according to the decoding network of the corresponding language type to obtain a more accurate recognition result.
In an optional embodiment of the present invention, the determining the language type of the speech frame in the speech information according to the multilingual acoustic model specifically includes:
step S11, determining the posterior probability of each state corresponding to the voice frame according to the multilingual acoustic model; wherein, the state and the language type have a corresponding relation;
step S12, determining the probability ratio of the posterior probability of the speech frame corresponding to each language type state according to the posterior probability of each state corresponding to the speech frame and the language type corresponding to each state;
and step S13, determining the language type of the speech frame according to the probability ratio.
The multilingual acoustic model may convert the features of an input speech frame into a posterior probability for each state, where a state may specifically be an HMM (Hidden Markov Model) state; specifically, several states may correspond to one phoneme, several phonemes may correspond to one word, and several words may form one sentence.
For example, it is assumed that the multilingual acoustic model can output posterior probabilities corresponding to (M1 + M2) states at its output layer, where the language type corresponding to the M1 states may be Chinese and the language type corresponding to the M2 states may be English.
After a speech frame is input into the multilingual acoustic model, the posterior probability of each state corresponding to the speech frame is output. According to the posterior probability of each state and the language type corresponding to each state, the probability ratio of the posterior probabilities of the speech frame over the states of each language type, for example the ratio between the Chinese states and the English states, can be determined, and the language type of the speech frame can then be determined according to this probability ratio.
For example, suppose the probability value obtained by adding the posterior probabilities of the M1 Chinese states is p1, the probability value obtained by adding the posterior probabilities of the M2 English states is p2, and p1 + p2 = 1. If p1 is greater than p2, the Chinese states account for the larger share of the posterior probabilities of the speech frame, and the language type of the speech frame may be determined to be Chinese; conversely, the language type of the speech frame may be determined to be English.
However, for speech information mixing Chinese and English, the posterior probability of English is usually small and rarely exceeds 0.5. Therefore, in order to reduce misjudgment, the embodiment of the present invention may set a preset threshold and determine the language type of the speech frame by comparing the probability ratio with the preset threshold.
Taking Chinese-English mixing as an example, assume the probability ratio of the posterior probability of a speech frame over the English states to that over the Chinese states is p2/p1; if p2/p1 exceeds a preset threshold (e.g., 0.25), the language type of the speech frame may be determined to be English. Similarly, the probability ratio over the Chinese states to that over the English states is p1/p2; if p1/p2 exceeds 4, the language type of the speech frame may be determined to be Chinese. The preset threshold may be adjusted according to experiments, and it can be understood that the specific value of the preset threshold is not limited in the embodiment of the present invention.
Of course, since p1 + p2 = 1 and p2/p1 > 0.25 is equivalent to p2 > 0.2, the determination can also be made simply from the value of p1 or p2.
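A minimal Python sketch of this per-frame decision, assuming the acoustic model's output vector lists the M1 Chinese state posteriors first and the M2 English state posteriors after them; the function name and the 0.25 threshold are only illustrative.

    import numpy as np

    def frame_language(posteriors, num_chinese_states, threshold=0.25):
        # posteriors: output vector of length M1 + M2 for one speech frame.
        p1 = float(np.sum(posteriors[:num_chinese_states]))   # mass on Chinese states
        p2 = float(np.sum(posteriors[num_chinese_states:]))   # mass on English states
        # English if the ratio p2/p1 exceeds the preset threshold, otherwise Chinese.
        return "en" if p2 / max(p1, 1e-8) > threshold else "zh"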
In a specific application, if a user frequently switches between language types or the voice message is short, determining the language type of a speech frame from that single frame alone may cause a determination error.
In order to improve the accuracy of determining the language type of the speech frame, in an optional embodiment of the present invention, the language type of the speech frame may be determined according to an average value of probability ratios of the posterior probabilities of the consecutive speech frames within the preset window length in which the speech frame is located to the respective language type states.
It is to be understood that the specific value of the preset window length is not limited in the embodiment of the present invention; for example, the preset window length may be set to the duration of 10 consecutive speech frames. Specifically, the 10 consecutive speech frames containing the current speech frame may be obtained, the probability ratio p2/p1 of the posterior probabilities over the English states to those over the Chinese states may be calculated for each of the 10 speech frames, and the 10 values of p2/p1 may then be averaged; if the average value exceeds the preset threshold of 0.25, the language type of the speech frame may be determined to be English. This avoids the misjudgment that may arise from a single-frame decision and further improves the accuracy of determining the language type of the speech frame.
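The smoothing over a preset window can be sketched as follows; the list of ratios, the window of 10 frames, and the 0.25 threshold are example values taken from the paragraph above, and the function name is hypothetical.

    import numpy as np

    def windowed_language(ratio_history, threshold=0.25):
        # ratio_history holds the p2/p1 ratios of the consecutive frames in the
        # preset window (e.g. 10 frames) containing the current frame.
        return "en" if float(np.mean(ratio_history)) > threshold else "zh"

    # Example: ratios of 10 consecutive frames; the mean is 0.3, which exceeds 0.25.
    ratios = [0.1, 0.2, 0.4, 0.5, 0.3, 0.35, 0.3, 0.2, 0.25, 0.4]
    print(windowed_language(ratios))   # prints "en"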
In an optional embodiment of the invention, before said determining the language type of a speech frame in the speech information according to the multilingual acoustic model, the method may further comprise:
step S21, determining a target language type from the at least two language types;
step S22, decoding each voice frame in the voice information according to the decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
after determining the language type of a speech frame in the speech information according to the multilingual acoustic model, the method may further comprise:
determining a target speech frame from the speech frames of the speech information, and determining a second decoding result of the target speech frame; the language type of the target voice frame is a non-target language type;
the decoding, according to the decoding network corresponding to the language type of the speech frame, the speech frame to obtain a first decoding result of the speech frame may specifically include: decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame to obtain a first decoding result of the target voice frame;
the determining, according to the first decoding result, the recognition result corresponding to the voice information may specifically include: and replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and taking the replaced second decoding result as the recognition result corresponding to the speech information.
In a specific application, when a user expresses a sentence with a mixture of two language types, most of the sentence usually uses one language type, with only a small part interspersed in the other language type. In addition, when the speech information is short, for example when it contains only one English word, the decoding result may be inaccurate because that single word has no context information during decoding.
Therefore, the embodiment of the present invention may determine the target language type from the at least two language types, where the target language type may be the main language used in the mixed-language expression; for example, the target language type may be determined as Chinese. In the process of decoding the voice information, each speech frame in the voice information is decoded according to the Chinese decoding network to obtain a second decoding result (e.g., R1) corresponding to each speech frame, where R1 is the Chinese decoding result. The second decoding result is obtained by decoding a complete segment of speech information, and each speech frame can refer to its corresponding context information during decoding, so the accuracy of the second decoding result can be improved.
After the decoding network corresponding to the target language type completes the decoding of all the speech frames in the voice information, the target speech frames can be determined from the speech frames of the voice information, where the language type of a target speech frame is a non-target language type. For example, for speech information mixing Chinese and English, if the target language type is determined to be Chinese, then English is a non-target language type; that is, the speech frames whose language type is English are determined as target speech frames from the voice information, and the first decoding result (e.g., R2) corresponding to the target speech frames is determined, where R2 is obtained by decoding the target speech frames according to the English decoding network, i.e., R2 is the English decoding result. Finally, the corresponding R1 is replaced with R2 to obtain the recognition result corresponding to the voice information.
In one application example of the present invention, it is assumed that the speech information to be recognized is "I like apple" and that the target language type is Chinese. Specifically, the speech information is first input into the multilingual acoustic model to obtain the state posterior probability sequence corresponding to each speech frame, and the Chinese state posterior probabilities of each speech frame are decoded according to the Chinese decoding network to obtain the second decoding result of each speech frame; assume the second decoding result of the speech information is "I like love break". Then, the language type of each speech frame is determined according to the posterior probability of each state corresponding to the speech frame and the language type corresponding to each state, and the speech frames whose language type is English are determined as target speech frames. Next, the target speech frames are decoded according to the English decoding network to obtain the first decoding result of the target speech frames, which is assumed to be "apple". Finally, the "love break" corresponding to "apple" in the second decoding result "I like love break" is replaced with "apple", and the replaced second decoding result is obtained as the following text: "I like apple".
It should be noted that, in the embodiment of the present invention, for a speech frame whose language type is the target language type, the first decoding result and the second decoding result are the same. For example, in the above example, the speech frames corresponding to "I like" have the Chinese language type, and the target language type is also Chinese, so both the first decoding result and the second decoding result of those speech frames are the text "I like".
In an optional embodiment of the present invention, the first decoding result and the second decoding result may include: time boundary information of the corresponding voice frame;
the replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame may specifically include:
step S31, determining a replaced result from the second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of a language type corresponding to the target speech frame;
and step S32, replacing the replaced result with the first decoding result of the language type corresponding to the target voice frame.
In order to ensure that the second decoding result of the target speech frame can be accurately replaced by the first decoding result of the language type corresponding to the target speech frame, the first decoding result and the second decoding result of the embodiment of the present invention may include: time boundary information of the corresponding speech frame.
For example, in the above example, for the second decoding result "I like love break", each word includes the time boundary information of its corresponding speech frames, so the replaced result can be determined from the second decoding result according to the time boundary information, such that the replaced result coincides with the time boundary of the first decoding result of the language type corresponding to the target speech frames. As can be seen from the above example, the first decoding result of the language type corresponding to the target speech frames is "apple". Assuming that the part of the second decoding result "I like love break" that coincides with the time boundary information of "apple" is determined to be "love break", then "love break" in "I like love break" may be replaced with "apple", and the decoding result after replacement is "I like apple".
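A sketch of this time-boundary replacement is given below; the representation of decoding results as (word, start, end) tuples and the illustrative timestamps are assumptions made for this sketch, not the patented implementation.

    def replace_by_time_boundary(second_result, first_result_target):
        # Both arguments are lists of (word, start_time, end_time) tuples carrying
        # the time boundary information of the corresponding speech frames.
        if not first_result_target:
            return [word for word, _, _ in second_result]
        span_start = first_result_target[0][1]
        span_end = first_result_target[-1][2]
        merged, inserted = [], False
        for word, start, end in second_result:
            if span_start <= start and end <= span_end:
                if not inserted:                       # splice in the target-language result once
                    merged.extend(w for w, _, _ in first_result_target)
                    inserted = True
                continue                               # drop the replaced words
            merged.append(word)
        return merged

    # "I like love break" -> "I like apple", where "love break" and "apple" share the
    # same time boundaries (times in seconds are illustrative only).
    second = [("I", 0.0, 0.3), ("like", 0.3, 0.6), ("love", 0.6, 0.9), ("break", 0.9, 1.2)]
    first_en = [("apple", 0.6, 1.2)]
    print(" ".join(replace_by_time_boundary(second, first_en)))   # I like apple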
In an optional embodiment of the present invention, the decoding network may specifically include: a general decoding network and a professional decoding network; wherein the general decoding network may include a language model obtained by training on a general text corpus, and the professional decoding network may include a language model obtained by training on a text corpus of a preset field;
the decoding, according to the decoding network corresponding to the language type of the speech frame, the speech frame to obtain a first decoding result of the speech frame may specifically include:
step S41, decoding the voice frame according to the general decoding network and the professional decoding network respectively to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and step S42, using the decoding result with the higher score of the first score and the second score as the first decoding result of the voice frame.
In a specific application, for a user's everyday conversational speech, a general decoding network usually achieves a good decoding effect. However, in some professional fields, such as the medical field, the speech usually contains many professional terms, such as "aspirin" and "Parkinson's disease", which will degrade the decoding effect.
To solve the above problem, the decoding network according to the embodiment of the present invention may include a general decoding network and a professional decoding network. The general decoding network may be the decoding network used in users' daily communication, and may include a language model obtained by training on a general text corpus, so that the general decoding network can recognize the everyday speech of most users. The professional decoding network may be a decoding network customized for a professional field, and may include a language model obtained by training on a text corpus of a preset field; the preset field can be any field, such as the medical field, the legal field, or the computer field.
For example, at a medical seminar, a speaker may use many sentences mixed with Chinese and English words and a large amount of medical professional vocabularies, and the embodiment of the invention can recognize the speech of the speaker as characters in real time and display the characters on a large screen for the participants to watch.
Specifically, the speech of the speaker may be decoded frame by frame according to the generic decoding network and the professional decoding network, respectively, to obtain a first score of the speech frame corresponding to the generic decoding network and a second score of the speech frame corresponding to the professional decoding network, and a decoding result with a higher score of the first score and the second score is used as a first decoding result of the speech frame.
It can be understood that the decoding network of the embodiment of the present invention may include a plurality of decoding networks corresponding to different language types, and each decoding network of a language type may further include a general decoding network and a professional decoding network corresponding to the language type. Therefore, the embodiment of the invention can supplement or correct the decoding result of the general decoding network through the professional decoding network, and can improve the decoding accuracy under the condition that the voice information contains professional domain words.
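The score comparison between the general and professional decoding networks could look like the following sketch; both decoder objects and their decode method returning a (text, score) pair are assumptions for illustration.

    def decode_with_domain_fallback(frame, general_decoder, professional_decoder):
        # Each decoder is assumed to return (text, score) for the given speech frame.
        general_text, first_score = general_decoder.decode(frame)             # first score
        professional_text, second_score = professional_decoder.decode(frame)  # second score
        # The hypothesis with the higher score becomes the first decoding result of the frame.
        return general_text if first_score >= second_score else professional_text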
It is understood that the embodiment of the present invention does not limit the training manner for training the multilingual acoustic models. In an optional embodiment of the invention, each of the acoustic data of the at least two language types corresponds to at least two language types.
Specifically, embodiments of the present invention may collect mixed acoustic data containing at least two language types, each piece of which corresponds to at least two language types, to train the multilingual acoustic model. For example, the speech corresponding to "I like apple" may be one piece of mixed acoustic data.
Training a multilingual acoustic model according to the mixed acoustic data requires combining similar pronunciation units in different languages to generate a pronunciation dictionary suitable for the mixed languages, but certain errors may be brought in the process of combining the pronunciation units. In addition, mixed acoustic data containing at least two language types is often characterized by rare data and difficult collection, and therefore, the accuracy of multilingual acoustic model recognition will be affected.
To solve the above problem, in an alternative embodiment of the present invention, each of the acoustic data of the at least two language types corresponds to one language type.
Specifically, the embodiments of the present invention may collect monolingual acoustic data corresponding to each of at least two language types, and train the multilingual acoustic model on a training data set composed of the monolingual acoustic data of each language type. For example, the speech corresponding to "the weather is good today" may be one piece of monolingual acoustic data, and the speech corresponding to "What's the weather like today" may also be one piece of monolingual acoustic data.
In an optional embodiment of the present invention, the step of training the multilingual acoustic models may specifically include:
step S51, respectively training a monolingual acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
step S52, respectively labeling states of the acoustic data of the at least two language types according to the single language acoustic model, wherein the states and the language types have corresponding relations;
and step S53, training the multilingual acoustic model according to a data set formed by the acoustic data of at least two language types after being labeled.
Specifically, a monolingual acoustic model NN1 corresponding to Chinese may be trained on the collected Chinese acoustic data L1, where the language type corresponding to each piece of data in L1 is Chinese. The number of tied HMM states of Chinese speech can be set as the number of nodes of the NN1 output layer, e.g., M1. The output of this monolingual acoustic model includes state probabilities corresponding to one language type; that is, the state probabilities of the M1 output-layer nodes all correspond to the Chinese language type.
Similarly, a monolingual acoustic model NN2 corresponding to English may be trained on the collected English acoustic data L2, where the language type corresponding to each piece of data in L2 is English. The number of tied HMM states of English speech may be set as the number of nodes of the NN2 output layer, e.g., M2, and the state probabilities of these M2 nodes all correspond to the English language type.
Then, the Chinese acoustic data L1 and the English acoustic data L2 are forcibly aligned with the trained NN1 and NN2, respectively, to label the states of L1 and L2. Specifically, the state corresponding to each speech frame of each piece of data in L1 may be determined by NN1, and the state corresponding to each speech frame of each piece of data in L2 may be determined by NN2.
Finally, the labeled L1 and L2 are blended together to obtain a labeled data set (L1 + L2) for training the multilingual acoustic model NN3. The output of the multilingual acoustic model may include state probabilities corresponding to at least two language types. For example, the number of nodes of the NN3 output layer may be M1 + M2, where the first M1 nodes may correspond to the states of the Chinese HMMs and the last M2 nodes may correspond to the states of the English HMMs.
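Merging the two labeled monolingual data sets for the (M1 + M2)-node output layer could be sketched as follows; the tuple representation of a labeled example and the offsetting of English state ids by M1 are assumptions consistent with the node ordering described above.

    def merge_labelled_datasets(chinese_data, english_data, num_chinese_states):
        # Each data set is a list of (features, state_id) pairs produced by forced
        # alignment with the corresponding monolingual model (NN1 or NN2).
        merged = [(feats, state) for feats, state in chinese_data]
        # English state ids are shifted by M1 so they occupy the last M2 output nodes.
        merged += [(feats, state + num_chinese_states) for feats, state in english_data]
        return merged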
In the process of training the multilingual acoustic model, the embodiment of the invention uses the monolingual acoustic data corresponding to each language type, which preserves the pronunciation characteristics of each language type, so that the model has a certain ability to distinguish different language types at the acoustic level. In addition, in the process of collecting training data, the acoustic data of each language type are collected separately, which avoids the data-scarcity problem of collecting mixed acoustic data of multiple language types and can thus improve the recognition accuracy of the multilingual acoustic model.
To sum up, the embodiment of the present invention can obtain a multilingual acoustic model by training on acoustic data of at least two language types, and the language type of each speech frame in the speech information can be determined through the multilingual acoustic model. Therefore, when the speech information includes multiple language types, the embodiment of the present invention can accurately distinguish the speech frames of different language types in the speech information and decode each speech frame according to the decoding network of the corresponding language type to obtain the first decoding result of the speech frame. Because the first decoding result is obtained with the decoding network corresponding to the language type of the speech frame, the decoding accuracy can be ensured, and the accuracy of speech recognition can be further improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 2, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, where the apparatus may specifically include:
a type determining module 201, configured to determine a language type of a speech frame in the speech information according to the multilingual acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
a first decoding module 202, configured to decode the voice frame according to a decoding network corresponding to a language type of the voice frame, so as to obtain a first decoding result of the voice frame;
and the result determining module 203 is configured to determine, according to the first decoding result, a recognition result corresponding to the voice information.
Optionally, the type determining module may specifically include:
the probability determination submodule is used for determining the posterior probability of each state corresponding to the voice frame according to the multi-language acoustic model; wherein, the state and the language type have a corresponding relation;
the ratio determining submodule is used for determining the probability ratio of the posterior probability of the voice frame corresponding to each language type state according to the posterior probability of each state corresponding to the voice frame and the language type corresponding to each state;
and the type determining submodule is used for determining the language type of the voice frame according to the probability ratio.
Optionally, the apparatus may further include:
the target language determining module is used for determining a target language type from the at least two language types;
the second decoding module is used for decoding each voice frame in the voice information according to the decoding network corresponding to the target language type so as to obtain a second decoding result of each voice frame;
the apparatus may further include:
a target frame determining module, configured to determine a target speech frame from the speech frames of the speech information, and determine a second decoding result of the target speech frame; the language type of the target voice frame is a non-target language type;
the first decoding module may specifically include:
the first decoding submodule is used for decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
the result determining module may specifically include:
and the first result determining submodule is used for replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and taking the replaced second decoding result as the recognition result corresponding to the voice information.
Optionally, the first decoding result and the second decoding result include: time boundary information of the corresponding voice frame;
the first result determination sub-module may specifically include:
a result determining unit, configured to determine a replaced result from a second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of a language type corresponding to the target speech frame;
and the replacing unit is used for replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
Optionally, the decoding network may specifically include: a general decoding network and a professional decoding network; wherein the general decoding network includes a language model obtained by training on a general text corpus, and the professional decoding network includes a language model obtained by training on a text corpus of a preset field;
the first decoding module may specifically include:
the score determining submodule is used for decoding the voice frame according to the general decoding network and the professional decoding network respectively to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and the second result determining submodule is used for taking the decoding result with the higher score in the first score and the second score as the first decoding result of the voice frame.
Optionally, the apparatus may further include: a model training module for training the multilingual acoustic models; the model training module may specifically include:
the first training submodule is used for respectively training the single-language acoustic models corresponding to the language types according to the collected acoustic data of at least two language types;
the state labeling submodule is used for respectively labeling the states of the acoustic data of the at least two language types according to the single language acoustic model, wherein the states and the language types have corresponding relations;
and the second training submodule is used for training the multilingual acoustic model according to a data set formed by the acoustic data of at least two language types after being labeled.
Optionally, each of the acoustic data of the at least two language types corresponds to at least two language types; alternatively, each of the acoustic data of the at least two language types corresponds to one language type.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for: determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types; decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame; and determining the recognition result corresponding to the voice information according to the first decoding result.
Fig. 3 is a block diagram illustrating an apparatus 800 for data processing in accordance with an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or of a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900, which may vary widely in configuration or performance, may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Still further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations stored in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium is also provided, in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the data processing method shown in Fig. 1.
A non-transitory computer-readable storage medium is also provided, in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a data processing method, the method comprising: determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types; decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame; and determining the recognition result corresponding to the voice information according to the first decoding result.
The embodiment of the invention discloses A1, a data processing method, which comprises the following steps: determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and determining the recognition result corresponding to the voice information according to the first decoding result.
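As an illustration only (not part of the claimed subject matter), the three steps of A1 can be sketched per frame as follows. Here frame_language() stands in for the multilingual acoustic model, decoders maps a language type to its decoding network, and decode() is assumed to return the decoded text; all of these names are assumptions introduced for this sketch rather than interfaces defined by this disclosure. A real system would group consecutive frames of the same language before decoding, as refined in A3 below.

    # Minimal sketch of the A1 flow (illustrative names, not from the disclosure).
    def recognize(frames, frame_language, decoders):
        pieces = []
        for frame in frames:
            lang = frame_language(frame)                  # 1. language type of the voice frame
            first_result = decoders[lang].decode(frame)   # 2. decode with that language's network
            pieces.append(first_result)
        return " ".join(pieces)                           # 3. recognition result of the voice information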
A2, the method according to A1, wherein the determining the language type of a speech frame in speech information according to a multilingual acoustic model comprises:
determining the posterior probability of each state corresponding to the voice frame according to the multi-language acoustic model; wherein, the state and the language type have a corresponding relation;
determining, according to the posterior probability of each state corresponding to the speech frame and the language type corresponding to each state, the probability ratio of the speech frame for the states of each language type;
and determining the language type of the voice frame according to the probability ratio.
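As a non-limiting sketch of A2, assuming the multilingual acoustic model exposes one posterior per state for a frame and a known state-to-language mapping (both assumptions made only for illustration), the per-language probability ratio and the resulting language type can be computed as:

    # Illustrative A2 sketch: pick the language whose states hold the largest posterior share.
    def detect_frame_language(state_posteriors, state_to_language):
        totals = {}
        for prob, lang in zip(state_posteriors, state_to_language):
            totals[lang] = totals.get(lang, 0.0) + prob
        mass = sum(totals.values()) or 1.0
        ratios = {lang: p / mass for lang, p in totals.items()}   # probability ratio per language type
        return max(ratios, key=ratios.get)

In practice the per-frame ratios would typically be smoothed over a window of neighbouring frames before a language switch is declared.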
A3, the method according to A1, further comprising, before said determining the language type of a speech frame in the speech information according to a multilingual acoustic model:
determining a target language type from the at least two language types;
decoding each voice frame in the voice information according to a decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
after said determining the language type of a speech frame in the speech information according to the multilingual acoustic model, the method further comprises:
determining a target speech frame from the speech frames of the speech information, and determining a second decoding result of the target speech frame; the language type of the target voice frame is a non-target language type;
the decoding the voice frame according to the decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame includes:
decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame to obtain a first decoding result of the target voice frame;
the determining, according to the first decoding result, the recognition result corresponding to the voice information includes:
and replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and taking the replaced second decoding result as the recognition result corresponding to the speech information.
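A3 amounts to a two-pass scheme: decode everything with the target-language decoding network first, then re-decode only the spans detected as another language and splice them back in. A rough sketch under the same illustrative names as above follows; the splice helper replace_by_time_boundary() is sketched after A4 below, and the assumption that decode() returns a list of timed items is not imposed by the disclosure.

    # Illustrative A3 sketch: second pass only over non-target-language spans.
    def recognize_two_pass(frames, target_lang, decoders, frame_language):
        second_result = decoders[target_lang].decode(frames)   # second decoding result, with time boundaries
        langs = [frame_language(f) for f in frames]

        # collect contiguous runs of frames whose language type is not the target language
        spans, i = [], 0
        while i < len(frames):
            if langs[i] != target_lang:
                j = i
                while j + 1 < len(frames) and langs[j + 1] == langs[i]:
                    j += 1
                spans.append((i, j, langs[i]))
                i = j + 1
            else:
                i += 1

        # first decoding results replace the matching parts of the second decoding result
        for start, end, lang in spans:
            first_result = decoders[lang].decode(frames[start:end + 1])
            second_result = replace_by_time_boundary(second_result, first_result)
        return second_result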
A4, the method of A3, wherein the first decoding result and the second decoding result comprise: time boundary information of the corresponding voice frame;
replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, including:
determining a replaced result from a second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of a language type corresponding to the target speech frame;
and replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
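One possible realization of the A4 replacement, assuming each decoding result is a list of (text, start_time, end_time) items; that layout is an assumption made only for this sketch, since the disclosure merely requires time boundary information.

    # Illustrative A4 sketch: splice the first decoding result over the coinciding time span.
    def replace_by_time_boundary(second_result, first_result):
        span_start = first_result[0][1]
        span_end = first_result[-1][2]
        before = [item for item in second_result if item[2] <= span_start]   # ends before the span
        after = [item for item in second_result if item[1] >= span_end]      # starts after the span
        return before + list(first_result) + after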
A5, the method of A1, wherein the decoding network comprises: a general decoding network and a professional decoding network; the general decoding network comprises a language model trained according to a general text corpus, and the professional decoding network comprises a language model trained according to a text corpus of a preset field;
the decoding the voice frame according to the decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame includes:
decoding the voice frame according to the general decoding network and the professional decoding network respectively to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
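A5 can be read as a simple best-of-two selection. The sketch below assumes each decoding network returns a (text, score) pair; that interface is an assumption made only for illustration.

    # Illustrative A5 sketch: keep the result from whichever network scores higher.
    def decode_general_or_domain(frame, general_net, professional_net):
        general_text, first_score = general_net.decode(frame)        # LM trained on a general text corpus
        domain_text, second_score = professional_net.decode(frame)   # LM trained on a preset-field corpus
        return general_text if first_score >= second_score else domain_text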
A6, the method according to A1, wherein the training step of the multilingual acoustic model comprises:
respectively training a single language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
according to the single language acoustic model, respectively carrying out state labeling on the acoustic data of the at least two language types, wherein the states and the language types have corresponding relations;
and training the multilingual acoustic model according to a data set formed by the acoustic data of at least two language types after the labeling.
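The A6 training steps can be outlined as below. The three callables train_single_am(), label_states() and train_am() stand in for whatever acoustic-model training toolkit is actually used and are not APIs defined by this disclosure.

    # Illustrative A6 sketch of the multilingual acoustic model training procedure.
    def train_multilingual_am(corpora, train_single_am, label_states, train_am):
        # corpora: mapping from a language type to its collected acoustic data
        single_models = {lang: train_single_am(data) for lang, data in corpora.items()}

        labelled = []
        for lang, data in corpora.items():
            # label the data with states from its own single-language model; a per-language
            # prefix keeps state inventories disjoint, so each state implies one language type
            labelled.extend(label_states(data, single_models[lang], state_prefix=lang))

        # train the multilingual acoustic model on the pooled, labelled data set
        return train_am(labelled)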
A7, the method according to any one of A1 to A6, wherein each piece of the acoustic data of the at least two language types corresponds to at least two language types; alternatively, each piece of the acoustic data of the at least two language types corresponds to one language type.
The embodiment of the invention discloses B8, a data processing device, which comprises:
the type determining module is used for determining the language type of a voice frame in the voice information according to the multilingual acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
the first decoding module is used for decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and the result determining module is used for determining the recognition result corresponding to the voice information according to the first decoding result.
B9, the apparatus of B8, the type determination module comprising:
the probability determination submodule is used for determining the posterior probability of each state corresponding to the voice frame according to the multi-language acoustic model; wherein, the state and the language type have a corresponding relation;
the ratio determining submodule is used for determining, according to the posterior probability of each state corresponding to the voice frame and the language type corresponding to each state, the probability ratio of the voice frame for the states of each language type;
and the type determining submodule is used for determining the language type of the voice frame according to the probability ratio.
B10, the apparatus of B8, the apparatus further comprising:
the target language determining module is used for determining a target language type from the at least two language types;
the second decoding module is used for decoding each voice frame in the voice information according to the decoding network corresponding to the target language type so as to obtain a second decoding result of each voice frame;
the device further comprises:
a target frame determining module, configured to determine a target speech frame from the speech frames of the speech information, and determine a second decoding result of the target speech frame; the language type of the target voice frame is a non-target language type;
the first decoding module includes:
the first decoding submodule is used for decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame so as to obtain a first decoding result of the target voice frame;
the result determination module includes:
and the first result determining submodule is used for replacing the second decoding result of the target voice frame with the first decoding result of the language type corresponding to the target voice frame, and taking the replaced second decoding result as the recognition result corresponding to the voice information.
B11, the apparatus of B10, wherein the first decoding result and the second decoding result comprise: time boundary information of the corresponding voice frame;
the first result determination submodule, comprising:
a result determining unit, configured to determine a replaced result from a second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of a language type corresponding to the target speech frame;
and the replacing unit is used for replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
B12, the apparatus of B8, wherein the decoding network comprises: a general decoding network and a professional decoding network; the general decoding network comprises a language model trained according to a general text corpus, and the professional decoding network comprises a language model trained according to a text corpus of a preset field;
the first decoding module includes:
the score determining submodule is used for decoding the voice frame according to the general decoding network and the professional decoding network respectively to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and the second result determining submodule is used for taking the decoding result with the higher score in the first score and the second score as the first decoding result of the voice frame.
B13, the apparatus of B8, the apparatus further comprising: a model training module for training the multilingual acoustic models; the model training module comprises:
the first training submodule is used for respectively training the single-language acoustic models corresponding to the language types according to the collected acoustic data of at least two language types;
the state labeling submodule is used for respectively labeling the states of the acoustic data of the at least two language types according to the single language acoustic model, wherein the states and the language types have corresponding relations;
and the second training submodule is used for training the multilingual acoustic model according to a data set formed by the acoustic data of at least two language types after being labeled.
B14, the apparatus according to any of B8 to B13, each of the acoustic data of the at least two language types corresponding to at least two language types; alternatively, each of the acoustic data of the at least two language types corresponds to one language type.
The embodiment of the invention discloses C15, an apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising instructions for:
determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and determining the recognition result corresponding to the voice information according to the first decoding result.
C16, the apparatus according to C15, the determining the language type of a speech frame in speech information according to a multilingual acoustic model, comprising:
determining the posterior probability of each state corresponding to the voice frame according to the multi-language acoustic model; wherein, the state and the language type have a corresponding relation;
determining, according to the posterior probability of each state corresponding to the speech frame and the language type corresponding to each state, the probability ratio of the speech frame for the states of each language type;
and determining the language type of the voice frame according to the probability ratio.
C17, the apparatus of C15, wherein the apparatus is further configured to execute, by one or more processors, the one or more programs including instructions for:
determining a target language type from the at least two language types;
decoding each voice frame in the voice information according to a decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
the device is also configured to execute, by one or more processors, the one or more programs including instructions for:
determining a target speech frame from the speech frames of the speech information, and determining a second decoding result of the target speech frame; the language type of the target voice frame is a non-target language type;
the decoding the voice frame according to the decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame includes:
decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame to obtain a first decoding result of the target voice frame;
the determining, according to the first decoding result, the recognition result corresponding to the voice information includes:
and replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and taking the replaced second decoding result as the recognition result corresponding to the speech information.
C18, the apparatus of C17, wherein the first decoding result and the second decoding result comprise: time boundary information of the corresponding voice frame;
replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, including:
determining a replaced result from a second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of a language type corresponding to the target speech frame;
and replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
C19, the apparatus of C15, wherein the decoding network comprises: a general decoding network and a professional decoding network; the general decoding network comprises a language model trained according to a general text corpus, and the professional decoding network comprises a language model trained according to a text corpus of a preset field;
the decoding the voice frame according to the decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame includes:
decoding the voice frame according to the general decoding network and the professional decoding network respectively to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
C20, the apparatus of C15, wherein the training step of the multilingual acoustic model comprises:
respectively training a single language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
according to the single language acoustic model, respectively carrying out state labeling on the acoustic data of the at least two language types, wherein the states and the language types have corresponding relations;
and training the multilingual acoustic model according to a data set formed by the acoustic data of at least two language types after the labeling.
C21, the apparatus of any of C15 to C20, each of the acoustic data of the at least two language types corresponding to at least two language types; alternatively, each of the acoustic data of the at least two language types corresponds to one language type.
Embodiments of the present invention disclose D22, a machine-readable medium having instructions stored thereon, which, when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The data processing method, the data processing apparatus, and the apparatus for data processing provided by the present invention are described in detail above. Specific examples are applied herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of data processing, the method comprising:
determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and determining the recognition result corresponding to the voice information according to the first decoding result.
2. The method of claim 1, wherein determining the language type of a speech frame in the speech information according to a multilingual acoustic model comprises:
determining the posterior probability of each state corresponding to the voice frame according to the multi-language acoustic model; wherein, the state and the language type have a corresponding relation;
determining, according to the posterior probability of each state corresponding to the speech frame and the language type corresponding to each state, the probability ratio of the speech frame for the states of each language type;
and determining the language type of the voice frame according to the probability ratio.
3. The method of claim 1, wherein prior to said determining the language type of a speech frame in the speech information according to the multilingual acoustic model, the method further comprises:
determining a target language type from the at least two language types;
decoding each voice frame in the voice information according to a decoding network corresponding to the target language type to obtain a second decoding result of each voice frame;
after said determining the language type of a speech frame in the speech information according to the multilingual acoustic model, the method further comprises:
determining a target speech frame from the speech frames of the speech information, and determining a second decoding result of the target speech frame; the language type of the target voice frame is a non-target language type;
the decoding the voice frame according to the decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame includes:
decoding the target voice frame according to a decoding network corresponding to the language type of the target voice frame to obtain a first decoding result of the target voice frame;
the determining, according to the first decoding result, the recognition result corresponding to the voice information includes:
and replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and taking the replaced second decoding result as the recognition result corresponding to the speech information.
4. The method of claim 3, wherein the first decoding result and the second decoding result comprise: time boundary information of the corresponding voice frame;
replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, including:
determining a replaced result from a second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of a first decoding result of a language type corresponding to the target speech frame;
and replacing the replaced result with a first decoding result of the language type corresponding to the target voice frame.
5. The method of claim 1, wherein the decoding network comprises: a general decoding network and a professional decoding network; wherein the general decoding network comprises a language model trained according to a general text corpus, and the professional decoding network comprises a language model trained according to a text corpus of a preset field;
the decoding the voice frame according to the decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame includes:
decoding the voice frame according to the general decoding network and the professional decoding network respectively to obtain a first score of the voice frame corresponding to the general decoding network and a second score of the voice frame corresponding to the professional decoding network;
and taking the decoding result corresponding to the higher of the first score and the second score as the first decoding result of the voice frame.
6. The method of claim 1, wherein the step of training the multilingual acoustic models comprises:
respectively training a single language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;
according to the single language acoustic model, respectively carrying out state labeling on the acoustic data of the at least two language types, wherein the states and the language types have corresponding relations;
and training the multilingual acoustic model according to a data set formed by the acoustic data of at least two language types after the labeling.
7. The method of any of claims 1 to 6, wherein each of the acoustic data of the at least two language types corresponds to at least two language types; alternatively, each of the acoustic data of the at least two language types corresponds to one language type.
8. A data processing apparatus, characterized in that the apparatus comprises:
the type determining module is used for determining the language type of a voice frame in the voice information according to the multilingual acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
the first decoding module is used for decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and the result determining module is used for determining the recognition result corresponding to the voice information according to the first decoding result.
9. An apparatus for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising instructions for:
determining the language type of a voice frame in the voice information according to the multi-language acoustic model; the multi-language acoustic model is obtained by training according to acoustic data of at least two language types;
decoding the voice frame according to a decoding network corresponding to the language type of the voice frame to obtain a first decoding result of the voice frame;
and determining the recognition result corresponding to the voice information according to the first decoding result.
10. A machine-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform a data processing method as claimed in one or more of claims 1 to 7.
CN201811603538.6A 2018-12-26 2018-12-26 Data processing method and device for data processing Active CN111369978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811603538.6A CN111369978B (en) 2018-12-26 2018-12-26 Data processing method and device for data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811603538.6A CN111369978B (en) 2018-12-26 2018-12-26 Data processing method and device for data processing

Publications (2)

Publication Number Publication Date
CN111369978A true CN111369978A (en) 2020-07-03
CN111369978B CN111369978B (en) 2024-05-17

Family

ID=71208723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811603538.6A Active CN111369978B (en) 2018-12-26 2018-12-26 Data processing method and device for data processing

Country Status (1)

Country Link
CN (1) CN111369978B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device
WO2021179701A1 (en) * 2020-10-19 2021-09-16 平安科技(深圳)有限公司 Multilingual speech recognition method and apparatus, and electronic device
CN113593531A (en) * 2021-07-30 2021-11-02 思必驰科技股份有限公司 Speech recognition model training method and system
JP2022020061A (en) * 2020-12-01 2022-01-31 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Chinese and english mixed voice recognition method, device, electronic apparatus and storage medium

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004049308A1 (en) * 2002-11-22 2004-06-10 Koninklijke Philips Electronics N.V. Speech recognition device and method
US20070130112A1 (en) * 2005-06-30 2007-06-07 Intelligentek Corp. Multimedia conceptual search system and associated search method
CN101201892A (en) * 2005-12-20 2008-06-18 游旭 Speech coding talking book and pickup body
EP2058799A1 (en) * 2007-11-02 2009-05-13 Harman/Becker Automotive Systems GmbH Method for preparing data for speech recognition and speech recognition system
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person
CN104143329A (en) * 2013-08-19 2014-11-12 腾讯科技(深圳)有限公司 Method and device for conducting voice keyword search
CN104143328A (en) * 2013-08-15 2014-11-12 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN105513589A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
US20170011735A1 (en) * 2015-07-10 2017-01-12 Electronics And Telecommunications Research Institute Speech recognition system and method
KR20170007107A (en) * 2015-07-10 2017-01-18 한국전자통신연구원 Speech Recognition System and Method
EP3133595A1 (en) * 2015-08-20 2017-02-22 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN107195295A (en) * 2017-05-04 2017-09-22 百度在线网络技术(北京)有限公司 Audio recognition method and device based on Chinese and English mixing dictionary
CN107408111A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 End-to-end speech recognition
CN107403620A (en) * 2017-08-16 2017-11-28 广东海翔教育科技有限公司 A kind of audio recognition method and device
CN107526826A (en) * 2017-08-31 2017-12-29 百度在线网络技术(北京)有限公司 Phonetic search processing method, device and server
US20180189259A1 (en) * 2016-12-30 2018-07-05 Facebook, Inc. Identifying multiple languages in a content item
WO2018153213A1 (en) * 2017-02-24 2018-08-30 芋头科技(杭州)有限公司 Multi-language hybrid speech recognition method
CN108630192A (en) * 2017-03-16 2018-10-09 清华大学 A kind of non-methods for mandarin speech recognition, system and its building method
CN108711420A (en) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 Multilingual hybrid model foundation, data capture method and device, electronic equipment
CN108986791A (en) * 2018-08-10 2018-12-11 南京航空航天大学 For the Chinese and English languages audio recognition method and system in civil aviaton's land sky call field

Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1714390A (en) * 2002-11-22 2005-12-28 皇家飞利浦电子股份有限公司 Speech recognition device and method
WO2004049308A1 (en) * 2002-11-22 2004-06-10 Koninklijke Philips Electronics N.V. Speech recognition device and method
US20070130112A1 (en) * 2005-06-30 2007-06-07 Intelligentek Corp. Multimedia conceptual search system and associated search method
CN101201892A (en) * 2005-12-20 2008-06-18 游旭 Speech coding talking book and pickup body
EP2058799A1 (en) * 2007-11-02 2009-05-13 Harman/Becker Automotive Systems GmbH Method for preparing data for speech recognition and speech recognition system
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person
US20150095032A1 (en) * 2013-08-15 2015-04-02 Tencent Technology (Shenzhen) Company Limited Keyword Detection For Speech Recognition
CN104143328A (en) * 2013-08-15 2014-11-12 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN104143329A (en) * 2013-08-19 2014-11-12 腾讯科技(深圳)有限公司 Method and device for conducting voice keyword search
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
US20170011735A1 (en) * 2015-07-10 2017-01-12 Electronics And Telecommunications Research Institute Speech recognition system and method
KR20170007107A (en) * 2015-07-10 2017-01-18 한국전자통신연구원 Speech Recognition System and Method
EP3133595A1 (en) * 2015-08-20 2017-02-22 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN107408111A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 End-to-end speech recognition
CN105513589A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
US20180189259A1 (en) * 2016-12-30 2018-07-05 Facebook, Inc. Identifying multiple languages in a content item
WO2018153213A1 (en) * 2017-02-24 2018-08-30 芋头科技(杭州)有限公司 Multi-language hybrid speech recognition method
CN108630192A (en) * 2017-03-16 2018-10-09 清华大学 A kind of non-methods for mandarin speech recognition, system and its building method
CN108711420A (en) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 Multilingual hybrid model foundation, data capture method and device, electronic equipment
CN107195295A (en) * 2017-05-04 2017-09-22 百度在线网络技术(北京)有限公司 Audio recognition method and device based on Chinese and English mixing dictionary
CN107403620A (en) * 2017-08-16 2017-11-28 广东海翔教育科技有限公司 A kind of audio recognition method and device
CN107526826A (en) * 2017-08-31 2017-12-29 百度在线网络技术(北京)有限公司 Phonetic search processing method, device and server
CN108986791A (en) * 2018-08-10 2018-12-11 南京航空航天大学 For the Chinese and English languages audio recognition method and system in civil aviaton's land sky call field

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device
CN111883113B (en) * 2020-07-30 2024-01-30 云知声智能科技股份有限公司 Voice recognition method and device
WO2021179701A1 (en) * 2020-10-19 2021-09-16 平安科技(深圳)有限公司 Multilingual speech recognition method and apparatus, and electronic device
JP2022020061A (en) * 2020-12-01 2022-01-31 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Chinese and english mixed voice recognition method, device, electronic apparatus and storage medium
JP7204861B2 (en) 2020-12-01 2023-01-16 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Recognition method, device, electronic device and storage medium for mixed Chinese and English speech
CN113593531A (en) * 2021-07-30 2021-11-02 思必驰科技股份有限公司 Speech recognition model training method and system
CN113593531B (en) * 2021-07-30 2024-05-03 思必驰科技股份有限公司 Voice recognition model training method and system

Also Published As

Publication number Publication date
CN111369978B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN107632980B (en) Voice translation method and device for voice translation
CN107221330B (en) Punctuation adding method and device and punctuation adding device
CN111145756B (en) Voice recognition method and device for voice recognition
CN110210310B (en) Video processing method and device for video processing
CN111369978B (en) Data processing method and device for data processing
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN107291704B (en) Processing method and device for processing
CN108628813B (en) Processing method and device for processing
CN110941966A (en) Training method, device and system of machine translation model
CN108073572B (en) Information processing method and device, simultaneous interpretation system
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
CN108304412B (en) Cross-language search method and device for cross-language search
CN108628819B (en) Processing method and device for processing
CN110069624B (en) Text processing method and device
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN111160047A (en) Data processing method and device and data processing device
CN111640452B (en) Data processing method and device for data processing
CN114154459A (en) Speech recognition text processing method and device, electronic equipment and storage medium
CN113539233A (en) Voice processing method and device and electronic equipment
CN113343675A (en) Subtitle generating method and device for generating subtitles
CN113343720A (en) Subtitle translation method and device for subtitle translation
CN109887492B (en) Data processing method and device and electronic equipment
CN109979435B (en) Data processing method and device for data processing
CN111832297A (en) Part-of-speech tagging method and device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant