CN111640452B - Data processing method and device for data processing - Google Patents


Info

Publication number
CN111640452B
Authority
CN
China
Prior art keywords
phoneme
phonemes
data
preset
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910157570.4A
Other languages
Chinese (zh)
Other versions
CN111640452A (en)
Inventor
林国雯
赵超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201910157570.4A
Publication of CN111640452A
Application granted
Publication of CN111640452B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination
    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00: Teaching not covered by other main groups of this subclass
    • G09B19/06: Foreign languages
    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00: Electrically-operated educational appliances
    • G09B5/04: Electrically-operated educational appliances with audible presentation of the material to be studied

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the invention provides a data processing method, a data processing apparatus, and a device for data processing. The method specifically comprises the following steps: acquiring voice data of a user pronouncing a preset text, and a phoneme sequence corresponding to the preset text; determining the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model, the preset acoustic model being obtained by training according to phoneme data of at least two language types and training data of the at least two language types; and if a matching degree smaller than a preset value exists, outputting error correction information. The embodiment of the invention can improve the accuracy and efficiency with which a user learns spoken foreign-language pronunciation.

Description

Data processing method and device for data processing
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus, and a device for data processing.
Background
With the continuous development of economic globalization, foreign language learning is receiving more and more attention. At present, many electronic products for spoken language evaluation (such as point-and-read machines and learning computers) and spoken language evaluation applications on mobile terminals have appeared on the market, which can help users correct their spoken pronunciation.
However, users with heavy local accents are easily influenced by their native-language pronunciation during foreign language learning. For example, under the influence of local accents, speakers from Hunan, Fujian and Guangdong may fail to distinguish flat-tongue from retroflex sounds, or front nasal from back nasal sounds, in their native-language pronunciation. In the process of learning a foreign language, such users carry these nonstandard native-language pronunciations into their foreign-language pronunciation, so that the foreign-language pronunciation is also nonstandard.
Existing spoken language evaluation methods cannot identify mispronunciations caused by local accents in the user's pronunciation. Moreover, the user's own hearing is insensitive to mispronunciations caused by the local accent of the native language, so such mispronunciations are difficult for the user to identify and correct, and the accuracy and efficiency with which the user learns spoken foreign-language pronunciation are therefore low.
Disclosure of Invention
The embodiment of the invention provides a data processing method, a data processing apparatus, and a device for data processing, which can improve the accuracy and efficiency with which a user learns spoken foreign-language pronunciation.
In order to solve the above problems, an embodiment of the present invention discloses a data processing method, including:
Acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
Determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
And if the matching degree smaller than the preset value exists, outputting error correction information.
In another aspect, an embodiment of the present invention discloses a data processing apparatus, including:
the acquisition module is used for acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
The matching module is used for determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
and the error correction output module is used for outputting error correction information if the matching degree smaller than the preset value is determined.
In yet another aspect, an embodiment of the present invention discloses an apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
Acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
Determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
And if the matching degree smaller than the preset value exists, outputting error correction information.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
According to the embodiment of the invention, voice data of a user pronouncing a preset text and a phoneme sequence corresponding to the preset text can be acquired, the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence can be determined according to a preset acoustic model, and if a matching degree smaller than the preset value is determined to exist, error correction information can be output. Because the preset acoustic model is obtained by training according to the phoneme data of at least two language types and the training data of the at least two language types, the preset acoustic model can learn the pronunciations of at least two languages simultaneously. Through the preset acoustic model, not only can mispronounced phonemes in the foreign-language pronunciation read aloud by the user be detected, but also error phonemes brought into the foreign-language pronunciation by the user's native-language pronunciation can be detected, and error correction information can be output to the user, so that the user can correct his or her own pronunciation according to the error correction information, thereby improving the accuracy and efficiency with which the user learns spoken foreign-language pronunciation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of an embodiment of a data processing method of the present invention;
FIG. 2 is a block diagram of an embodiment of a data processing apparatus of the present invention;
FIG. 3 is a block diagram of an apparatus 800 for data processing in accordance with the present invention; and
Fig. 4 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Method embodiment
Referring to Fig. 1, a flowchart illustrating the steps of an embodiment of a data processing method according to the present invention is shown; the method may specifically include the following steps:
Step 101, acquiring voice data of a user pronouncing a preset text and a phoneme sequence corresponding to the preset text;
Step 102, determining the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model, the preset acoustic model being obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
Step 103, if a matching degree smaller than the preset value is determined to exist, outputting error correction information.
The data processing method of the embodiment of the invention can be used for evaluating and correcting the pronunciation of a user, and can be applied to electronic devices, including but not limited to: servers, smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, car computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.
Phonemes (phones) are the smallest units of speech, obtained by analyzing the pronunciation actions within a syllable, one action constituting one phoneme. For example, the English word "pleasure" includes the following phonemes: "p", "l", "eh", "zh", "ax".
The evaluation of the user's pronunciation can be made accurate down to the phoneme level, so as to improve the accuracy of the user's pronunciation. Specifically, the embodiment of the invention can receive, through the electronic device, voice data of the user pronouncing a preset text. Because the preset text is known, the phoneme sequence corresponding to the preset text can be determined according to a pronunciation dictionary; the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence can then be determined according to a preset acoustic model, and if a matching degree smaller than the preset value exists, error correction information can be output to correct the user's pronunciation.
It can be understood that the embodiment of the present invention does not limit the manner of obtaining the voice data of the user pronouncing the preset text. For example, the voice data may be recorded in real time by the sound pickup device of the electronic device, the electronic device may obtain the voice data from a client or from the network through a wired or wireless connection, or the voice data may be obtained from an instant messaging message received in an instant messaging application, etc.
In the embodiment of the invention, the user's voice data can be segmented into a plurality of voice frames according to a preset window length and frame shift, so as to process the voice data frame by frame. If the user's voice data is analog voice data (such as a recording of the user's call), the analog voice data needs to be converted into digital voice data before the voice data is segmented.
Optionally, before the user's voice data is segmented, the electronic device may further perform noise reduction and other processing on the voice data, so as to improve the quality of the subsequent voice-information processing.
After the voice data is segmented, the acoustic features of each voice frame can be extracted. Commonly used acoustic features include PLP (Perceptual Linear Prediction) and MFCC (Mel-Frequency Cepstral Coefficient) features. It will be appreciated that the specific type of acoustic feature is not limited by the embodiment of the present invention. For example, MFCC features may be extracted, taking 39 dimensions in total; after feature extraction is completed, the original speech data is converted into a feature vector sequence, which is composed of frames, each frame being a 39-dimensional vector.
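As an illustration of this step, the following is a minimal Python sketch of framing and 39-dimensional MFCC extraction (13 static coefficients plus first- and second-order differences), assuming the librosa library is available; the sampling rate, window length and frame shift are example values rather than values fixed by this embodiment.

    import librosa
    import numpy as np

    def extract_mfcc_features(wav_path):
        # Load the voice data; 16 kHz is a common sampling rate for speech (assumed here)
        signal, sr = librosa.load(wav_path, sr=16000)
        # Frame the signal by window length and frame shift: 25 ms window, 10 ms shift
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),
                                    hop_length=int(0.010 * sr))
        # Append first- and second-order differences to reach 39 dimensions per frame
        delta1 = librosa.feature.delta(mfcc)
        delta2 = librosa.feature.delta(mfcc, order=2)
        feats = np.vstack([mfcc, delta1, delta2])  # shape: (39, num_frames)
        return feats.T  # the feature vector sequence: one 39-dimensional vector per frame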
Next, the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence may be determined according to the preset acoustic model, and whether an incorrectly pronounced phoneme (hereinafter referred to as an error phoneme) exists in the voice data may be judged according to the matching degree. For example, the preset value may be set according to a large number of experiments, and if a matching degree smaller than the preset value exists, it may be determined that an error phoneme exists in the voice data.
In an optional embodiment of the present invention, the determining, according to a preset acoustic model, the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence may specifically include:
Step S11, determining a phoneme sequence corresponding to the preset text;
Step S12, aligning the feature vectors corresponding to the voice frames in the voice data with the phonemes in the phoneme sequence according to a decoding network formed by the preset acoustic model and the preset text;
Step S13, determining the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence according to the likelihood between the aligned feature vectors and the corresponding phonemes.
It should be noted that in conventional speech recognition, H (the acoustic model), C (context dependency), L (the pronunciation dictionary) and G (the language model) are composed into an HCLG decoding network, and the feature vector sequence corresponding to the voice data is decoded according to this decoding network to obtain the speech recognition result. The language model is usually a general n-gram language model, which not only requires huge storage space but also entails high decoding complexity.
In the embodiment of the invention, the voice data is produced by the user pronouncing the preset text, so unlike conventional speech recognition, the text corresponding to the voice data is known. Therefore, the embodiment of the invention can take the preset text as G (the language model) in the HCLG decoding network; this occupies less storage space and is convenient to store and use, and it can also reduce decoding complexity and improve decoding efficiency.
Since the preset text is predetermined, the phoneme sequence corresponding to the preset text is also determined. For example, for the preset text "pleasure", the corresponding phoneme sequence is (p, l, eh, zh, ax).
After determining the phoneme sequence corresponding to the preset text, the feature vector sequence corresponding to the voice data may be decoded according to the decoding network formed by the preset acoustic model and the preset text, so as to align the feature vector in the feature vector sequence with the phonemes in the phoneme sequence, that is, calculate the voice frame boundary (start frame and end frame) corresponding to each phoneme in the phoneme sequence.
Taking the voice data of the pronunciation of the preset text "pleasure" by the user as an example, the "alignment" refers to determining which phoneme in the phoneme sequence (p, l, eh, zh, ax) the feature vector corresponding to each frame of voice frame in the voice data belongs to based on the preset acoustic model and the feature vector sequence corresponding to the voice data, that is, determining the boundary of each phoneme in the phoneme sequence (p, l, eh, zh, ax) corresponding to the voice frame in the voice data.
According to the likelihood between the aligned feature vector and the corresponding phoneme, the matching degree between the voice frame in the voice data and the phoneme in the phoneme sequence can be determined. It may be appreciated that the embodiment of the present invention does not limit a specific implementation manner of determining the matching degree, for example, the likelihood may be directly used as the matching degree, or a posterior probability may be calculated according to the likelihood, and the posterior probability may be used as the matching degree.
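To make the alignment and the matching degree concrete, the following is a minimal sketch of a toy Viterbi forced alignment: each frame is assigned, monotonically from left to right, to one phoneme of the known phoneme sequence, and the matching degree of each phoneme is then taken as the average log-likelihood of its aligned frames. The frame-by-phoneme log-likelihood matrix is an assumed input, and this simplification stands in for the HCLG decoding described above rather than reproducing it.

    import numpy as np

    def forced_align(loglik):
        """loglik: (T, P) array of log-likelihoods of each of the T frames under
        each of the P phonemes of the preset text's phoneme sequence, in order
        (assumes T >= P). Returns the (start_frame, end_frame) boundary and the
        average log-likelihood (used here as the matching degree) per phoneme."""
        T, P = loglik.shape
        NEG = -1e30
        dp = np.full((T, P), NEG)           # dp[t, p]: best score ending at frame t in phoneme p
        back = np.zeros((T, P), dtype=int)  # 0 = stayed in phoneme p, 1 = advanced from p-1
        dp[0, 0] = loglik[0, 0]
        for t in range(1, T):
            for p in range(min(t + 1, P)):
                stay = dp[t - 1, p]
                advance = dp[t - 1, p - 1] if p > 0 else NEG
                dp[t, p] = max(stay, advance) + loglik[t, p]
                back[t, p] = int(advance > stay)
        # Trace back the best path to recover the frame boundary of each phoneme
        path = np.empty(T, dtype=int)
        p = P - 1
        for t in range(T - 1, -1, -1):
            path[t] = p
            if t > 0:
                p -= back[t, p]
        results = []
        for q in range(P):
            frames = np.where(path == q)[0]
            # Matching degree of phoneme q: mean log-likelihood of its aligned frames
            results.append(((int(frames[0]), int(frames[-1])),
                            float(loglik[frames, q].mean())))
        return results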
For example, assume that voice data of the user pronouncing the preset text "pleasure" is acquired, and that the matching degree between the feature vectors corresponding to the voice frames in the voice data and each phoneme in the phoneme sequence (p, l, eh, zh, ax) is determined according to the preset acoustic model. Suppose the matching degree corresponding to the phoneme "zh" is 4, which is smaller than the preset value (assuming the preset value is 6); it can then be determined that an error phoneme exists in the user's voice data, and error correction information can be output. For example, the error phoneme and the correct phoneme corresponding to the error phoneme can be output, so that the user can compare and correct them, improving the accuracy of the user's pronunciation.
In an alternative embodiment of the present invention, after determining that there is a matching degree smaller than the preset value, the method may further include:
Step S21, taking the phonemes whose matching degree is smaller than the preset value as target phonemes;
Step S22, determining the confusing phonemes corresponding to the target phonemes in the phoneme data of the at least two language types;
Step S23, outputting the confusing phonemes.
In the embodiment of the invention, the phoneme data of the at least two language types can be sorted in advance, so as to compile the phoneme data whose pronunciations are easily confused. In the process of evaluating the user's voice data, if a matching degree smaller than the preset value exists, the phoneme whose matching degree is smaller than the preset value can be taken as the target phoneme, the confusing phonemes corresponding to the target phoneme can be determined according to the pre-compiled confusable phoneme data, and the confusing phonemes can be output, so that the user can consciously distinguish the confusing phonemes and correct his or her pronunciation.
For example, in the above example, since the matching degree of the phoneme "zh" is smaller than the preset value, the phoneme "zh" may be taken as the target phoneme, and the confusing phonemes corresponding to the phoneme "zh" may then be looked up, for example including: "rh" (the "r" in Chinese Pinyin), "sh" (English /ʃ/) and "jh" (English /dʒ/). The phoneme "zh" and its corresponding confusing phonemes "rh", "sh" and "jh" can then be output.
In the embodiment of the present invention, the pronunciation of phonemes is represented by the IPA (International Phonetic Alphabet); for example, the IPA pronunciation of the above "sh" is /ʃ/, and the IPA pronunciation of "jh" is /dʒ/.
Optionally, in order to further determine which phoneme is the error phoneme in the user's voice data, after determining the confusing phonemes corresponding to the target phoneme, the method may further include:
Step S31, replacing the target phoneme in the phoneme sequence with the confusing phoneme;
Step S32, decoding the context triphone corresponding to the confusing phoneme in the replaced phoneme sequence to obtain the likelihood corresponding to the confusing phoneme;
Step S33, determining the confusing phoneme with the maximum likelihood as the error phoneme in the voice data.
For example, in the above example, the phoneme "zh" is taken as the target phoneme, and the confusing phonemes corresponding to the target phoneme "zh" are determined to include "rh", "sh" and "jh". The target phoneme "zh" in the phoneme sequence (p, l, eh, zh, ax) is replaced with each confusing phoneme in turn, and the context triphone corresponding to the confusing phoneme in the replaced phoneme sequence is decoded to obtain the likelihood corresponding to that confusing phoneme.
Specifically, the following three context triphones can be obtained after substitution: "eh-rh+ax", "eh-sh+ax" and "eh-jh+ax". The three context triphones are decoded respectively according to the preset acoustic model, so as to calculate the likelihoods corresponding to the confusing phonemes "rh", "sh" and "jh", and the confusing phoneme with the highest likelihood is determined as the error phoneme in the voice data. For example, if the likelihood corresponding to the confusing phoneme "rh" is determined to be the largest, it can be determined that the error phoneme in the voice data is "rh"; that is, the user incorrectly pronounced the phoneme "zh" in "pleasure" as "rh" (the "r" in Chinese Pinyin).
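A minimal sketch of this error-phoneme decision follows; decode_triphone stands for decoding one context triphone with the preset acoustic model and returning its likelihood, and is an assumed helper rather than an interface defined by this embodiment.

    def find_error_phoneme(phonemes, target_idx, confusables, decode_triphone):
        """phonemes: the phoneme sequence, e.g. ["p", "l", "eh", "zh", "ax"];
        target_idx: index of the target phoneme whose matching degree is low
        (assumed here not to be the first or last phoneme of the sequence);
        confusables: confusing phonemes of the target, e.g. ["rh", "sh", "jh"];
        decode_triphone: callable mapping a context triphone to its likelihood."""
        left = phonemes[target_idx - 1]   # left context, e.g. "eh"
        right = phonemes[target_idx + 1]  # right context, e.g. "ax"
        likelihoods = {}
        for c in confusables:
            # Replace the target phoneme and build the context triphone, e.g. "eh-rh+ax"
            likelihoods[c] = decode_triphone(f"{left}-{c}+{right}")
        # The confusing phoneme with the maximum likelihood is taken as the error phoneme
        return max(likelihoods, key=likelihoods.get)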
In order to provide more error correction information to the user, in an alternative embodiment of the invention, the error correction information may include at least any one of the following: the acoustic model score corresponding to the voice data, the acoustic model score corresponding to each phoneme in the voice data, the error phonemes in the voice data, and the correct phonemes corresponding to the error phonemes.
The acoustic model score corresponding to each phoneme in the voice data can be calculated by a GOP (Goodness of Pronunciation) algorithm, and the GOP score can be calculated by the following formula:
GOP(q_i) = log P(q_i | O) / NF(O) ≈ (1 / NF(O)) · log( P(O | q_i) / max_q P(O | q) ) (1)
In formula (1), q_i denotes a phoneme, O denotes the observation, P(q_i | O) is the probability of the phoneme q_i given the observation O, and NF(O) denotes the number of frames in the observation interval. P(O | q_i) represents the likelihood corresponding to the phoneme q_i. As can be seen from formula (1), GOP(q_i) measures the probability that this segment of speech corresponds to the phoneme q_i given the observed user speech O; the higher this probability, the more accurate the user's pronunciation, and the lower this probability, the worse the user's pronunciation.
In the embodiment of the present invention, the acoustic model score of a phoneme can be calculated by the following formula:
score = f(GOP(q_i)) (2)
Specifically, the value of GOP(q_i) may be mapped onto a percentage scale according to a preset function f, so that the acoustic model score of the phoneme q_i is obtained.
After the acoustic model score of each phoneme q_i is obtained, the acoustic model scores of the phonemes in a word are summed and averaged to obtain the acoustic model score corresponding to the word; similarly, the acoustic model score corresponding to the whole voice data can be obtained by calculation.
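A minimal numeric sketch of formulas (1) and (2) follows, assuming the per-frame log-likelihoods are already available from the alignment; the percentage mapping f is an assumed clipped linear function, since the embodiment leaves f as a preset function.

    import numpy as np

    def gop(loglik_target, loglik_all):
        """Formula (1). loglik_target: (NF,) log P(o_t | q_i) over the NF frames
        aligned to phoneme q_i; loglik_all: (NF, Q) log P(o_t | q) for all Q phonemes."""
        nf = len(loglik_target)
        # Frame-normalized log-ratio of the target phoneme against the best competitor
        return (loglik_target.sum() - loglik_all.max(axis=1).sum()) / nf

    def acoustic_model_score(gop_value, floor=-10.0):
        # Formula (2): an assumed f mapping GOP values (<= 0) onto the percentage scale
        return 100.0 * max(0.0, 1.0 - gop_value / floor)

    def word_score(phone_scores):
        # Sum and average the phoneme scores to obtain the word-level score
        return sum(phone_scores) / len(phone_scores)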
In an application example of the present invention, assume that voice data of the user pronouncing the preset text "ticket" is acquired, in which the user incorrectly pronounces the last phoneme "t" as the phoneme "jh" (English /dʒ/). Through the embodiment of the present invention, error correction information can be output, and the output error correction information can be represented in JSON (JavaScript Object Notation) format.
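An illustrative sketch of such error correction information follows; apart from the misread final phoneme ("t" detected as "jh", with the low score 30.01), the frame boundaries and the remaining scores are assumed example values.

    {
        "word": "ticket",
        "score": 75.32,
        "start": 0,
        "end": 62,
        "phoneInfo": [
            {"phone": "t",  "recognition": "t",  "score": 92.10, "start": 0,  "end": 13},
            {"phone": "ih", "recognition": "ih", "score": 88.54, "start": 13, "end": 26},
            {"phone": "k",  "recognition": "k",  "score": 90.33, "start": 26, "end": 38},
            {"phone": "ax", "recognition": "ax", "score": 75.62, "start": 38, "end": 50},
            {"phone": "t",  "recognition": "jh", "score": 30.01, "start": 50, "end": 62}
        ]
    }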
Wherein, "score" represents an acoustic model score, "start" represents a start frame, "end" represents an end frame, "word" represents a preset text, "phoneinfo" represents a phoneme sequence corresponding to the preset text, "phoneinfo" each structure represents a phoneme, "phone" represents an original phoneme, and "recognition" represents a recognition result.
As can be seen from the above error correction information, the original form of the last phoneme should be "t", while the detection result is "jh"; that is, the user misread "t" as "jh" (/dʒ/), and the acoustic model score of the phoneme "t" is 30.01, which is low. According to the error correction information, the user can intuitively understand the pronunciation of each phoneme in his or her own speech, which makes it easier to correct mispronunciations and improves the accuracy of the user's pronunciation.
It can be appreciated that presenting the error correction information in JSON format is merely one application example of the present invention, and the present invention does not limit the presentation manner of the error correction information. For example, the presentation manner may include: text, pictures, audio, video, and the like, where the text presentation manner may include lists, tables, etc.
It should be noted that, for convenience of description, the embodiment of the present invention is described taking words as an example; in practical application, the preset text may include: words, phrases, sentences, paragraphs, articles, etc.
In an alternative embodiment of the invention, the preset acoustic model may be trained by the following steps:
Step S41, labeling the phoneme data of the at least two language types, wherein phoneme data whose pronunciations are the same or satisfy an approximation condition across different language types use the same label symbol, and phoneme data whose pronunciations differ across language types use different label symbols;
Step S42, training the preset acoustic model according to the labeled phoneme data and the collected training data of the at least two language types.
In the embodiment of the invention, the preset acoustic model can be a classification model fusing multiple neural networks. These neural networks include, but are not limited to, at least one of the following, or a combination, superposition or nesting of at least two of the following: a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, an RNN (Simple Recurrent Neural Network), an attention neural network, and the like.
The number and the kinds of language types used to train the preset acoustic model are not limited. For example, for Chinese speakers learning English, the language types used for training may include English and Chinese, where Chinese is the native language and English is the foreign language. As another example, for Japanese speakers learning French, the language types used for training may include Japanese and French, where Japanese is the native language and French is the foreign language.
For convenience of description, the embodiment of the invention takes training a preset acoustic model for the English and Chinese language types as an example; application scenarios with other language types can refer to this description.
First, the phoneme data of the at least two language types can be labeled to obtain a mixed-language pronunciation dictionary. Specifically, phoneme data whose pronunciations are the same or satisfy the approximation condition across different language types can be merged into one class and given the same label symbol, while phoneme data whose pronunciations differ across language types are divided into different classes and given different label symbols.
For example: the /ʒ/ in the English word "pleasure" does not exist in Chinese, so /ʒ/ can be an independent phoneme with its own label; assume that the label corresponding to this phoneme is "zh". The Chinese Pinyin "r" (IPA /ʐ/), as in the word 日, does not exist in English, so it can also be an independent phoneme; assume that its label is "rh". The /d/ in an English word such as "desk" and the Pinyin "d" in Chinese are difficult to distinguish by ear, so they can be merged into one class and labeled "d". Likewise, the "z" in the English word "zoo" (IPA /z/) and the Pinyin "z" in Chinese (IPA /ts/) are difficult to distinguish by ear and therefore satisfy the approximation condition, so they can be merged into one class and labeled "z".
Through the above classification, the phoneme data of English and the phoneme data of Chinese can be merged to obtain a mixed-language pronunciation dictionary. Finally, the preset acoustic model can be trained according to the labeled phoneme data (the mixed-language pronunciation dictionary) and the collected training data of the at least two language types.
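A minimal sketch of the resulting mixed-language label mapping follows, using only the example phonemes above; the dictionary layout and lookup function are illustrative assumptions, not the full mixed-language pronunciation dictionary.

    # Phonemes whose pronunciations are the same or satisfy the approximation
    # condition across languages share one label; language-specific phonemes
    # receive independent labels (entries follow the examples above)
    MIXED_LABELS = {
        ("en", "/d/"):  "d",   # English /d/, e.g. in "desk"
        ("zh", "/d/"):  "d",   # Pinyin "d"; hard to distinguish by ear, so merged
        ("en", "/z/"):  "z",   # English /z/, as in "zoo"
        ("zh", "/ts/"): "z",   # Pinyin "z"; satisfies the approximation condition
        ("en", "/ʒ/"):  "zh",  # as in "pleasure"; absent from Chinese, so independent
        ("zh", "/ʐ/"):  "rh",  # Pinyin "r", as in 日; absent from English, so independent
    }

    def training_label(language, ipa_phoneme):
        # Map a (language, IPA phoneme) pair to the label used for model training
        return MIXED_LABELS[(language, ipa_phoneme)]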
It can be understood that the training process of the preset acoustic model may use any existing model training method; the embodiment of the present invention does not limit the training method of the preset acoustic model. Because training a model is a conventional technical means, a detailed description is omitted here.
After training to obtain the preset acoustic model, the voice data of the preset text pronunciation can be evaluated and corrected according to the preset acoustic model.
Since the preset acoustic model is obtained by training according to the phoneme data of at least two language types and the training data of the at least two language types, the preset acoustic model can learn the pronunciations of at least two languages simultaneously, and can detect, in addition to the mispronounced phonemes in the foreign-language pronunciation read aloud by the user, the error phonemes brought into the foreign-language pronunciation by the user's native-language pronunciation. For example, because speakers from Hunan and Sichuan often do not distinguish "l" and "n", when learning English they may pronounce the "n" in "I don't know" as "l". Through the embodiment of the invention, this mispronunciation can be detected, and error correction information can be output to the user, for example the error phoneme "l" in the voice data and the corresponding correct phoneme "n", so that the user can correct his or her own pronunciation according to the error correction information, thereby improving the accuracy and efficiency with which the user learns spoken foreign-language pronunciation.
In an alternative embodiment of the present invention, each piece of the training data of the at least two language types corresponds to at least two language types, and/or each piece of the training data of the at least two language types corresponds to one language type.
Specifically, the embodiment of the invention can collect single-language voice data corresponding to each of the at least two language types, and train the preset acoustic model according to a training data set composed of the single-language voice data corresponding to each language type. For example, the voice data corresponding to the Chinese sentence "The weather is really good today" is one piece of single-language voice data, and the voice data corresponding to the English sentence "What's the weather like today?" is another piece of single-language voice data.
In daily language use, mixed expressions of multiple languages may occur. Taking mixed Chinese-English expression as an example, a user may embed English words in otherwise Chinese sentences, for example a Chinese sentence meaning "I bought the latest iPhone" in which "iPhone" remains English, or one meaning "Let me start with Yesterday Once More" in which the English song title remains. If the preset acoustic model were trained on a training data set consisting only of single-language voice data, some phonemes could be missing from the context triphones, which could make the trained preset acoustic model inaccurate.
For example, for the English word "without", the phoneme sequence is (w, ih, dh, aw, t), where the confusing phonemes corresponding to the phoneme "dh" include: "z" (IPA /z/) and "d" (present in both Chinese and English). When determining the error phoneme, the "dh" in "w-ih+dh" can then be replaced with these confusing phonemes to obtain "w-ih+z" and "w-ih+d".
However, in the case of mixed expression of multiple languages, the user may incorrectly pronounce "dh" as "zhh" (the Pinyin "zh"), yet the phoneme "zhh" does not exist in the English corpus, so the preset acoustic model cannot recognize the error phoneme "zhh". To solve this problem, the training data of the embodiment of the present invention may further include mixed voice data.
Specifically, the embodiment of the invention can collect mixed voice data containing at least two language types to train the preset acoustic model, where mixed voice data means that each piece of data corresponds to at least two language types. For example, the voice data corresponding to the mixed sentences above ("I bought the latest iPhone" and "Let me start with Yesterday Once More"), as well as to the Chinese-English sentence "Shall we 中午去shopping?", are all mixed voice data.
Taking "Shall we 中午去shopping?" as an example, the phoneme sequence corresponding to "we" may be obtained as (w, ih), the phoneme sequence corresponding to "中" as (zhh, u, ng), the phoneme sequence corresponding to "去" as (qh, vh), and the phoneme sequence corresponding to "shopping" as (sh, oo, p, i, ng).
With the mixed voice data added, the phoneme "zhh" can be added to the confusing phonemes of "dh", expanding the confusing phonemes corresponding to the phoneme "dh" to: "z", "d", "zhh". When determining the error phoneme, the "dh" in "w-ih+dh" may be replaced with each confusable phoneme, yielding "w-ih+z", "w-ih+d" and "w-ih+zhh". The triphone "w-ih+zhh" is obtained by the expansion from mixed voice data, so the preset acoustic model obtained by training can detect and correct mispronunciations under mixed expression of multiple languages, improving the accuracy of the user's pronunciation.
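A minimal sketch of this confusing-phoneme expansion follows, using the "dh" example and assuming the table is a plain dictionary:

    # Confusing phonemes of "dh" learned from single-language data
    confusable = {"dh": ["z", "d"]}
    # Mixed Chinese-English voice data exposes the additional confusion with Pinyin "zh"
    confusable["dh"].append("zhh")
    # Candidate context triphones for the error-phoneme decision on "without"
    candidates = [f"w-ih+{c}" for c in confusable["dh"]]
    # -> ["w-ih+z", "w-ih+d", "w-ih+zhh"]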
In summary, according to the embodiment of the invention, voice data of a user pronouncing a preset text and a phoneme sequence corresponding to the preset text can be acquired, the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence can be determined according to a preset acoustic model, and if a matching degree smaller than the preset value is determined to exist, error correction information can be output. Because the preset acoustic model is obtained by training according to the phoneme data of at least two language types and the training data of the at least two language types, the preset acoustic model can learn the pronunciations of at least two languages simultaneously. Through the preset acoustic model, not only can mispronounced phonemes in the foreign-language pronunciation read aloud by the user be detected, but also error phonemes brought into the foreign-language pronunciation by the user's native-language pronunciation can be detected, and error correction information can be output to the user, so that the user can correct his or her own pronunciation according to the error correction information, thereby improving the accuracy and efficiency with which the user learns spoken foreign-language pronunciation.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Device embodiment
With reference to FIG. 2, there is shown a block diagram of an embodiment of a data processing apparatus of the present invention, which may include in particular:
An obtaining module 201, configured to obtain voice data of a preset text pronunciation by a user and a phoneme sequence corresponding to the preset text;
A matching module 202, configured to determine a matching degree between a speech frame in the speech data and a phoneme in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
And the error correction output module 203 is configured to output error correction information if it is determined that there is a matching degree smaller than the preset value.
Optionally, the matching module 202 may specifically include:
A sequence determination submodule, configured to determine a phoneme sequence corresponding to the preset text;
The alignment sub-module is used for aligning the feature vector corresponding to the voice frame in the voice data with the phonemes in the phoneme sequence according to a decoding network formed by the preset acoustic model and the preset text;
and the matching degree determination submodule is used for determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to the likelihood between the aligned feature vector and the corresponding phonemes.
Optionally, the apparatus may further include:
The target determining submodule is used for taking the phonemes with the matching degree smaller than a preset value as target phonemes;
The confusion phoneme determining submodule is used for determining a confusion phoneme corresponding to the target phoneme in the phoneme data of the at least two language types;
And the phoneme output sub-module is used for outputting the confusing phonemes.
Optionally, the apparatus may further include:
a replacing sub-module, configured to replace a target phoneme in the phoneme sequence with the confusing phoneme;
the decoding sub-module is used for decoding the context triphones corresponding to the confusable phonemes in the replaced phoneme sequence to obtain the likelihood corresponding to the confusable phonemes;
and the error determination submodule is used for determining the confusing phonemes with the maximum likelihood as error phonemes in the voice data.
Optionally, the apparatus may further include: the training module is used for training the preset acoustic model; the training module comprises:
the labeling sub-module is used for labeling the phoneme data of at least two language types; wherein, the phoneme data with the same pronunciation or meeting the approximate condition in different language types use the same label symbol, and the phoneme data with different pronunciation in different language types use different label symbol;
and the training sub-module is used for training a preset acoustic model according to the labeled phoneme data and the collected training data of the at least two language types.
Optionally, each of the training data of the at least two language types corresponds to at least two language types, and/or each of the training data of the at least two language types corresponds to one language type.
Optionally, the error correction information includes at least any one of the following: the method comprises the steps of obtaining an acoustic model score corresponding to voice data, an acoustic model score corresponding to each phoneme in the voice data, an error phoneme in the voice data and a correct phoneme corresponding to the error phoneme.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
An embodiment of the present invention provides an apparatus for data processing, including a memory, and one or more programs, wherein the one or more programs are stored in the memory, and configured to be executed by one or more processors, the one or more programs comprising instructions for: acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text; determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types; and if the matching degree smaller than the preset value exists, outputting error correction information.
Fig. 3 is a block diagram illustrating an apparatus 800 for data processing according to an example embodiment. For example, apparatus 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 3, apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen between the device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect an on/off state of the apparatus 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; the sensor assembly 814 may also detect a change in position of the apparatus 800 or of one component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of apparatus 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage media 1930 may be transitory or persistent storage. The program stored in a storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Furthermore, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform the data processing method shown in Fig. 1.
A non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform a data processing method, the method comprising: acquiring voice data of a user pronouncing a preset text and a phoneme sequence corresponding to the preset text; determining the matching degree between the voice frames in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model, the preset acoustic model being obtained by training according to phoneme data of at least two language types and training data of the at least two language types; and if a matching degree smaller than the preset value exists, outputting error correction information.
The embodiment of the invention discloses A1, a data processing method, which comprises the following steps:
Acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
Determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
And if the matching degree smaller than the preset value exists, outputting error correction information.
A2, the method according to A1, wherein the determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence comprises:
determining a phoneme sequence corresponding to the preset text;
Aligning feature vectors corresponding to the voice frames in the voice data with phonemes in the phoneme sequence according to a decoding network formed by the preset acoustic model and the preset text;
And determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to the likelihood between the aligned feature vectors and the corresponding phonemes.
A3, the method according to A2, wherein after determining that there is a matching degree smaller than the preset value, the method further comprises:
taking the phonemes with the matching degree smaller than a preset value as target phonemes;
Determining confusing phonemes corresponding to the target phonemes in the phoneme data of the at least two language types;
outputting the confusing phonemes.
A4, the method according to A3, wherein after determining the confusing phonemes corresponding to the target phonemes, the method further comprises:
Replacing a target phoneme in the phoneme sequence with the confusing phoneme;
Decoding the context triphones corresponding to the confusable phonemes in the replaced phoneme sequence to obtain the likelihood corresponding to the confusable phonemes;
And determining the confusing phonemes with the maximum likelihood as the wrong phonemes in the voice data.
A5, the method according to A1, wherein the preset acoustic model is trained by the following steps:
Labeling the phoneme data of at least two language types; wherein, the phoneme data with the same pronunciation or meeting the approximate condition in different language types use the same label symbol, and the phoneme data with different pronunciation in different language types use different label symbol;
Training a preset acoustic model according to the labeled phoneme data and the collected training data of the at least two language types.
A6, the method according to any one of A1 to A5, wherein each piece of the training data of the at least two language types corresponds to at least two language types, and/or each piece of the training data of the at least two language types corresponds to one language type.
A7, the method according to any one of A1 to A5, wherein the error correction information includes at least one of: an acoustic model score corresponding to the voice data, an acoustic model score corresponding to each phoneme in the voice data, an erroneous phoneme in the voice data, and the correct phoneme corresponding to the erroneous phoneme.
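The A7 list maps naturally onto a small record type. A sketch with field names of my own choosing:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ErrorCorrectionInfo:
    """Container for the A7 error-correction outputs (field names assumed)."""
    utterance_score: float                                          # whole-utterance acoustic score
    phoneme_scores: Dict[str, float] = field(default_factory=dict)  # per-phoneme acoustic scores
    error_phonemes: List[str] = field(default_factory=list)         # what was actually said
    correct_phonemes: List[str] = field(default_factory=list)       # what should have been said

info = ErrorCorrectionInfo(
    utterance_score=-4.2,
    phoneme_scores={"TH": -9.1, "IH": -2.1},
    error_phonemes=["s"],
    correct_phonemes=["TH"],
)
print(info.error_phonemes, info.correct_phonemes)  # ['s'] ['TH']
```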
The embodiment of the invention discloses a B8 data processing device, which comprises:
the acquisition module is used for acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
The matching module is used for determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
and the error correction output module is used for outputting error correction information if it is determined that a matching degree smaller than the preset value exists.
B9, the device according to B8, the matching module includes:
A sequence determination submodule, configured to determine a phoneme sequence corresponding to the preset text;
The alignment sub-module is used for aligning the feature vector corresponding to the voice frame in the voice data with the phonemes in the phoneme sequence according to a decoding network formed by the preset acoustic model and the preset text;
and the matching degree determination submodule is used for determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to the likelihood between the aligned feature vector and the corresponding phonemes.
B10, the apparatus of B9, the apparatus further comprising:
The target determination submodule is used for taking each phoneme whose matching degree is smaller than the preset value as a target phoneme;
The confusable-phoneme determination submodule is used for determining, in the phoneme data of the at least two language types, the confusable phonemes corresponding to the target phoneme;
And the phoneme output submodule is used for outputting the confusable phonemes.
B11, the apparatus of B10, the apparatus further comprising:
a replacement submodule, configured to replace the target phoneme in the phoneme sequence with each confusable phoneme;
the decoding submodule is used for decoding the context triphone corresponding to each confusable phoneme in the replaced phoneme sequence to obtain the likelihood corresponding to that confusable phoneme;
and the error determination submodule is used for determining the confusable phoneme with the maximum likelihood as the erroneous phoneme in the voice data.
B12, the apparatus of B8, the apparatus further comprising: the training module is used for training the preset acoustic model; the training module comprises:
the labeling submodule is used for labeling the phoneme data of the at least two language types; wherein phonemes that have the same pronunciation, or that satisfy a similarity condition, across different language types are labeled with the same symbol, and phonemes with different pronunciations across language types are labeled with different symbols;
and the training sub-module is used for training a preset acoustic model according to the labeled phoneme data and the collected training data of the at least two language types.
B13, the device according to any one of B8 to B12, wherein each item of the training data of the at least two language types corresponds to at least two language types, and/or each item of the training data corresponds to a single language type.
B14, the device according to any one of B8 to B12, wherein the error correction information includes at least one of: an acoustic model score corresponding to the voice data, an acoustic model score corresponding to each phoneme in the voice data, an erroneous phoneme in the voice data, and the correct phoneme corresponding to the erroneous phoneme.
The embodiment of the invention discloses a C15, a device for data processing, which comprises a memory and one or more programs, wherein the one or more programs are stored in the memory, and are configured to be executed by one or more processors, and the one or more programs comprise instructions for:
Acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
Determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
And if a matching degree smaller than the preset value exists, outputting error correction information.
C16, the apparatus according to C15, wherein determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to the preset acoustic model comprises:
determining a phoneme sequence corresponding to the preset text;
Aligning feature vectors corresponding to the voice frames in the voice data with phonemes in the phoneme sequence according to a decoding network formed by the preset acoustic model and the preset text;
And determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to the likelihood between the aligned feature vectors and the corresponding phonemes.
C17, the apparatus according to C16, wherein the one or more programs further comprise instructions for:
taking each phoneme whose matching degree is smaller than the preset value as a target phoneme;
determining, in the phoneme data of the at least two language types, the confusable phonemes corresponding to the target phoneme;
outputting the confusable phonemes.
C18, the apparatus according to C17, wherein the one or more programs further comprise instructions for:
replacing the target phoneme in the phoneme sequence with each confusable phoneme;
decoding the context triphone corresponding to each confusable phoneme in the replaced phoneme sequence to obtain the likelihood corresponding to that confusable phoneme;
determining the confusable phoneme with the maximum likelihood as the erroneous phoneme in the voice data.
C19, the apparatus according to C15, wherein the preset acoustic model is trained by:
labeling the phoneme data of the at least two language types; wherein phonemes that have the same pronunciation, or that satisfy a similarity condition, across different language types are labeled with the same symbol, and phonemes with different pronunciations across language types are labeled with different symbols;
training the preset acoustic model according to the labeled phoneme data and the collected training data of the at least two language types.
C20, the apparatus according to any one of C15 to C19, wherein each item of the training data of the at least two language types corresponds to at least two language types, and/or each item of the training data corresponds to a single language type.
C21, the apparatus according to any one of C15 to C19, wherein the error correction information includes at least one of: an acoustic model score corresponding to the voice data, an acoustic model score corresponding to each phoneme in the voice data, an erroneous phoneme in the voice data, and the correct phoneme corresponding to the erroneous phoneme.
Embodiments of the invention disclose D22, a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments is not intended to limit the invention to the precise forms disclosed; modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.
The foregoing has described in detail a data processing method, a data processing apparatus, and a device for data processing. Specific examples have been used herein to illustrate the principles and embodiments of the invention and to aid understanding of its method and core ideas. At the same time, since those skilled in the art may vary the specific embodiments and the scope of application in accordance with those ideas, the contents of this description should not be construed as limiting the invention.

Claims (8)

1. A method of data processing, the method comprising:
Acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
Determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types; if it is determined that a matching degree smaller than the preset value exists, outputting error correction information;
the training data further includes: mixed voice data comprising at least two language types;
After determining that there is a degree of match less than the preset value, the method further comprises:
taking each phoneme whose matching degree is smaller than the preset value as a target phoneme;
determining, in the phoneme data of the at least two language types, the confusable phonemes corresponding to the target phoneme;
outputting the confusable phonemes;
after determining the confusable phonemes corresponding to the target phoneme, the method further comprises:
replacing the target phoneme in the phoneme sequence with each confusable phoneme;
decoding the context triphone corresponding to each confusable phoneme in the replaced phoneme sequence to obtain the likelihood corresponding to that confusable phoneme;
and determining the confusable phoneme with the maximum likelihood as the erroneous phoneme in the voice data.
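Claim 1 additionally requires that the training data include mixed (code-switched) utterances spanning at least two language types. Assembling such a training manifest might look like the sketch below; the file names, field names, and transcripts are all placeholders for illustration.

```python
import json

# Hypothetical manifest mixing monolingual and code-switched recordings.
manifest = [
    {"audio": "data/en_0001.wav", "langs": ["en"], "text": "good morning"},
    {"audio": "data/zh_0001.wav", "langs": ["zh"], "text": "早上好"},
    # Mixed voice data containing two language types, as claim 1 requires:
    {"audio": "data/mix_0001.wav", "langs": ["zh", "en"],
     "text": "我们明天的 meeting 改到下午"},
]

with open("train_manifest.jsonl", "w", encoding="utf-8") as f:
    for item in manifest:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```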
2. The method of claim 1, wherein determining a degree of matching between a speech frame in the speech data and a phoneme in the sequence of phonemes based on a preset acoustic model comprises:
determining a phoneme sequence corresponding to the preset text;
Aligning feature vectors corresponding to the voice frames in the voice data with phonemes in the phoneme sequence according to a decoding network formed by the preset acoustic model and the preset text;
And determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to the likelihood between the aligned feature vectors and the corresponding phonemes.
3. The method of claim 1, wherein the preset acoustic model is trained by:
labeling the phoneme data of the at least two language types; wherein phonemes that have the same pronunciation, or that satisfy a similarity condition, across different language types are labeled with the same symbol, and phonemes with different pronunciations across language types are labeled with different symbols;
training the preset acoustic model according to the labeled phoneme data and the collected training data of the at least two language types.
4. The method according to any one of claims 1 to 3, wherein each item of the training data of the at least two language types corresponds to at least two language types, and/or each item of the training data corresponds to a single language type.
5. The method according to any one of claims 1 to 3, wherein the error correction information comprises at least any one of: an acoustic model score corresponding to the voice data, an acoustic model score corresponding to each phoneme in the voice data, an erroneous phoneme in the voice data, and the correct phoneme corresponding to the erroneous phoneme.
6. A data processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
The matching module is used for determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
the error correction output module is used for outputting error correction information if it is determined that a matching degree smaller than the preset value exists;
the training data further includes: mixed voice data comprising at least two language types;
the apparatus further comprises:
The target determination submodule is used for taking each phoneme whose matching degree is smaller than the preset value as a target phoneme;
The confusable-phoneme determination submodule is used for determining, in the phoneme data of the at least two language types, the confusable phonemes corresponding to the target phoneme;
A phoneme output submodule is used for outputting the confusable phonemes;
a replacement submodule, configured to replace the target phoneme in the phoneme sequence with each confusable phoneme;
the decoding submodule is used for decoding the context triphone corresponding to each confusable phoneme in the replaced phoneme sequence to obtain the likelihood corresponding to that confusable phoneme;
and the error determination submodule is used for determining the confusable phoneme with the maximum likelihood as the erroneous phoneme in the voice data.
7. An apparatus for data processing comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
Acquiring voice data of a user for pronunciation of a preset text and a phoneme sequence corresponding to the preset text;
Determining the matching degree between the voice frame in the voice data and the phonemes in the phoneme sequence according to a preset acoustic model; the preset acoustic model is obtained by training according to phoneme data of at least two language types and training data of the at least two language types;
If it is determined that a matching degree smaller than the preset value exists, outputting error correction information;
the training data further includes: mixed voice data comprising at least two language types;
After it is determined that a matching degree smaller than the preset value exists, the one or more programs further comprise instructions for:
taking each phoneme whose matching degree is smaller than the preset value as a target phoneme;
determining, in the phoneme data of the at least two language types, the confusable phonemes corresponding to the target phoneme;
outputting the confusable phonemes;
after determining the confusable phonemes corresponding to the target phoneme, the one or more programs further comprise instructions for:
replacing the target phoneme in the phoneme sequence with each confusable phoneme;
decoding the context triphone corresponding to each confusable phoneme in the replaced phoneme sequence to obtain the likelihood corresponding to that confusable phoneme;
and determining the confusable phoneme with the maximum likelihood as the erroneous phoneme in the voice data.
8. A machine readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the data processing method of one or more of claims 1 to 5.
CN201910157570.4A 2019-03-01 2019-03-01 Data processing method and device for data processing Active CN111640452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910157570.4A CN111640452B (en) 2019-03-01 2019-03-01 Data processing method and device for data processing

Publications (2)

Publication Number Publication Date
CN111640452A CN111640452A (en) 2020-09-08
CN111640452B true CN111640452B (en) 2024-05-07

Family

ID=72330552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910157570.4A Active CN111640452B (en) 2019-03-01 2019-03-01 Data processing method and device for data processing

Country Status (1)

Country Link
CN (1) CN111640452B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112837401B (en) * 2021-01-27 2024-04-09 网易(杭州)网络有限公司 Information processing method, device, computer equipment and storage medium
CN112988965B (en) * 2021-03-01 2022-03-08 腾讯科技(深圳)有限公司 Text data processing method and device, storage medium and computer equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3520022B2 (en) * 2000-01-14 2004-04-19 株式会社国際電気通信基礎技術研究所 Foreign language learning device, foreign language learning method and medium
US8392190B2 (en) * 2008-12-01 2013-03-05 Educational Testing Service Systems and methods for assessment of non-native spontaneous speech
US9984677B2 (en) * 2015-09-30 2018-05-29 Nice Ltd. Bettering scores of spoken phrase spotting

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006085035A (en) * 2004-09-17 2006-03-30 Advanced Telecommunication Research Institute International Foreign language study apparatus
CN101105939A (en) * 2007-09-04 2008-01-16 安徽科大讯飞信息科技股份有限公司 Sonification guiding method
CN101447184A (en) * 2007-11-28 2009-06-03 中国科学院声学研究所 Chinese-English bilingual speech recognition method based on phoneme confusion
CN101246685A (en) * 2008-03-17 2008-08-20 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN101739868A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN108352126A (en) * 2015-11-11 2018-07-31 株式会社Mglish Foreign language pronunciation and labelling apparatus and its method, including the use of the motor learning device based on foreign language rhythm action sensor, motor learning method and the electronic medium recorded to it and study teaching material of its device and method
CN108922563A (en) * 2018-06-17 2018-11-30 海南大学 Based on the visual verbal learning antidote of deviation organ morphology behavior
CN109036464A (en) * 2018-09-17 2018-12-18 腾讯科技(深圳)有限公司 Pronounce error-detecting method, device, equipment and storage medium
CN109256152A (en) * 2018-11-08 2019-01-22 上海起作业信息科技有限公司 Speech assessment method and device, electronic equipment, storage medium
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pronunciation quality evaluation algorithm based on a pronunciation confusability model; Huang Shuang; Li Jing; Wang Hongying; Yang Jun; Zhang Bo; Journal of Computer Applications (计算机应用); 2006-12-28 (S2); full text *

Also Published As

Publication number Publication date
CN111640452A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN107632980B (en) Voice translation method and device for voice translation
CN107221330B (en) Punctuation adding method and device and punctuation adding device
CN110210310B (en) Video processing method and device for video processing
CN109243430B (en) Voice recognition method and device
US20160078020A1 (en) Speech translation apparatus and method
CN108399914B (en) Voice recognition method and device
CN111145756B (en) Voice recognition method and device for voice recognition
CN107274903B (en) Text processing method and device for text processing
CN107564526B (en) Processing method, apparatus and machine-readable medium
US20190340233A1 (en) Input method, input device and apparatus for input
CN107291704B (en) Processing method and device for processing
CN111369978B (en) Data processing method and device for data processing
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
CN110992942B (en) Voice recognition method and device for voice recognition
CN111368541B (en) Named entity identification method and device
CN108628813B (en) Processing method and device for processing
CN108628819B (en) Processing method and device for processing
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN110069143B (en) Information error correction preventing method and device and electronic equipment
KR20210032875A (en) Voice information processing method, apparatus, program and storage medium
CN114154459A (en) Speech recognition text processing method and device, electronic equipment and storage medium
CN111640452B (en) Data processing method and device for data processing
CN111160047A (en) Data processing method and device and data processing device
CN113539233A (en) Voice processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220728

Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Applicant before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.

GR01 Patent grant