CN112951208B - Method and device for speech recognition - Google Patents

Method and device for speech recognition

Info

Publication number: CN112951208B
Application number: CN201911172039.0A
Authority: CN (China)
Prior art keywords: phoneme, phonemes, pronunciation, recognized, speech
Other languages: Chinese (zh)
Other versions: CN112951208A (application publication)
Inventor: 程建峰 (Cheng Jianfeng)
Assignee (original and current): New Oriental Education Technology Group Co., Ltd.
Priority and filing date: 2019-11-26 (application CN201911172039.0A)
Legal status: Active (the status listed is an assumption, not a legal conclusion)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application provide a method and an apparatus for speech recognition. The method comprises the following steps: acquiring the speech to be recognized; obtaining a classification result for each phoneme of the speech to be recognized from a neural network model, where the classification result indicates which phoneme in a mixed phoneme set the phoneme is, the mixed phoneme set comprises all phonemes of a first language and of a second language, the first language is the target language, and the second language is the native language of the speaker of the speech to be recognized; and determining an evaluation result of the speech to be recognized according to the phoneme classification results. The speech recognition method and apparatus can improve the user experience.

Description

Method and device for speech recognition
Technical Field
The present application relates to the field of speech signal processing technology, and more particularly, to a method and apparatus for speech recognition.
Background
Nowadays more and more people are learning foreign languages, many of them with the help of learning software. For example, many people use software to practice spoken English pronunciation. Good learning software can help users improve their spoken-language level.
However, in current speech recognition schemes such as spoken-language pronunciation evaluation systems, the pronunciation feedback is coarse and the evaluation too simple: the user is left unclear about how to improve the score, so the user experience is poor.
Therefore, it is desirable to provide an effective speech recognition scheme to enhance the user experience.
Disclosure of Invention
The present application provides a speech recognition method and apparatus that can improve the user experience.
In a first aspect, the present application provides a method for speech recognition, the method comprising: acquiring the speech to be recognized; obtaining a classification result for each phoneme of the speech to be recognized from a neural network model, where the classification result indicates which phoneme in a mixed phoneme set the phoneme is, the mixed phoneme set comprises all phonemes of a first language and of a second language, the first language is the target language, and the second language is the native language of the speaker of the speech to be recognized; and determining an evaluation result of the speech to be recognized according to the phoneme classification results.
In the embodiments of the present application, each phoneme of the speech to be recognized is classified by the neural network model against both target-language phonemes and native-language phonemes, so richer feedback can be provided to the user, pronunciation can be corrected in a targeted way, and the user experience is improved.
In some possible implementations, the obtaining the classification result of the phoneme of the speech to be recognized according to the neural network model includes: acquiring the characteristics of the phonemes of the speech to be recognized; and acquiring a classification result of the phoneme according to the characteristics of the phoneme and the neural network model.
In some possible implementations, the first language is English, the second language is Chinese, and the mixed phoneme set includes all phonemes of English and all phonemes of Chinese.
In some possible implementations, the evaluation result of the speech to be recognized includes: whether the pronunciation of the phoneme is correct, and whether the pronunciation of the phoneme is biased toward a Chinese phoneme or an English phoneme.
In some possible implementations, the determining an evaluation result of the speech to be recognized according to the classification result of the phonemes includes: if the phoneme is a correct English phoneme, determining that the pronunciation of the phoneme is correct and the pronunciation of the phoneme is biased to the English phoneme; and if the phoneme is the Chinese phoneme, determining that the pronunciation of the phoneme is wrong and the pronunciation of the phoneme is biased to the Chinese phoneme.
In some possible implementations, the evaluation result of the speech to be recognized further includes a score of the speech to be recognized; the determining an evaluation result of the speech to be recognized according to the classification result of the phonemes comprises: determining the score of the speech to be recognized according to the pronunciation of the phonemes.
In some possible implementations, the method further includes: training the neural network model according to a phoneme sample with a label, wherein the label is the similarity of the phonemes in the mixed phoneme set.
In some possible implementations, the method further includes: obtaining the characteristics of each frame of the audio sample; determining a phoneme sample according to the characteristics of all frames of the audio sample and the position of each phoneme in the audio sample; and marking the phoneme sample with a label to obtain the phoneme sample with the label.
In some possible implementations, the labeling of the phoneme samples includes: for clearly different phonemes, the similarity is set to 0; for phonemes whose articulations are unrelated, the similarity is set to 0; for phonemes whose articulations are similar, the intersection-over-union of their articulation feature sets is taken as the similarity.
In some possible implementations, the neural network model is a triplet neural network model comprising three input layers, two of which are used to input correctly pronounced phoneme samples while the third is used to input incorrectly pronounced phoneme samples, the incorrectly pronounced phoneme samples including phoneme samples whose pronunciation is biased toward phonemes of the second language.
In a second aspect, there is provided an apparatus for speech recognition, comprising means for performing the method of the first aspect or any possible implementation manner thereof.
In a third aspect, the present application further provides a computer including the above-mentioned speech recognition apparatus.
In a fourth aspect, the present application also provides a computer-readable storage medium storing computer-executable instructions configured to perform the above-described method for speech recognition.
In a fifth aspect, the present application also provides a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above-mentioned method of speech recognition.
In a sixth aspect, the present application further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions, when executed by the at least one processor, cause the at least one processor to perform the method of speech recognition described above.
Drawings
FIG. 1 is a schematic diagram of a scenario in which the solution of an embodiment of the present application is applied;
FIG. 2 is a schematic flow chart diagram of a method of speech recognition of one embodiment of the present application;
FIG. 3 is a flow chart of a method of training a neural network of an embodiment of the present application;
FIG. 4 is a schematic diagram of a neural network model architecture of an embodiment of the present application;
FIG. 5 is a schematic diagram of the basic elements of a neural network model of an embodiment of the present application;
FIG. 6 is a diagram showing the evaluation result of speech in the embodiment of the present application;
FIG. 7 is a schematic block diagram of an apparatus for speech recognition according to one embodiment of the present application;
FIG. 8 is a schematic block diagram of an apparatus for speech recognition according to another embodiment of the present application; and
fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. It should be understood that the specific examples in this specification are intended merely to facilitate a better understanding of the embodiments of the application by those skilled in the art and are not intended to limit the scope of the embodiments of the application.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that the various embodiments described in this specification may be implemented individually or in combination, and the examples of this application are not limited to this.
Unless otherwise defined, all technical and scientific terms used in the examples of this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application.
An application scenario of the embodiment of the present application is illustrated below with reference to fig. 1.
Fig. 1 shows a schematic diagram of a scenario in which the speech recognition method of an embodiment of the present application is applied. As shown in Fig. 1, the apparatus 110 for speech recognition is communicatively connected to the input device 120; the speech to be recognized is fed to the apparatus 110 through the input device 120, and the apparatus 110 can evaluate it.
For example, the speech to be recognized may be a piece of speech or a word recorded by the user.
The input device 120 may input a single voice or multiple voices simultaneously, which is not limited in this embodiment of the present application.
The apparatus 110 may be an electronic device or system, such as a computer, having information processing capabilities.
The apparatus 110 includes a processor for processing information, for example, recognizing and evaluating speech by using the technical solution of the embodiment of the present application. The processor may be any kind of processor, and the embodiment of the present application is not limited thereto.
The apparatus 110 may also include a memory. The memory may be used to store information and instructions, such as computer-executable instructions, that implement aspects of embodiments of the present application. The memory may be any kind of memory, and the embodiment of the present application is not limited to this.
The apparatus 110 may further include a communication interface, and the communication interface may be connected to the input device 120 in a wired manner or in a wireless manner.
The apparatus 110 may further include a display device for displaying the processing results, such as whether each phoneme of the speech to be recognized is pronounced correctly, whether its pronunciation is biased toward a Chinese phoneme or an English phoneme, the score, and the like.
A speaker's pronunciation of a foreign language (the target language) may be affected by the speaker's native language. For example, some Chinese speakers pronounce English with a Chinese accent, so-called Chinese English. Existing spoken-language pronunciation assessment schemes cannot feed this situation back: a speaker does not know whether his or her pronunciation carries a Chinese accent, nor which phonemes are biased toward Chinese phonemes, and therefore cannot effectively improve. For this situation, the present application provides an improved technical solution that effectively feeds back the influence of the native language on the target language and judges and scores whether the target-language pronunciation carries a native-language accent. With this scheme, Chinese English can be improved, and the user experience along with it.
Fig. 2 shows a schematic flow diagram of a method 200 of speech recognition of an embodiment of the present application. The method 200 may be performed by the apparatus 110 of fig. 1.
210: Acquire the speech to be recognized.
The speech to be recognized may be a recording of a passage or of a word in the foreign language the user is learning. The user may be a beginner or any other user; the present application is not limited in this respect and can be applied to any speech the user wants evaluated. The language of the speech to be recognized is likewise not limited; for convenience of description, English is used as the example below, and the user's native language is correspondingly exemplified as Chinese.
220: Obtain a classification result for each phoneme of the speech to be recognized from a neural network model, where the classification result indicates which phoneme in a mixed phoneme set the phoneme is, the mixed phoneme set comprises all phonemes of a first language and of a second language, the first language is the target language, and the second language is the native language of the speaker of the speech to be recognized.
The neural network model is pre-trained. Given a phoneme of the speech to be recognized, it determines which phoneme in the mixed phoneme set that phoneme is, where the mixed phoneme set comprises the phonemes of the target language and of the speaker's native language. Taking Chinese speakers learning English as the example, the first language is English, the second language is Chinese, and the mixed phoneme set includes all English phonemes and all Chinese phonemes. For example, the mixed phoneme set may include 41 English phonemes and 69 Chinese phonemes, in which case the model determines which of these 110 phonemes a phoneme of the speech to be recognized is. Since the native language (Chinese) may affect the speaker's pronunciation, the speaker's pronunciation may be biased toward Chinese phonemes. The neural network model of the embodiments can therefore recognize whether a phoneme in the speech is an English phoneme or a Chinese phoneme, giving the user richer feedback with which to correct and improve pronunciation in a targeted way.
Specifically, in a stretch of Chinese English some phonemes may be produced as Chinese phonemes. For example, the word "nice" contains the three phonemes "n", "ai", and "s". In Chinese English, the phoneme "n" may be produced as its Chinese counterpart (as in Mandarin nà, "that") and the phoneme "ai" as the Chinese phoneme ai (as in Mandarin ài, "love"). The neural network model of the embodiments can identify whether each phoneme of the speech was produced as an English phoneme or as a Chinese phoneme, so the pronunciation of Chinese English can be improved in a targeted way.
The neural network model can be trained offline and then deployed to the device that uses it; alternatively, training can be performed by the device that uses the model. That is, the device that trains the model and the device that uses it may be different devices or the same device. The training process is described below. The description is only exemplary and should not be taken as limiting the embodiments of the present application.
Optionally, in an embodiment of the present application, the neural network model is trained according to a phoneme sample with a label, where the label is a similarity about phonemes in the mixed phoneme set.
For example, the neural network model is a triplet neural network model comprising three input layers, two of which are used to input correctly pronounced phoneme samples while the third is used to input incorrectly pronounced phoneme samples, the incorrectly pronounced phoneme samples including phoneme samples whose pronunciation is biased toward phonemes of the second language. In this case, the labels of the phoneme samples are the similarities between them.
In the embodiments of the present application, the samples used to train the neural network include, besides correctly pronounced target-language phoneme samples (e.g., correctly pronounced English phonemes), phoneme samples biased toward the speaker's native pronunciation (e.g., phonemes produced as Chinese phonemes). The labels of the phoneme samples are the similarities between the input samples. With such samples, a neural network model can be trained that identifies whether a phoneme is an English phoneme or a Chinese phoneme, so Chinese English can be evaluated effectively.
In addition, the triplet neural network model can effectively separate different samples, for example confusable English and Chinese phonemes, so a model that reliably distinguishes English phonemes from Chinese phonemes can be trained.
Optionally, as an example, the tag may be marked in at least one of the following ways:
for phonemes with obvious differences, the similarity is set to 0;
for phonemes with irrelevant pronunciation modes, the similarity is set to be 0;
for phonemes whose articulations are similar, the intersection-over-union of their articulation feature sets is taken as the similarity.
In the embodiments of the present application, the labels are set according to how the phonemes are articulated. Specifically, an articulation feature set can be derived for each phoneme from the way it is articulated, and the intersection-over-union of the feature sets of two phonemes can be used as their similarity.
For example, some phonemes of Chinese and English sound similar although their places and manners of articulation differ. The English t is alveolar while the Mandarin Chinese t is dental: a dental sound is formed with the tip or blade of the tongue against the upper teeth, whereas an alveolar sound is formed with the tip or blade of the tongue against the alveolar ridge. Building the features that describe Chinese and English phonemes on such characteristics makes the two easier to distinguish and the neural network model easier to train.
Thus, features describing a phoneme can be constructed from the way it is articulated.
For example, a consonant can be described in the following five aspects:
whether the vocal cords are open or closed (unvoiced vs. voiced);
the place of articulation;
oral or nasal articulation;
the manner of articulation;
central or lateral articulation (whether the airflow passes over the centre or the sides of the tongue).
It should be understood that in most cases a pronunciation need not be described in all five aspects, only in some of them. For example, consonants are by default assumed to be central rather than lateral and oral rather than nasal, so these two aspects can usually be omitted.
For vowels, all vowel sounds are produced with the tip of the tongue resting against the back of the lower teeth and the body of the tongue arched. A vowel can be described in three aspects:
the height of the tongue body;
the front-back position of the tongue, dividing vowels into front, central, and back vowels;
lip rounding, dividing vowels into rounded and unrounded.
For example, the phoneme AE is a front vowel with a low tongue position and unrounded lips, so the feature set {fnt, low, unr, vwl} can be constructed for it.
It should be understood that the above description of the phoneme features is only an example, and other description of the phoneme features may also be adopted, and the embodiment of the present application is not limited thereto.
For two different phonemes, the intersection-over-union of their articulation feature sets can be taken as their similarity. That is, the similarity between two phonemes can be obtained by the following formula:
$$\mathrm{sim}(p_1,p_2)=\frac{|S_{p_1}\cap S_{p_2}|}{|S_{p_1}\cup S_{p_2}|}$$

where $S_p$ denotes the articulation feature set of phoneme $p$.
For example, taking the two phonemes AE and EH: the feature set of AE is {fnt, low, unr, vwl} and the feature set of EH is {cnt, mid, unr, vwl}, so the similarity between the two phonemes is:
$$\mathrm{sim}(\mathrm{AE},\mathrm{EH})=\frac{|\{unr,vwl\}|}{|\{fnt,low,cnt,mid,unr,vwl\}|}=\frac{2}{6}=\frac{1}{3}$$
Alternatively, the similarity is set to 0 for phonemes that are clearly different or whose articulations are unrelated. For example, nasals and plosives are articulated in unrelated ways, so the similarity of such a pair is 0.
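The labelling rules above amount to a simple set operation. The following is a minimal sketch, assuming articulation feature sets are represented as Python sets of short tags such as "fnt" and "unr"; the function name is illustrative.

```python
def phoneme_similarity(feats_a: set, feats_b: set) -> float:
    """Intersection-over-union of two articulation feature sets."""
    if not feats_a or not feats_b:
        return 0.0
    return len(feats_a & feats_b) / len(feats_a | feats_b)

# Worked example from the text: AE vs. EH share {unr, vwl} out of six
# distinct tags, so the similarity is 2/6 = 1/3.
sim = phoneme_similarity({"fnt", "low", "unr", "vwl"},
                         {"cnt", "mid", "unr", "vwl"})
assert abs(sim - 1 / 3) < 1e-9

# Clearly different or unrelated articulations (e.g. a nasal vs. a
# plosive) are labelled 0 directly, before this formula is applied.
```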
FIG. 3 shows a flow diagram of a method of training a neural network according to one embodiment of the present application.
301: Obtain the features of each frame of the audio sample.
Combined time-frequency-domain features, such as 39-dimensional Mel-frequency cepstral coefficient (MFCC) features, are extracted and normalized for each frame of the audio sample.
Specifically, the time-frequency-domain features are combined with formant features, and before the features are fed to the neural network a certain proportion of the entries of the feature matrix is randomly zeroed as an audio data augmentation step. For example, feature extraction may proceed as follows.
1. A high pass filter is used for pre-emphasis.
In a speech signal, most of the energy lies in the low-frequency band, and the power spectral density falls as frequency rises; the output signal-to-noise ratio of the high-frequency band therefore drops noticeably, high-frequency components are transmitted poorly, and signal quality suffers. Pre-emphasis boosts the high-frequency part so that the spectrum becomes flat and a comparable signal-to-noise ratio is available across the whole band from low to high frequencies. It also removes the vocal-cord and lip effects introduced during speech production, compensating the high-frequency components suppressed by the articulation system and highlighting the high-frequency formants. Pre-emphasis is implemented by passing the speech signal through a high-pass filter.
2. The signal is windowed with a Hamming window.

The speech to be recognized is segmented by framing and windowing. For framing, note that a phoneme typically lasts about 50-200 ms, so the frame length is generally kept below 50 ms; the fundamental frequency of speech is around 100 Hz for male voices and 200 Hz for female voices, i.e. periods of 10 ms and 5 ms, and since a frame should contain several periods it is generally at least 20 ms long. The purpose of windowing is to taper the amplitude of each frame gradually to 0 at both ends, which improves the resolution of the subsequent transform. Optionally, in this embodiment of the present application, the windowing operation uses a Hamming window.
3. The time domain signal is converted to the frequency domain.
For example, for each frame that is cut out, a corresponding frequency spectrum is obtained by Fast Fourier Transform (FFT).
4. The frequency-domain amplitudes are reduced with a Mel-scale filter bank.
5. A non-linear transformation is applied to the outputs of the Mel-scale filter bank.
6. The features from steps 2 and 5 are concatenated frame by frame to form a two-dimensional feature matrix.
7. Data augmentation: the two-dimensional feature matrix is normalized to zero mean and unit variance, and a proportion of its entries, e.g. 5%, 10%, or 15%, is zeroed. Data augmentation effectively increases the sample size and thus improves training; a sketch of the whole pipeline is given below.
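The following is a minimal sketch of steps 1-7, assuming librosa is used for the MFCC computation; the sample rate, frame sizes, and mask ratio are illustrative assumptions, not values taken from the patent.

```python
import numpy as np
import librosa

def extract_features(wav_path: str, mask_ratio: float = 0.05) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    y = librosa.effects.preemphasis(y)                 # step 1: pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400,             # 25 ms frames,
                                hop_length=160,        # 10 ms hop
                                window="hamming")      # steps 2-5
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([mfcc, d1, d2])                  # step 6: 39 x n_frames
    feats -= feats.mean(axis=1, keepdims=True)         # step 7: zero mean,
    feats /= feats.std(axis=1, keepdims=True) + 1e-8   # unit variance
    mask = np.random.rand(*feats.shape) < mask_ratio
    feats[mask] = 0.0                                  # step 7: random zeroing
    return feats.T                                     # (n_frames, 39)
```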
302: Determine phoneme samples from the features of all frames of the audio sample and the position of each phoneme in the audio sample.
A phoneme corresponds to a certain number of frames, so a phoneme sample can be cut out of the audio sample according to each phoneme's position, e.g. its starting time. Each phoneme sample then contains the features of the corresponding frames, for example as sketched below.
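A minimal sketch of this step, assuming an alignment that gives each phoneme's start frame and length is available; the tuple layout is a hypothetical choice.

```python
def cut_phoneme_samples(frame_feats, alignment):
    """frame_feats: (n_frames, dim) feature array;
    alignment: list of (phoneme_label, start_frame, n_frames) tuples."""
    return [(label, frame_feats[start:start + length])
            for label, start, length in alignment]
```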
303: Label the phoneme samples to obtain labelled phoneme samples.
For example, the phoneme samples may be labelled in the manner described above. Optionally, the labels may be encoded with a discretized standard normal distribution.
304: Train the neural network model from the labelled phoneme samples.
The labelled phoneme samples are fed to the input layers of the neural network model. The model may be a triplet network (Triplet Network). As shown in Fig. 4, the two phoneme samples X1 and X2 of the triplet input are correctly pronounced phonemes, X2 being of the same kind as or similar in pronunciation to X1, while X3 is a phoneme whose pronunciation differs clearly from X1. The triplet is fed into the same neural network 401 (the branches share parameters). Training the triplet network pulls phonemes of the same kind or with similar pronunciation as close together as possible and pushes clearly different phonemes as far apart as possible; this lets a triplet network separate confusable phonemes well. The neural network 401 is followed by a loss layer 402, through which it is trained: for example, the loss may be the difference between the similarity the network actually outputs and the label, and the weights of the network 401 are adjusted repeatedly according to this difference until convergence. Optionally, the network 401 may be trained with mini-batches and stochastic gradient descent. The trained network 401 covers the mixed phoneme set, so when the features of a phoneme are fed in later, the model determines which phoneme in the mixed phoneme set it is; that is, phonemes can be classified. A sketch of one training step follows.
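A minimal training-step sketch. The patent describes the loss only as the difference between the output similarity and the label; the shared encoder, the cosine similarity, and the MSE loss below are plausible assumptions, not the patent's mandated choices.

```python
import torch.nn.functional as F

def triplet_step(encoder, optimizer, x1, x2, x3, sim12, sim13):
    """x1, x2: correctly pronounced (similar) samples; x3: a clearly
    different sample; sim12, sim13: labelled similarities (IoU rule)."""
    e1, e2, e3 = encoder(x1), encoder(x2), encoder(x3)  # shared weights
    pred12 = F.cosine_similarity(e1, e2, dim=-1)
    pred13 = F.cosine_similarity(e1, e3, dim=-1)
    loss = F.mse_loss(pred12, sim12) + F.mse_loss(pred13, sim13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                    # mini-batch SGD
    return loss.item()
```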
Alternatively, the neural network 401 may be built from basic block units. Fig. 5 shows a schematic diagram of one block. As shown in Fig. 5, each block contains two convolutional groups connected in series. One branch of the first group consists of convolutional neural network (CNN) layer cnn1 with batch normalization (BN) layer bn1 followed by cnn2 with bn2; the other branch is cnn3 with bn3. One branch of the second group consists of cnn4 with bn4 followed by cnn5 with bn5. The convolutional layers may use standard 3×3 filters, and each is followed by an activation layer, such as a rectified linear unit (ReLU).
To counter vanishing and exploding gradients, the other branch of the second convolutional group may be a short-circuit (identity shortcut) connection.
After a number of blocks, two bidirectional recurrent neural network (RNN) layers with temporal-modelling capability are combined with the last block; the RNN may contain 512 RNN cells. The loss layer is connected after the RNN. A sketch of such a block follows.
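A minimal sketch of the block of Fig. 5 and the RNN that follows, assuming 2-D convolutions with padding to keep shapes compatible; the channel count and the 512-dimensional RNN input are illustrative assumptions.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        # first group: a two-conv branch (cnn1/bn1 -> cnn2/bn2) ...
        self.branch12 = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        # ... plus a one-conv branch (cnn3/bn3)
        self.branch3 = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        # second group: a two-conv branch (cnn4/bn4 -> cnn5/bn5); its other
        # branch is the identity shortcut described above
        self.branch45 = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.branch12(x) + self.branch3(x))  # first group
        return self.relu(self.branch45(x) + x)             # shortcut add

# Two bidirectional RNN layers (512 cells each) follow the last block.
rnn = nn.RNN(input_size=512, hidden_size=512, num_layers=2,
             bidirectional=True, batch_first=True)
```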
It should be understood that the structure of the neural network described above is merely an example, and should not be construed as limiting the embodiments of the present application.
Once the neural network model is trained, speech recognition proceeds by first obtaining the features of the phonemes of the speech to be recognized and then obtaining each phoneme's classification result from those features and the neural network model. The phoneme features can be extracted in the same way as in the training phase described above. Feeding a phoneme's features into the model yields the classification result, i.e. which phoneme in the mixed phoneme set it is, for example as in the sketch below.
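A minimal inference sketch, assuming the trained encoder is used as a nearest-prototype classifier over the mixed set (41 English + 69 Chinese phonemes); keeping one reference embedding per phoneme is an assumption made here for illustration.

```python
import torch.nn.functional as F

def classify_phoneme(encoder, feats, prototypes):
    """prototypes: dict mapping each of the 110 mixed-set phoneme labels
    to a reference embedding."""
    emb = encoder(feats)
    scored = {label: F.cosine_similarity(emb, ref, dim=-1).item()
              for label, ref in prototypes.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]   # label in the mixed set + similarity
```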
230: Determine the evaluation result of the speech to be recognized according to the phoneme classification results.
Once the neural network model has determined, for each phoneme of the speech to be recognized, which phoneme in the mixed phoneme set it is, the evaluation result of the speech can be generated from these classification results.
Optionally, the evaluation result of the speech to be recognized includes: whether the pronunciation of each phoneme is correct, and whether its pronunciation is biased toward a Chinese phoneme or an English phoneme. That is, besides whether a phoneme is pronounced correctly, the technical solution of the embodiments also outputs whether its pronunciation is biased toward Chinese or English phonemes.
In this case, the evaluation result of the speech to be recognized can be determined as follows:
if the phoneme is a correct English phoneme, determining that the pronunciation of the phoneme is correct and the pronunciation of the phoneme is biased to the English phoneme;
and if the phoneme is the Chinese phoneme, determining that the pronunciation of the phoneme is wrong and the pronunciation of the phoneme is biased to the Chinese phoneme.
In addition, if the phoneme is an English phoneme but not the correct one, it may be determined that the pronunciation of the phoneme is wrong and that it is biased toward English phonemes. These rules are sketched below.
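A minimal sketch of the three rules above, assuming the phoneme inventories are available as sets; the function and field names are hypothetical.

```python
def evaluate_phoneme(recognized: str, expected: str,
                     english_phonemes: set, chinese_phonemes: set) -> dict:
    if recognized in chinese_phonemes:
        # pronounced as a Chinese phoneme: wrong, biased toward Chinese
        return {"correct": False, "biased_to": "Chinese"}
    if recognized in english_phonemes:
        # an English phoneme: correct only if it is the expected one
        return {"correct": recognized == expected, "biased_to": "English"}
    return {"correct": False, "biased_to": "unknown"}
```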
By feeding back whether the pronunciation of each of the user's phonemes is biased toward Chinese or English phonemes, the embodiments of the present application make it easy for the user to correct, in a targeted way, the pronunciations biased toward Chinese phonemes, which can effectively raise the user's pronunciation level.
Optionally, the evaluation result of the speech to be recognized may further include a score of the speech to be recognized. That is, in this case, the evaluation result of the speech to be recognized may include: whether the pronunciation of the phoneme is correct or not, whether the pronunciation of the phoneme is biased to be Chinese phoneme or English phoneme, and scoring of the voice to be recognized.
In this case, the score of the speech to be recognized may be determined based on the pronunciation of the phoneme. That is, the score of the speech to be recognized may be determined according to the pronunciation of each phoneme in the speech to be recognized.
For example, the score of a word may be determined from the pronunciation of each of its phonemes. If the speech to be recognized contains several words, the pronunciation quality (scores) of the words can be fused into the score of the speech to be recognized; the score can also be determined directly from the pronunciation of every phoneme of the speech; and other fusion methods may be used as well. The present application is not limited in this respect, as long as the score of the speech to be recognized is determined from the pronunciation of its phonemes.
Optionally, in an embodiment of the present application, the scores of the phonemes relative to the correct phonemes may also be obtained through a neural network model. The score is used as the score of the phoneme for feeding back the pronunciation quality of the phoneme and can also be used for determining the score of the speech to be recognized.
For example, when the features of a phoneme are fed into the neural network model, the model outputs, besides which phoneme in the mixed phoneme set it is, the corresponding similarity, which is used to generate the phoneme's score (its score relative to the correct phoneme). One possible fusion of such scores into word and utterance scores is sketched below.
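A minimal sketch of one possible fusion; averaging is an assumption, since the text explicitly allows other fusion methods.

```python
def score_utterance(word_phoneme_scores):
    """word_phoneme_scores: one list of phoneme scores (0..1) per word.
    Returns the per-word scores and the utterance score."""
    word_scores = [sum(ps) / len(ps) for ps in word_phoneme_scores if ps]
    utterance = sum(word_scores) / len(word_scores) if word_scores else 0.0
    return word_scores, utterance
```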
In the embodiments of the present application, each phoneme of the speech to be recognized is classified by the neural network model against both target-language phonemes and native-language phonemes, so richer feedback can be provided to the user, pronunciation can be corrected in a targeted way, and the user experience is improved.
Fig. 6 is a schematic diagram of a speech evaluation result obtained with the technical solution of the embodiments of the present application. It can be seen that, for the word "nice", the display shows whether each phoneme was pronounced as English or Chinese, whether the pronunciation is correct, the score of each phoneme, and the score of the word. In the displayed result, the phoneme "n" was produced as a Chinese phoneme (as in nà, "that") and the phoneme "ai" as a Chinese phoneme (as in ài, "love"), so the user can clearly see that the pronunciation is biased toward Chinese and correct it in a targeted way, which improves the user experience.
According to the technical solution of the embodiments of the present application, whether the target-language pronunciation carries a native-language accent is fed back and evaluated, which helps the user overcome the influence of the native language on foreign-language pronunciation and improves the user experience.
The method embodiments of the present application are described in detail above with reference to Figs. 1 to 6; the apparatus embodiments are described below with reference to Figs. 7 to 9. The apparatus embodiments correspond to the method embodiments, so parts not described in detail can be found in the preceding method embodiments, and the apparatus can implement any possible implementation of the method above.
Fig. 7 shows a schematic block diagram of an apparatus 700 for speech recognition according to an embodiment of the present application. The apparatus 700 may perform the method for speech recognition according to the embodiment of the present application, for example, the apparatus 700 may be the apparatus 110.
As shown in fig. 7, the apparatus 700 may include:
an obtaining module 710, configured to obtain a speech to be recognized;
a classification module 720, configured to obtain a classification result of a phoneme of the speech to be recognized according to a neural network model, where the classification result of the phoneme includes which phoneme in a mixed phoneme set the phoneme is, where the mixed phoneme set includes all phonemes of a first language and a second language, the first language is a target language, and the second language is a native language of a speaker of the speech to be recognized;
and the speech recognition module 730 is configured to determine an evaluation result of the speech to be recognized according to the classification result of the phonemes.
Optionally, the classification module 720 is specifically configured to:
acquiring the characteristics of the phonemes of the speech to be recognized;
and acquiring the classification result of the phoneme according to the characteristics of the phoneme and the neural network model.
Optionally, the first language is English, the second language is Chinese, and the mixed phoneme set includes all phonemes of English and all phonemes of Chinese.
Optionally, the evaluation result of the speech to be recognized includes: whether the pronunciation of the phoneme is correct, and whether the pronunciation of the phoneme is biased toward a Chinese phoneme or an English phoneme.
Optionally, the speech recognition module 730 is specifically configured to:
if the phoneme is a correct English phoneme, determining that the pronunciation of the phoneme is correct and the pronunciation of the phoneme is biased to the English phoneme;
if the phoneme is a Chinese phoneme, determining that the pronunciation of the phoneme is wrong and the pronunciation of the phoneme is biased to the Chinese phoneme.
Optionally, the evaluation result of the speech to be recognized further includes a score of the speech to be recognized;
the speech recognition module 730 is specifically configured to:
and determining the grade of the voice to be recognized according to the pronunciation of the phoneme.
Optionally, as shown in fig. 8, the apparatus 700 may further include:
a training module 740 configured to train the neural network model according to the phoneme sample with a label, where the label is a similarity with respect to the phonemes in the mixed phoneme set.
Optionally, the training module 740 is configured to:
obtaining the characteristics of each frame of the audio sample;
determining a phoneme sample according to the characteristics of all frames of the audio sample and the position of each phoneme in the audio sample;
and marking the phoneme sample with a label to obtain the phoneme sample with the label.
Optionally, the training module 740 is configured to:
for phonemes with obvious differences, the similarity is set to 0;
for phonemes with irrelevant pronunciation modes, the similarity is set to be 0;
for phonemes whose articulations are similar, the intersection-over-union of their articulation feature sets is taken as the similarity.
Optionally, the neural network model is a triplet neural network model comprising three input layers, two of which are used to input correctly pronounced phoneme samples while the third is used to input incorrectly pronounced phoneme samples, the incorrectly pronounced phoneme samples including phoneme samples whose pronunciation is biased toward phonemes of the second language.
The embodiment of the present application further provides a computer (or other terminal devices) including the speech recognition apparatus 700 described above.
An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions configured to execute the voice recognition method 200.
Embodiments of the present application also provide a computer program product, which includes a computer program stored on a computer-readable storage medium, the computer program including program instructions, which, when executed by a computer, cause the computer to execute the above-mentioned speech recognition method 200.
The computer-readable storage medium described above may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.
An embodiment of the present application further provides an electronic device 900, a structure of which is shown in fig. 9, where the electronic device includes:
at least one processor 910, such as the processor 910 in Fig. 9, and a memory 920, and may further include a communication interface 940 and a bus 930. The processor 910, the communication interface 940, and the memory 920 communicate with one another via the bus 930. The communication interface 940 may be used for information transfer. The processor 910 may invoke logic instructions in the memory 920 to perform the speech recognition method of the embodiments above.
Furthermore, the logic instructions in the memory 920 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as a stand-alone product.
The memory 920 is used as a computer-readable storage medium for storing software programs, computer-executable programs, such as program instructions or modules corresponding to the methods in the embodiments of the present application. The processor 910 executes functional applications and data processing, namely, a method of speech recognition in the above-described method embodiments, by executing software programs, instructions and modules stored in the memory 920.
The memory 920 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. Additionally, memory 920 may include high speed random access memory and may also include non-volatile memory.
The technical solution of the embodiment of the present application may be embodied in the form of a software product, where the computer software product is stored in a storage medium and includes one or more instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiment of the present application. And the aforementioned storage media may be non-transitory storage media comprising: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, may also be transient storage media.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and in actual implementation, there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The words used in this application are words of description only and not of limitation of the claims. As used in the description of the embodiments and the claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this application is meant to encompass any and all possible combinations of one or more of the associated listed. Furthermore, the terms "comprises" and/or "comprising," when used in this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The various aspects, implementations, or features of the described embodiments can be used alone or in any combination. Aspects of the described embodiments may be implemented by software, hardware, or a combination of software and hardware. The described embodiments may also be embodied by a computer-readable medium having computer-readable code stored thereon, the computer-readable code comprising instructions executable by at least one computing device. The computer readable medium can be associated with any data storage device that can store data which can be read by a computer system. Exemplary computer readable media can include read-only memory, random-access memory, CD-ROMs, HDDs, DVDs, magnetic tape, and optical data storage devices, among others. The computer readable medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
The above description of the technology may refer to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration embodiments in which the described embodiments may be practiced. These embodiments, while described in sufficient detail to enable those skilled in the art to practice them, are non-limiting; other embodiments may be utilized and changes may be made without departing from the scope of the described embodiments. For example, the order of operations described in a flowchart is non-limiting, and thus the order of two or more operations illustrated in and described in accordance with the flowchart may be altered in accordance with several embodiments. As another example, in several embodiments, one or more operations illustrated in and described with respect to the flowcharts may be optional or may be eliminated. Additionally, certain steps or functions may be added to the disclosed embodiments, or two or more steps may be permuted in order. All such variations are considered to be encompassed by the disclosed embodiments and the claims.
Additionally, terminology is used in the foregoing description of the technology to provide a thorough understanding of the described embodiments. However, no unnecessary detail is required to implement the described embodiments. Accordingly, the foregoing description of the embodiments has been presented for purposes of illustration and description. The embodiments presented in the foregoing description and the examples disclosed in accordance with these embodiments are provided solely to add context and aid in the understanding of the described embodiments. The foregoing description is not intended to be exhaustive or to limit the described embodiments to the precise form disclosed. Many modifications, alternative uses, and variations are possible in light of the above teaching. In some instances, well known process steps have not been described in detail in order to avoid unnecessarily obscuring the described embodiments.
The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A method of speech recognition, comprising:
acquiring a speech to be recognized;
obtaining a classification result of phonemes of the speech to be recognized according to a neural network model, wherein the classification result of the phonemes comprises which phoneme in a mixed phoneme set the phoneme is, the mixed phoneme set comprises all phonemes of a first language and a second language, the first language is a target language, and the second language is a native language of a speaker of the speech to be recognized;
determining an evaluation result of the speech to be recognized according to the classification result of the phonemes;
the method further comprises the following steps:
training the neural network model according to the phoneme samples with labels, wherein the labels are the similarity of the phonemes in the mixed phoneme set.
2. The method according to claim 1, wherein the obtaining the classification result of the phoneme of the speech to be recognized according to the neural network model comprises:
acquiring the characteristics of the phonemes of the speech to be recognized;
and acquiring the classification result of the phoneme according to the characteristics of the phoneme and the neural network model.
3. The method of claim 1, wherein the first language is English and the second language is Chinese, and wherein the mixed phoneme set includes all phonemes of English and all phonemes of Chinese.
4. The method according to claim 3, wherein the evaluation result of the speech to be recognized comprises: whether the pronunciation of the phoneme is correct, and whether the pronunciation of the phoneme is biased toward a Chinese phoneme or an English phoneme.
5. The method according to claim 4, wherein the determining the evaluation result of the speech to be recognized according to the classification result of the phoneme comprises:
if the phoneme is a correct English phoneme, determining that the pronunciation of the phoneme is correct and the pronunciation of the phoneme is biased to the English phoneme;
and if the phoneme is the Chinese phoneme, determining that the pronunciation of the phoneme is wrong and the pronunciation of the phoneme is biased to the Chinese phoneme.
6. The method according to claim 4, wherein the evaluation result of the speech to be recognized further includes a score of the speech to be recognized;
the determining the evaluation result of the speech to be recognized according to the classification result of the phonemes comprises:
and determining the score of the speech to be recognized according to the pronunciation of the phonemes.
7. The method of claim 1, further comprising:
obtaining the characteristics of each frame of the audio sample;
determining a phoneme sample according to the characteristics of all frames of the audio sample and the position of each phoneme in the audio sample;
and marking the phoneme sample with a label to obtain the phoneme sample with the label.
8. The method of claim 7, wherein said tagging the phoneme sample comprises:
for phonemes with obvious differences, the similarity is set to 0;
for phonemes with irrelevant pronunciation modes, the similarity is set to be 0;
for phonemes whose articulations are similar, the intersection-over-union of their articulation feature sets is taken as the similarity.
9. The method according to any one of claims 1 to 8, wherein the neural network model is a triplet neural network model comprising three input layers, two of the three input layers being for inputting correctly pronounced phoneme samples and the other input layer being for inputting incorrectly pronounced phoneme samples, the incorrectly pronounced phoneme samples comprising phoneme samples whose pronunciation is biased toward phonemes of the second language.
10. An apparatus for speech recognition, comprising:
the acquisition module is used for acquiring the voice to be recognized;
the classification module is used for obtaining a classification result of phonemes of the speech to be recognized according to a neural network model, wherein the classification result of the phonemes comprises which phoneme in a mixed phoneme set the phoneme is, the mixed phoneme set comprises all phonemes of a first language and a second language, the first language is a target language, and the second language is a native language of a speaker of the speech to be recognized;
the speech recognition module is used for determining an evaluation result of the speech to be recognized according to the classification result of the phonemes;
the device further comprises:
and the training module is used for training the neural network model according to the phoneme sample with the label, wherein the label is the similarity of the phonemes in the mixed phoneme set.
11. The apparatus of claim 10, wherein the classification module is specifically configured to:
acquiring the characteristics of the phoneme of the speech to be recognized;
and acquiring the classification result of the phoneme according to the characteristics of the phoneme and the neural network model.
12. The apparatus of claim 10, wherein the first language is English and the second language is Chinese, and wherein the mixed phoneme set includes all phonemes of English and all phonemes of Chinese.
13. The apparatus according to claim 12, wherein the evaluation result of the speech to be recognized includes: whether the pronunciation of the phoneme is correct, and whether the pronunciation of the phoneme is biased toward a Chinese phoneme or an English phoneme.
14. The apparatus of claim 13, wherein the speech recognition module is specifically configured to:
if the phoneme is classified as a correct English phoneme, determining that the pronunciation of the phoneme is correct and is biased toward the English phoneme;
and if the phoneme is classified as a Chinese phoneme, determining that the pronunciation of the phoneme is incorrect and is biased toward the Chinese phoneme.
15. The apparatus according to claim 13, wherein the evaluation result of the speech to be recognized further includes a score of the speech to be recognized;
the speech recognition module is specifically configured to:
and determining the score of the speech to be recognized according to the pronunciations of the phonemes.
16. The apparatus of claim 10, wherein the training module is configured to:
obtaining the characteristics of each frame of the audio sample;
determining a phoneme sample according to the characteristics of all frames of the audio sample and the position of each phoneme in the audio sample;
and labeling the phoneme sample to obtain the labeled phoneme sample.
17. The apparatus of claim 16, wherein the training module is configured to:
for phonemes that are clearly different, setting the similarity to 0;
for phonemes whose manners of articulation are unrelated, setting the similarity to 0;
and for phonemes with similar manners of articulation, taking the intersection over union (IoU) of their pronunciation feature sets as the similarity.
18. The apparatus according to any one of claims 10 to 17, wherein the neural network model is a triplet neural network model comprising three input layers, two of the three input layers being for inputting correctly pronounced phoneme samples and the other input layer being for inputting incorrectly pronounced phoneme samples, the incorrectly pronounced phoneme samples comprising phoneme samples whose pronunciation is biased toward phonemes of the second language.
CN201911172039.0A 2019-11-26 2019-11-26 Method and device for speech recognition Active CN112951208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911172039.0A CN112951208B (en) 2019-11-26 2019-11-26 Method and device for speech recognition

Publications (2)

Publication Number Publication Date
CN112951208A (en) 2021-06-11
CN112951208B (en) 2022-09-23

Family

ID=76224882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911172039.0A Active CN112951208B (en) 2019-11-26 2019-11-26 Method and device for speech recognition

Country Status (1)

Country Link
CN (1) CN112951208B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826325A (en) * 2010-03-10 2010-09-08 Huawei Device Co., Ltd. Method and device for identifying Chinese and English speech signals
CN107195296A (en) * 2016-03-15 2017-09-22 Alibaba Group Holding Ltd. Speech recognition method, apparatus, terminal and system
CN109712643A (en) * 2019-03-13 2019-05-03 Beijing Jinghong Software Technology Co., Ltd. Method and apparatus for speech assessment
CN110415725A (en) * 2019-07-15 2019-11-05 Beijing Language and Culture University Method and system for evaluating second-language pronunciation quality using first-language data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9728185B2 (en) * 2014-05-22 2017-08-08 Google Inc. Recognizing speech using neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant