CN112687291A - Pronunciation defect recognition model training method and pronunciation defect recognition method


Info

Publication number
CN112687291A
CN112687291A (application CN202011519662.1A)
Authority
CN
China
Prior art keywords
voice
language
pronunciation
target language
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011519662.1A
Other languages
Chinese (zh)
Other versions
CN112687291B (en)
Inventor
徐飞扬 (Xu Feiyang)
张弢 (Zhang Tao)
李鑫 (Li Xin)
凌震华 (Ling Zhenhua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by iFlytek Co Ltd
Priority claimed from CN202011519662.1A
Publication of CN112687291A
Application granted
Publication of CN112687291B
Legal status: Active

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a pronunciation defect recognition model training method and a pronunciation defect recognition method. The training method comprises: after a voice sample and its defect label are acquired, generating the speech features of the voice sample from the voice sample, the speech features comprising acoustic features, phoneme features, and pronunciation attribute features; and training a pronunciation defect recognition model according to the speech features of the voice sample and its defect label. Because the speech features comprise acoustic, phoneme, and pronunciation attribute features, they can accurately and comprehensively characterize the pronunciation information carried by the voice sample, so a pronunciation defect recognition model trained on these features and the defect labels has better pronunciation defect recognition performance, and the trained model can subsequently recognize pronunciation defects accurately.

Description

Pronunciation defect recognition model training method and pronunciation defect recognition method
Technical Field
The application relates to the technical field of computers, in particular to a pronunciation defect recognition model training method and a pronunciation defect recognition method.
Background
A pronunciation defect is a problem in a person's pronunciation when speaking, and it may manifest in many ways (e.g., inaccurate pronunciation or labored pronunciation). Currently, the common approach to pronunciation defect recognition is for trained assessors to judge, by subjective auditory perception, whether a person has a pronunciation defect.
However, such subjective judgments are error-prone, so the recognition accuracy of this common approach is low; how to recognize pronunciation defects accurately has therefore become an urgent technical problem.
Disclosure of Invention
The embodiment of the present application mainly aims to provide a pronunciation defect recognition model training method and a pronunciation defect recognition method, which can accurately recognize pronunciation defects.
The embodiment of the application provides a pronunciation defect recognition model training method, which comprises the following steps:
acquiring a voice sample and a defect label of the voice sample; wherein the defect label of the voice sample is used for indicating whether the provider of the voice sample has a pronunciation defect;
generating voice characteristics of the voice sample according to the voice sample; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
and training a pronunciation defect recognition model according to the voice characteristics of the voice sample and the defect label of the voice sample.
The embodiment of the application also provides a pronunciation defect recognition method, which comprises the following steps:
acquiring a target language test voice of a user to be recognized;
generating the voice characteristics of the target language test voice according to the target language test voice; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
inputting the voice characteristics of the target language test voice into a pre-trained pronunciation defect recognition model to obtain a first recognition result output by the pronunciation defect recognition model; wherein the pronunciation defect recognition model is trained using any implementation of the pronunciation defect recognition model training method provided by the embodiments of the present application;
and determining whether the user to be recognized has pronunciation defects according to the first recognition result.
The embodiment of the present application further provides a pronunciation defect recognition model training device, the device includes:
the first acquisition unit is used for acquiring a voice sample and a defect label of the voice sample; wherein the defect label of the voice sample is used for indicating whether the provider of the voice sample has a pronunciation defect;
the first generating unit is used for generating the voice characteristics of the voice sample according to the voice sample; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
and the model training unit is used for training a pronunciation defect recognition model according to the voice characteristics of the voice sample and the defect label of the voice sample.
The embodiment of the present application further provides a pronunciation defect recognition device, the device includes:
the second acquisition unit is used for acquiring a target language test voice of the user to be recognized;
the second generating unit is used for generating the voice characteristics of the target language test voice according to the target language test voice; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
the first recognition unit is used for inputting the voice characteristics of the target language test voice into a pre-trained pronunciation defect recognition model to obtain a first recognition result output by the pronunciation defect recognition model; wherein the pronunciation defect recognition model is trained using any implementation of the pronunciation defect recognition model training method provided by the embodiments of the present application;
and the first determining unit is used for determining whether the user to be recognized has pronunciation defects according to the first recognition result.
Based on the technical scheme, the method has the following beneficial effects:
according to the pronunciation defect recognition model training method, after a voice sample and a defect label of the voice sample are obtained, the voice feature of the voice sample is generated according to the voice sample, so that the voice feature comprises an acoustic feature, a phoneme feature and a pronunciation attribute feature; and training a pronunciation defect recognition model according to the voice characteristics of the voice sample and the defect label of the voice sample. Wherein, the defect label of the voice sample is used for indicating whether the provider of the voice sample has pronunciation defect.
Therefore, the voice characteristics of the voice sample comprise the acoustic characteristics of the voice sample, the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample, so that the voice characteristics of the voice sample can accurately represent pronunciation information carried by the voice sample, the voice characteristics of the voice sample can accurately represent pronunciation performance of a provider of the voice sample, and the pronunciation defect recognition model trained based on the voice characteristics of the voice sample and the defect label has good pronunciation defect recognition performance, so that the pronunciation defect can be accurately recognized by the trained pronunciation defect recognition model subsequently.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart of a pronunciation defect recognition model training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the Mandarin initials and finals provided in the present application;
FIG. 3 is a schematic diagram of the pronunciation attributes of Mandarin initials provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the pronunciation attributes of Mandarin single finals according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a first model provided in an embodiment of the present application;
FIG. 6 is a schematic workflow diagram of a first model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a second model provided in an embodiment of the present application;
FIG. 8 is a schematic workflow diagram of a second model provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of another second model provided in an embodiment of the present application;
FIG. 10 is a flowchart of a pronunciation defect recognition method provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a pronunciation defect recognition model training apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a pronunciation defect recognition apparatus according to an embodiment of the present application.
Detailed Description
In research on pronunciation defect recognition, the inventors found that, in the related art, a prediction model can be trained using the acoustic features of training speech data and the defect labels of that data, and the trained prediction model can then be used to recognize pronunciation defects. However, because acoustic features alone cannot fully characterize the pronunciation information carried by the training speech data (i.e., the pronunciation performance of the provider of that data), a prediction model trained only on acoustic features has poor pronunciation defect recognition performance.
In order to solve the technical problems in the background art and the drawbacks of the related art, an embodiment of the present application provides a pronunciation defect recognition model training method, including: acquiring a voice sample and a defect label of the voice sample; generating a voice characteristic of the voice sample according to the voice sample; and training a pronunciation defect recognition model according to the voice characteristics of the voice sample and the defect label of the voice sample. The defect label of the voice sample is used for indicating whether the provider of the voice sample has pronunciation defect; the speech features include acoustic features, phoneme features, and pronunciation attribute features.
Because the speech features of the voice sample comprise its acoustic features, phoneme features, and pronunciation attribute features, they can accurately characterize the pronunciation information carried by the voice sample, and hence the pronunciation performance of the provider of the voice sample. A pronunciation defect recognition model trained on these speech features and defect labels therefore has good pronunciation defect recognition performance, so that the trained model can subsequently recognize pronunciation defects accurately.
In addition, the embodiment of the present application does not limit the execution subject of the pronunciation defect recognition model training method, and for example, the pronunciation defect recognition model training method provided by the embodiment of the present application can be applied to a data processing device such as a terminal device or a server. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like. The server may be a stand-alone server, a cluster server, or a cloud server.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Method embodiment one
Referring to fig. 1, the figure is a flowchart of a pronunciation defect recognition model training method according to an embodiment of the present application.
The pronunciation defect recognition model training method provided by the embodiment of the application comprises the following steps S11 to S13:
S11: Acquire a voice sample and the defect label of the voice sample.
The voice sample refers to voice data required for training the pronunciation defect recognition model; the number of voice samples is not limited in the embodiments of the present application.
In addition, the embodiment of the application does not limit how the voice sample is obtained; for example, the voice sample can be obtained by recording a user to be collected reading a prepared corpus. The user to be collected is the provider of the voice sample. The corpus to be recorded can be preset according to the application scenario, and may consist of single characters (e.g., person, hot, bed, pot), words (e.g., panda, scissors, peacock, tiger), restricted sentences (e.g., "this is a girl who likes dancing", "the puppy goes outside to play"), and unrestricted continuous speech (e.g., answers to questions about work, hobbies, etc.).
In fact, providers of voice samples may express themselves in different languages. Therefore, to improve the recognition performance of the pronunciation defect recognition model, the embodiment of the application can train it using voice samples in multiple languages.
In addition, the languages of the voice samples can be determined according to the application scenario of the pronunciation defect recognition model. In one possible embodiment, if the model is used for pronunciation defect recognition of speech data in M languages, the T voice samples used for training may include at least one voice sample in each of the 1st through M-th languages, where M and T are positive integers. The embodiment of the present application does not limit the 1st through M-th languages; for example, the 1st language may be Mandarin, and the 2nd through M-th languages may be different Chinese local dialects (e.g., the Northern, Wu, Xiang, Gan, Hakka, Min, and Yue dialect groups).
The defect label of the voice sample indicates whether the provider of the voice sample has a pronunciation defect. The embodiment of the present application does not limit how the defect label is obtained; for example, the voice sample may be labeled manually by a domain expert, or by any existing or future method that can label pronunciation defects of voice samples.
Based on the above content of S11, T voice samples and their defect labels may be obtained according to the application scenario of the pronunciation defect recognition model (for example, when the application scenario involves both the target language and the auxiliary language described below, voice samples and their defect labels in the target language and in the auxiliary language may be obtained), so that the languages of the T voice samples cover all languages involved in the application scenario, and the model trained on these T voice samples and their defect labels can accurately recognize pronunciation defects in speech data of each language.
S12: and generating the voice characteristics of the voice sample according to the voice sample.
The speech features of the voice sample are used to characterize the pronunciation information carried by the voice sample, and may include the acoustic features, phoneme features, and pronunciation attribute features of the voice sample.
The acoustic features of a speech sample are used to characterize the acoustic information carried by the sample. The embodiment of the present application does not limit the acoustic features; they may include at least one conventional acoustic feature, for example Mel-Frequency Cepstral Coefficients (MFCCs), filter-bank features (FBank), and Power-Normalized Cepstral Coefficients (PNCC).
The phoneme characteristics of the speech sample are used to characterize the phoneme information carried by the speech sample.
A phoneme is a minimal speech unit divided according to the natural attributes of speech, and phonemes differ across languages. For example, Mandarin includes 23 initials and 24 finals (as shown in FIG. 2), the 24 finals comprising 6 single finals, 9 compound finals, 5 front nasal finals, and 4 back nasal finals. Since the 9 compound finals, 5 front nasal finals, and 4 back nasal finals can all be obtained by combining single finals (and, for the nasal finals, nasal codas), the phonemes of Mandarin can be reduced to the 23 initials and 6 single finals.
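For illustration only (this decomposition table is an assumption for demonstration, not content of the patent), the following Python sketch shows how compound and nasal finals decompose into single finals plus codas, which is why the Mandarin phoneme set can be reduced as described above:

```python
# Illustrative sketch (not from the patent): Mandarin finals decompose into
# single finals plus optional nasal codas, so the phoneme inventory can be
# reduced to 23 initials + 6 single finals. Entries below are a small,
# assumed subset of the full decomposition table.
SINGLE_FINALS = ["a", "o", "e", "i", "u", "v"]  # "v" stands for ü

FINAL_DECOMPOSITION = {
    "ai":  ["a", "i"],    # compound final = two single finals
    "ao":  ["a", "o"],
    "an":  ["a", "n"],    # front nasal final = single final + coda "n"
    "en":  ["e", "n"],
    "ang": ["a", "ng"],   # back nasal final = single final + coda "ng"
    "eng": ["e", "ng"],
}

def decompose(final: str) -> list[str]:
    """Return the single-final/coda sequence for a final."""
    return FINAL_DECOMPOSITION.get(final, [final])
```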
In addition, the embodiment of the present application does not limit the phoneme characteristics of the speech sample, and for the convenience of understanding, the following description is made with reference to two examples.
Example 1, the phoneme characteristics of the speech sample may include phonemes of the speech sample. Wherein, the phoneme of the voice sample refers to the phoneme composing the voice sample; moreover, the embodiment of the present application does not limit the process of obtaining the phonemes of the speech sample. For example, if the language of the speech sample is the target language, the phonemes of the speech sample may be obtained by using the following first model; if the language of the speech sample is the auxiliary language, the phonemes of the speech sample may be obtained using the following second model.
Example 2, the phoneme feature of the speech sample may be input data of an output layer in a phoneme recognition model, and the output data of the output layer in the phoneme recognition model is a phoneme of the speech sample. The phoneme recognition model is used for carrying out phoneme recognition on the voice data; the embodiment of the present application does not limit the phoneme recognition model, for example, if the language of the speech sample is the target language, the phoneme recognition model may be the following first model; if the language of the speech sample is the auxiliary language, the phoneme recognition model may be the following second model.
The pronunciation attribute characteristics of the voice sample are used for representing pronunciation attribute information carried by the voice sample.
Pronunciation attributes include the manner of articulation (pronunciation mode) and the place of articulation (pronunciation part), and different phonemes have different pronunciation attributes. For example, in Mandarin, since the initial "y" has the same pronunciation attributes as the single final "i", and the initial "w" the same as the single final "u", Mandarin may be described by 27 pronunciation attribute classes: the pronunciation attributes of the 21 initials shown in FIG. 3 and those of the 6 single finals shown in FIG. 4.
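As an illustration of pronunciation attributes, the sketch below maps each phoneme to a (place of articulation, manner of articulation) pair; the entries follow standard Mandarin phonetics, but the exact attribute labels of the tables in FIG. 3 and FIG. 4 are assumptions here:

```python
# Illustrative sketch (attribute labels assumed, not taken from FIG. 3/4):
# each phoneme maps to (place of articulation, manner of articulation).
PRONUNCIATION_ATTRIBUTES = {
    "b": ("bilabial",    "unaspirated stop"),
    "p": ("bilabial",    "aspirated stop"),
    "m": ("bilabial",    "nasal"),
    "f": ("labiodental", "fricative"),
    "d": ("alveolar",    "unaspirated stop"),
    # ... remaining initials omitted for brevity ...
    "a": ("open vowel",  "unrounded"),
    "i": ("front vowel", "unrounded"),
    "u": ("back vowel",  "rounded"),
}

def attributes_of(phoneme: str) -> tuple[str, str]:
    """Look up the (place, manner) attribute pair of a phoneme."""
    return PRONUNCIATION_ATTRIBUTES[phoneme]
```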
In addition, the embodiment of the present application does not limit the pronunciation attribute characteristics of the voice sample, and for the convenience of understanding, the following description is made with reference to two examples.
Example 1, the pronunciation attribute characteristics of the speech sample may include pronunciation attributes of the speech sample. The pronunciation attribute of the voice sample refers to the pronunciation attribute of the phoneme composing the voice sample; the embodiment of the present application does not limit the process of acquiring the pronunciation attribute of the voice sample. For example, if the language of the voice sample is the target language, the pronunciation attribute of the voice sample may be obtained by using the following first model; if the language of the voice sample is the auxiliary language, the pronunciation attribute of the voice sample can be obtained by using the following second model.
Example 2, the pronunciation attribute feature of the speech sample may be input data of an output layer in a pronunciation attribute recognition model, and the output data of the output layer in the pronunciation attribute recognition model is the pronunciation attribute of the speech sample. The pronunciation attribute recognition model is used for recognizing pronunciation attributes of voice data; moreover, the embodiment of the present application does not limit the pronunciation attribute recognition model, for example, if the language of the voice sample is the target language, the pronunciation attribute recognition model may be the following first model; if the language of the speech sample is the auxiliary language, the pronunciation attribute recognition model may be a second model as follows.
In addition, the embodiments of the present application do not limit the implementation of S12. For example, in a possible implementation, S12 may specifically include S121 to S123:
s121: acoustic features of the speech sample are extracted from the speech sample.
The embodiment of the present application is not limited to the implementation of S121, and may be implemented by any existing or future method for extracting acoustic features from voice data. For example, in one possible implementation, S121 may specifically be: and extracting the acoustic features of the voice sample from the voice sample according to a preset acoustic index. The preset acoustic index may be preset, and the embodiment of the present application does not limit the preset acoustic index (for example, the preset acoustic index may include at least one of FBank, MFCCs, and PNCC).
In some cases, the voice sample may include background noise and invalid silence segments. To improve the accuracy of the acoustic features, the embodiment of the present application further provides a possible implementation of S121, which may include S1211 and S1212:
S1211: Preprocess the voice sample to obtain a preprocessed voice sample.
Preprocessing refers to a preset speech processing procedure, and the embodiment of the present application does not limit it. For example, the preprocessing may include at least one of a voice activity detection process, a high-pass filtering noise reduction process, a speech enhancement process, and a speech normalization process. The voice activity detection process accurately locates the start and end points of the speech signal; the high-pass filtering noise reduction process reduces the background noise of the speech signal; the speech enhancement process extracts speech that is as clean as possible from a noisy speech signal.
Based on the content of S1211, after the voice sample is obtained, it may be preprocessed to obtain a preprocessed voice sample. For example, when the preprocessing includes all four processes above, voice activity detection may first be performed on the voice sample to obtain valid voice data containing no invalid silence segments; high-pass filtering noise reduction is then applied to the valid voice data to obtain low-noise voice data; speech enhancement is applied to the low-noise voice data to obtain useful voice data; and finally speech normalization is applied to the useful voice data to obtain a preprocessed voice sample that contains almost no background noise or invalid silence segments.
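A minimal sketch of this preprocessing chain is given below; speech enhancement is omitted because it is model-specific, and the frame length, energy threshold, and cutoff frequency are illustrative assumptions, not values from the patent:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def energy_vad(wav: np.ndarray, sr: int, frame_ms: int = 25,
               threshold: float = 1e-4) -> np.ndarray:
    """Keep frames whose mean energy exceeds a threshold (a crude VAD)."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(wav) // frame_len
    frames = wav[: n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = (frames ** 2).mean(axis=1) > threshold
    return frames[voiced].reshape(-1)

def highpass(wav: np.ndarray, sr: int, cutoff_hz: float = 80.0) -> np.ndarray:
    """4th-order Butterworth high-pass filter to attenuate low-frequency noise."""
    sos = butter(4, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, wav)

def preprocess(wav: np.ndarray, sr: int) -> np.ndarray:
    wav = energy_vad(wav, sr)                # drop invalid silence segments
    wav = highpass(wav, sr)                  # reduce background noise
    wav = wav / (np.abs(wav).max() + 1e-9)   # amplitude normalization
    return wav
```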
S1212: and determining the acoustic characteristics of the voice sample according to the preprocessed voice sample.
In the embodiment of the application, after the preprocessed voice sample is obtained, the acoustic features of the voice sample can be extracted directly from it. Because the preprocessed voice sample contains almost no background noise or invalid silence segments, the acoustic features determined from it are more accurate, which in turn improves the recognition accuracy of the pronunciation defect recognition model.
Based on the content of S121, in this embodiment of the application, after the voice sample is obtained, it may first be preprocessed to obtain a preprocessed voice sample; the acoustic features of the voice sample are then extracted from the preprocessed voice sample according to preset acoustic indexes, so that they accurately characterize the acoustic information carried by the voice sample. Specifically, when the preset acoustic indexes include FBank, MFCCs, and PNCC, the FBank, MFCCs, and PNCC values of the voice sample are determined from the preprocessed voice sample, and the acoustic features of the voice sample are generated from these values, so that the acoustic features comprise the FBank, MFCCs, and PNCC values of the voice sample.
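The following sketch shows one way to extract such acoustic features with librosa; librosa provides MFCCs and log-Mel filter-bank (FBank) features, while PNCC has no standard librosa implementation and is therefore omitted here (the feature dimensions are illustrative assumptions):

```python
import librosa
import numpy as np

def acoustic_features(path: str, sr: int = 16000) -> np.ndarray:
    """Extract frame-level MFCC + FBank features from an audio file."""
    wav, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)          # (13, T)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=40)
    fbank = librosa.power_to_db(mel)                              # (40, T)
    # Concatenate per-frame features; a real system would append PNCC too.
    return np.concatenate([mfcc, fbank], axis=0).T                # (T, 53)
```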
S122: and determining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample.
The embodiment of the application does not limit how the phoneme features and the pronunciation attribute features of the voice sample are acquired; they can be acquired simultaneously or sequentially. For ease of understanding, the following description takes simultaneous acquisition as an example.
As an example, S122 may specifically be: and obtaining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample and the pre-trained multi-task learning model.
The multi-task learning model is used for a phoneme recognition process and a pronunciation attribute recognition process, and is trained in a multi-task learning mode.
In addition, because the phonemes and pronunciation attributes in different languages are different, different multi-task learning models can be trained for different languages, so that each trained multi-task learning model can be specially used for the phoneme recognition process and the pronunciation attribute recognition process in at least one language. Therefore, the speech samples in different languages can adopt different multi-task learning models to perform phoneme recognition and pronunciation attribute recognition. Based on this, the present application provides a possible implementation manner of S122, which may specifically include S1221 to S1222:
s1221: and when the language of the voice sample is the target language, obtaining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample and the pre-trained first model.
The target language is not limited in the present application; for example, the target language may be a widely used language (e.g., Mandarin).
The first model is used for performing a phoneme recognition process and a pronunciation attribute recognition process under a target language; the embodiment of the present application does not limit the model structure of the first model. For example, as shown in fig. 5, the first model may include a first shared parameter layer, a target language phoneme recognition layer, and a target language pronunciation attribute recognition layer; and the input data of the target language phoneme recognition layer and the input data of the target language pronunciation attribute recognition layer are both output data of the first shared parameter layer.
The first shared parameter layer is used for extracting features of input data (such as acoustic features of a voice sample in a target language, acoustic features of a test voice in the following target language and the like) of the first model; furthermore, the first shared parameter layer is not limited in the embodiments of the present application, for example, the first shared parameter layer may include at least one of a Convolutional Neural Network (CNN), a fully-connected layer, and a Recurrent Neural Network (RNN).
The target language phoneme recognition layer is used for carrying out phoneme recognition according to the output data of the first shared parameter layer; furthermore, the embodiment of the present application is not limited to the target language phoneme recognition layer, for example, the target language phoneme recognition layer may be a multi-classifier. In addition, the target language phoneme recognition layer may include a first sub-output layer (as shown in fig. 6); and the first sub-output layer is used for outputting the phoneme recognition result, so that the phoneme recognition result is used for representing the phoneme carried by the input data of the first model. It should be noted that the first sub output layer is not limited in the embodiments of the present application, for example, the first sub output layer may be a full connection layer.
The target language pronunciation attribute recognition layer is used for performing pronunciation attribute recognition according to the output data of the first shared parameter layer; the embodiment of the present application does not limit the target language pronunciation attribute recognition layer; for example, it may be a multi-classifier. In addition, the target language pronunciation attribute recognition layer may include a second sub-output layer (as shown in FIG. 6), and the second sub-output layer is used for outputting the pronunciation attribute recognition result, which characterizes the pronunciation attributes carried by the input data of the first model. The second sub-output layer is not limited in the embodiments of the present application; for example, it may be a fully connected layer.
Based on the above-mentioned related contents of the model structure of the first model, the first model may perform phoneme recognition and pronunciation attribute recognition on the input data (e.g., the acoustic features of the speech sample in the target language, the acoustic features of the test speech in the following target language, etc.) of the first model according to the data processing procedure shown in fig. 6, so as to obtain and output the predicted phoneme and the predicted pronunciation attribute of the input data of the first model.
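To make the structure concrete, the following PyTorch sketch implements a first model of this shape; the choice of a bidirectional GRU as the first shared parameter layer and all layer sizes are assumptions for illustration, and each recognition layer is reduced to a single fully connected sub-output layer:

```python
import torch
import torch.nn as nn

class FirstModel(nn.Module):
    def __init__(self, feat_dim: int, n_phonemes: int, n_attributes: int,
                 hidden: int = 256):
        super().__init__()
        # first shared parameter layer (here: a recurrent encoder)
        self.shared = nn.GRU(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # target-language phoneme recognition layer (first sub-output layer)
        self.phoneme_head = nn.Linear(2 * hidden, n_phonemes)
        # target-language pronunciation attribute recognition layer
        # (second sub-output layer)
        self.attribute_head = nn.Linear(2 * hidden, n_attributes)

    def forward(self, acoustic_feats: torch.Tensor):
        shared_out, _ = self.shared(acoustic_feats)   # (B, T, 2*hidden)
        phoneme_logits = self.phoneme_head(shared_out)
        attribute_logits = self.attribute_head(shared_out)
        return phoneme_logits, attribute_logits, shared_out
```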
In addition, the first model can be trained in a multitask learning mode; and the first model can be obtained by training according to the acoustic features of the training speech of the target language, the actual phonemes of the training speech of the target language and the actual pronunciation attributes of the training speech of the target language.
The target language training voice is voice data used for training the first model in the target language, and the acoustic features of the target language training voice can serve as input data of the first model in the training process of the first model. In addition, the number of the target language training voices is not limited in the embodiment of the application. In addition, the embodiment of the present application also does not limit the obtaining manner of the training speech of the target language, and for example, any implementation manner of obtaining the speech sample may be adopted for implementation.
The actual phonemes of the target language training speech are the phonemes actually carried in that speech; during the training of the first model they serve as its label information and thus guide its updates. The embodiment of the present application does not limit how they are obtained; for example, they may be labeled manually by a domain expert according to the phoneme table of the target language.
The actual pronunciation attributes of the target language training speech are the pronunciation attributes actually carried in that speech; during the training of the first model they likewise serve as its label information and guide its updates. They may, for example, be labeled manually by a domain expert according to the pronunciation attribute table of the target language.
The training process of the first model is not limited in the embodiment of the application, and any existing or future multi-task model training method can be adopted for implementation. For example, in one possible implementation, the training process of the first model may include steps 11-13:
step 11: inputting the acoustic features of the target language training voice into a first model to obtain the predicted phonemes of the target language training voice and the predicted pronunciation attributes of the target language training voice, which are output by the first model.
Step 12: judging whether a first stopping condition is reached, if so, ending the training process of the first model; if not, go to step 13.
Wherein the first stop condition may be preset. In addition, the embodiment of the present application is not limited to the first stop condition, for example, the first stop condition may be that a difference between predicted information (for example, a predicted phoneme and a predicted pronunciation attribute) corresponding to the training speech of the target language and actual information (for example, an actual phoneme and an actual pronunciation attribute) corresponding to the training speech of the target language reaches a first difference threshold, a change rate of the predicted information corresponding to the training speech of the target language is lower than a first change threshold, or the number of updates of the first model reaches a first time threshold.
In the embodiment of the application, it can be judged whether the current round of the first model meets the first stop condition. If so, the current first model has high recognition performance, so it can be saved and/or used; if not, its recognition performance is still poor, so the first model is updated according to the difference between the predicted and actual phonemes of the target language training speech and the difference between its predicted and actual pronunciation attributes, so that the updated first model achieves higher recognition performance.
Step 13: and updating the first model according to the predicted phoneme of the target language training voice, the actual phoneme of the target language training voice, the predicted pronunciation attribute of the target language training voice and the actual pronunciation attribute of the target language training voice, and returning to execute the step 11.
It should be noted that, the embodiment of the present application does not limit the updating manner of the first model, and may be implemented by using any existing or future multitask model updating process.
Based on the above-mentioned related contents from step 11 to step 13, the first model may be trained according to the acoustic features, the actual phonemes, and the actual pronunciation attributes of the training speech of the target language, so that the trained first model has a higher recognition performance, thereby enabling the trained first model to be subsequently used to perform the phoneme recognition process and the pronunciation attribute recognition process in the target language.
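A minimal sketch of steps 11 to 13 under the FirstModel sketch above is given below, assuming frame-level phoneme and attribute labels and a hypothetical train_loader; the equal loss weights and the update-count stop condition are illustrative choices (the patent equally allows difference-based and change-rate-based stop conditions):

```python
import torch
import torch.nn.functional as F

# 29 phonemes = 23 initials + 6 single finals; 27 pronunciation attributes.
model = FirstModel(feat_dim=53, n_phonemes=29, n_attributes=27)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
max_updates = 10_000  # first stop condition: number of updates

for step, (feats, phoneme_labels, attr_labels) in enumerate(train_loader):
    # step 11: forward pass -> predicted phonemes and pronunciation attributes
    phoneme_logits, attr_logits, _ = model(feats)
    # step 13: update according to the prediction/label differences
    loss = 0.5 * F.cross_entropy(phoneme_logits.flatten(0, 1),
                                 phoneme_labels.flatten()) \
         + 0.5 * F.cross_entropy(attr_logits.flatten(0, 1),
                                 attr_labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # step 12: first stop condition
    if step + 1 >= max_updates:
        break
```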
In addition, the embodiment of the present application is not limited to the implementation of S1221, for example, in a possible implementation, when the first model includes a target language phoneme recognition layer and a target language pronunciation attribute recognition layer, the target language phoneme recognition layer includes a first sub-output layer, and the target language pronunciation attribute recognition layer includes a second sub-output layer, S1221 may specifically be: if the language of the voice sample is the target language, firstly inputting the acoustic features of the voice sample into a first model to obtain first input data of a first sub-output layer and first input data of a second sub-output layer; and determining the first input data of the first sub-output layer as the phoneme characteristics of the voice sample, and determining the first input data of the second sub-output layer as the pronunciation attribute characteristics of the voice sample.
The first input data of the first sub-output layer refers to data input into the first sub-output layer when the first model performs phoneme recognition and pronunciation attribute recognition according to the acoustic features of the speech sample. Similarly, the first input data of the second sub-output layer refers to data input into the second sub-output layer when the first model performs phoneme recognition and pronunciation attribute recognition according to the acoustic features of the speech sample.
It can be seen that, for a speech sample in a target language (that is, the language of the speech sample is the target language), the acoustic features of the speech sample may be directly input to the first model, so that the first model performs phoneme recognition and pronunciation attribute recognition according to the acoustic features of the speech sample, so as to determine the data input to the first sub-output layer in the first model as the phoneme features of the speech sample, and determine the data input to the second sub-output layer in the first model as the pronunciation attribute features of the speech sample.
Based on the above-mentioned related contents of S1221, if the language of the speech sample is the target language, the first model can perform phoneme recognition and pronunciation attribute recognition according to the acoustic features of the speech sample, so as to obtain the phoneme features of the speech sample and the pronunciation attribute features of the speech sample.
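Under the FirstModel sketch above, the data entering each sub-output layer can be captured with forward pre-hooks, mirroring this description of S1221 (sample_feats is a hypothetical (1, T, feat_dim) tensor of acoustic features):

```python
import torch

# Capture the data fed into each sub-output layer, i.e. the phoneme feature
# and pronunciation attribute feature of the sample per S1221.
captured = {}

def grab(name):
    def hook(module, inputs):        # forward pre-hook: inputs is a tuple
        captured[name] = inputs[0].detach()
    return hook

model.phoneme_head.register_forward_pre_hook(grab("phoneme"))
model.attribute_head.register_forward_pre_hook(grab("attribute"))

with torch.no_grad():
    model(sample_feats)

phoneme_feature = captured["phoneme"]      # first input data of the first sub-output layer
attribute_feature = captured["attribute"]  # first input data of the second sub-output layer
```

Note that in this simplified sketch both heads are single linear layers, so the two captured tensors coincide with the shared encoder output; with deeper recognition layers they would differ.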
S1222: and when the language of the voice sample is the auxiliary language, obtaining the phoneme characteristic of the voice sample and the pronunciation attribute characteristic of the voice sample according to the acoustic characteristic of the voice sample and a pre-trained second model.
The embodiment of the present application does not limit the auxiliary language; for example, the auxiliary language may be a less widely used language. The auxiliary language may also be related to the target language; for example, if the target language is Mandarin, the auxiliary language may be any Chinese local dialect. In addition, the number of auxiliary languages is not limited in the embodiments of the present application.
The second model is used for performing the phoneme recognition process and the pronunciation attribute recognition process in the auxiliary language. The second model is not limited in the embodiments of the present application; for example, it may be used only for phoneme recognition and pronunciation attribute recognition in the auxiliary language. Alternatively, since speech data in an auxiliary language is scarce, the second model may, to improve its recognition accuracy, be used for phoneme recognition and pronunciation attribute recognition in N languages that include the auxiliary language, where N is a positive integer. The N languages may be preset according to the application scenario; for example, when the target language is Mandarin, the N languages may be N Chinese local dialects.
In addition, the embodiment of the present application does not limit the model structure of the second model. For example, when the second model is used to perform a phoneme recognition process in N languages and a pronunciation attribute recognition process, as shown in fig. 7, the second model may include a second shared parameter layer, an N-language phoneme recognition layer, and an N-language pronunciation attribute recognition layer; and the input data of the N language phoneme recognition layers and the input data of the N language pronunciation attribute recognition layers are output data of the second shared parameter layer.
The second shared parameter layer is used for performing feature extraction on input data (such as acoustic features of a voice sample in an auxiliary language, acoustic features of a test voice in the following auxiliary language, and the like) of the second model; also, the second shared parameter layer is not limited in the embodiments of the present application, and for example, the second shared parameter layer may include at least one of a CNN, a fully-connected layer, and an RNN. In addition, the embodiment of the present application does not limit the initialization layer parameter of the second shared parameter layer, and for example, in order to improve the training efficiency of the second model, the layer parameter of the first shared parameter layer in the trained first model may be determined as the initialization layer parameter of the second shared parameter layer.
The i-th language phoneme recognition layer is used for performing phoneme recognition in the i-th language. The embodiment of the present application does not limit the i-th language phoneme recognition layer; for example, it may be a multi-classifier. In addition, the i-th language phoneme recognition layer may include a P_i-th sub-output layer (as shown in FIG. 8); the P_i-th sub-output layer is used for outputting the phoneme recognition result in the i-th language, so that the result characterizes the phonemes carried by the input data of the second model when that input data is in the i-th language. The embodiment of the present application does not limit the P_i-th sub-output layer; for example, it may be a fully connected layer. Here i is a positive integer and i ≤ N.
The i-th language pronunciation attribute recognition layer is used for performing pronunciation attribute recognition in the i-th language. The embodiment of the present application does not limit the i-th language pronunciation attribute recognition layer; for example, it may be a multi-classifier. In addition, the i-th language pronunciation attribute recognition layer may include an A_i-th sub-output layer (as shown in FIG. 8); the A_i-th sub-output layer is used for outputting the pronunciation attribute recognition result in the i-th language, so that the result characterizes the pronunciation attributes carried by the input data of the second model when that input data is in the i-th language. The embodiment of the present application does not limit the A_i-th sub-output layer; for example, it may be a fully connected layer. Here i is a positive integer and i ≤ N.
Based on the above-mentioned related contents of the model structure of the second model, the second model may be a multi-language recognition model, so that the second model can be used for performing phoneme recognition and pronunciation attribute recognition in multiple languages. As shown in fig. 8, in the second model, if the language of the input data of the second model is the ith language, the i-th language phoneme recognition layer in the second model performs phoneme recognition on the input data of the second model to obtain and output a predicted phoneme of the input data of the second model, and the i-th language pronunciation attribute recognition layer in the second model performs pronunciation attribute recognition on the input data of the second model to obtain and output a predicted pronunciation attribute of the input data of the second model.
It can be seen that, when the second model performs the phoneme recognition process and the pronunciation attribute recognition process in N languages, it must select a phoneme recognition layer and a pronunciation attribute recognition layer according to the language of its input data, so the language of the input data is needed when performing phoneme recognition and pronunciation attribute recognition with the second model. To meet this requirement, the embodiments of the present application provide the following two schemes.
The first scheme is as follows: as shown in fig. 9, a language identification layer may be added to the second model so that the language identification layer can perform language identification according to the output data of the second shared parameter layer, and obtain and output a predicted language of the input data of the second model, so that the second model can select a phoneme identification layer and a pronunciation attribute identification layer based on the predicted language. Therefore, the language of the input data of the second model can be obtained by adding a language identification layer to the second model (i.e., by adding a learning task of the second model).
Scheme II: the language identification may be directly input to the second model so that the input data of the second model may include the speech data and the language identification of the speech data, thereby enabling the second model to select the phoneme recognition layer and the pronunciation attribute recognition layer based on the language identification carried by the input data. Therefore, the language of the input data of the second model can be obtained by directly inputting the language identification to the second model in the embodiment of the application.
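The following sketch implements a second model along the lines of scheme II: one shared encoder plus N language-specific head pairs, with the pair selected by a language identifier supplied alongside the input (all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SecondModel(nn.Module):
    def __init__(self, feat_dim: int, phoneme_counts: list[int],
                 attribute_counts: list[int], hidden: int = 256):
        super().__init__()
        # second shared parameter layer
        self.shared = nn.GRU(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # i-th language phoneme recognition layer (P_i-th sub-output layer)
        self.phoneme_heads = nn.ModuleList(
            [nn.Linear(2 * hidden, n) for n in phoneme_counts])
        # i-th language pronunciation attribute recognition layer
        # (A_i-th sub-output layer)
        self.attribute_heads = nn.ModuleList(
            [nn.Linear(2 * hidden, n) for n in attribute_counts])

    def forward(self, acoustic_feats: torch.Tensor, language_id: int):
        shared_out, _ = self.shared(acoustic_feats)
        phoneme_logits = self.phoneme_heads[language_id](shared_out)
        attribute_logits = self.attribute_heads[language_id](shared_out)
        return phoneme_logits, attribute_logits, shared_out
```

Consistent with the initialization note above, the shared layer of this sketch could be initialized from the trained first model, e.g. second_model.shared.load_state_dict(first_model.shared.state_dict()), assuming matching layer sizes.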
It should be noted that, in the embodiment of the present application, the "language identification of the voice data" in the second scheme is not limited, for example, the "language identification of the voice data" in the second scheme may refer to an actual language identification of the voice data. For another example, the obtaining process of the "language identification of the voice data" in the second scheme may be: inputting the voice data into a pre-trained language identification model to obtain a predicted language identification output by the language identification model, and determining the predicted language identification as the language identification of the voice data. In addition, the language identification model is not limited in the embodiments of the present application, and any existing or future model (e.g., the language classification model below) that can implement language identification may be used for implementation.
In addition, the second model can be trained in a multitask learning mode; and when the second model is applied to phoneme recognition and pronunciation attribute recognition under N languages, the second model can be obtained by training according to the acoustic features of the N language training voices, the actual phonemes of the N language training voices and the actual pronunciation attributes of the N language training voices.
The ith language training voice is voice data used for training the second model in the ith language; the acoustic feature of the ith language training voice is used for representing acoustic information carried in the ith language training voice; the actual phoneme of the ith language training voice refers to the actual phoneme carried in the ith language training voice; the actual pronunciation attribute of the ith language training voice refers to the actual pronunciation attribute carried in the ith language training voice. Wherein i is a positive integer, and i is not more than N.
It should be noted that the embodiment of the present application does not limit how the i-th language training speech is obtained; for example, it may be obtained by any of the above embodiments for obtaining a voice sample. Likewise, the embodiment does not limit how the actual phonemes and actual pronunciation attributes of the i-th language training speech are obtained; for example, a domain expert may manually label the phonemes and pronunciation attributes of the i-th language training speech according to the phoneme table and pronunciation attribute table of the i-th language.
In addition, during the training of the second model, the acoustic features of the N-language training speech serve as its input data, while the actual phonemes and the actual pronunciation attributes of the N-language training speech serve as its label information and thus guide its updates.
It should be noted that, in the embodiment of the present application, the training process of the second model is not limited, and any multitask model training method may be used for implementation.
Based on the related content of the training process of the second model, when the second model is applied to phoneme recognition and pronunciation attribute recognition in N languages, if the N languages include the auxiliary language, it may be determined that the second model is trained according to the acoustic features of the auxiliary language training speech, the actual phonemes of the auxiliary language training speech, and the actual pronunciation attributes of the auxiliary language training speech, so that the trained second model can perform phoneme recognition and pronunciation attribute recognition on the speech data in the auxiliary language (e.g., the speech sample in the auxiliary language or the following auxiliary language test speech). The language-assisted training speech is speech data used for training the second model in the language-assisted training.
In addition, the embodiment of the present application is not limited to the implementation of S1222, for example, in a possible implementation, when the second model includes N language phoneme recognition layers and N language pronunciation attribute recognition layers, the ith language phoneme recognition layer includes the pthiThe ith language pronunciation attribute recognition layer comprises an AiIn the sub-output layers, i is a positive integer, i is not greater than N, and when N is a positive integer, S1222 may specifically be: if the language of the voice sample is the ith language (i.e. the auxiliary language is the ith language), the acoustic features of the voice sample and the language identification of the voice sample are input into the second model to obtain the pthiInput data and A-th of sub-output layeriInput data of the sub-output layer; then will PiDetermining the input data of the sub-output layer as the phoneme characteristics of the voice sample, and adding AiAnd determining the input data of the sub-output layer as the pronunciation attribute characteristics of the voice sample. Wherein the language identifier is used to uniquely identify a language.
It can be seen that, for a voice sample in the ith language (that is, the language of the voice sample is the ith language), after the acoustic features of the voice sample are obtained, the acoustic features of the voice sample and the language identifier of the voice sample may be directly input into the second model, so that the second model performs phoneme recognition and pronunciation attribute recognition according to them; the input data of the P_i-th sub-output layer and the input data of the A_i-th sub-output layer in the second model are then determined as the phoneme features of the voice sample and the pronunciation attribute features of the voice sample, respectively.
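For illustration only, the sketch below shows one assumed realization of this feature extraction: each per-language recognition layer is a small trunk followed by a final linear sub-output layer, and the tensor entering that final layer is returned as the feature. All module names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class LangHead(nn.Module):
    """Per-language recognition layer; the final Linear plays the sub-output role."""
    def __init__(self, hidden=256, n_out=60):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.sub_output = nn.Linear(hidden, n_out)

    def forward(self, x):
        h = self.trunk(x)       # h is the "input data of the sub-output layer"
        return self.sub_output(h), h

encoder = nn.GRU(40, 256, batch_first=True)
phoneme_heads = nn.ModuleList([LangHead(n_out=60) for _ in range(3)])
attr_heads = nn.ModuleList([LangHead(n_out=20) for _ in range(3)])

def extract_features(feats, lang_id):
    enc, _ = encoder(feats)
    _, phoneme_feat = phoneme_heads[lang_id](enc)   # feature, not the prediction
    _, attr_feat = attr_heads[lang_id](enc)
    return phoneme_feat, attr_feat

phoneme_feat, attr_feat = extract_features(torch.randn(1, 100, 40), lang_id=1)
print(phoneme_feat.shape, attr_feat.shape)   # torch.Size([1, 100, 256]) each
```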
Based on the relevant contents of the above S1221 to S1222, after the voice sample is obtained, the language of the voice sample may be determined; and determining the phoneme characteristics and pronunciation attribute characteristics of the voice sample according to the language of the voice sample and the acoustic characteristics of the voice sample, wherein the determination process specifically comprises the following steps: when the language of the voice sample is the target language, obtaining the phoneme characteristics and pronunciation attribute characteristics of the voice sample by the first model according to the acoustic characteristics of the voice sample; when the language of the speech sample is an auxiliary language (e.g., the ith language), and the N languages corresponding to the second model include the auxiliary language, the second model obtains the phoneme characteristics and pronunciation attribute characteristics of the speech sample according to the acoustic characteristics and language identification of the speech sample.
It should be noted that, the language of the speech sample is not limited in the embodiments of the present application, for example, the language of the speech sample may refer to the actual language of the speech sample. For another example, the obtaining process of the language of the voice sample may be: inputting the voice sample into a pre-trained language identification model to obtain a predicted language identification output by the language identification model, and determining the predicted language identification as the language of the voice sample. The relevant content of the language identification model is referred to above.
S123: and determining the voice characteristics of the voice sample according to the acoustic characteristics of the voice sample, the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample.
In this embodiment of the application, after the acoustic feature, the phoneme feature and the pronunciation attribute feature of the voice sample are obtained, the voice feature of the voice sample may be generated according to the acoustic feature of the voice sample, the phoneme feature of the voice sample and the pronunciation attribute feature of the voice sample, so that the voice feature of the voice sample can represent the acoustic information, the phoneme information and the pronunciation attribute information carried by the voice sample.
The embodiment of the present application is not limited to the implementation of S123; for example, the set of the acoustic features of the voice sample, the phoneme features of the voice sample, and the pronunciation attribute features of the voice sample may be determined as the voice features of the voice sample.
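As one concrete (and assumed) reading of this implementation, the three features can simply be concatenated frame-wise; the dimensions below are illustrative.

```python
import torch

acoustic_feat = torch.randn(100, 40)   # (frames, acoustic dim), illustrative
phoneme_feat = torch.randn(100, 256)   # inputs to the phoneme sub-output layer
attr_feat = torch.randn(100, 256)      # inputs to the attribute sub-output layer

# The "set" of the three features, realized here as a concatenation.
speech_feat = torch.cat([acoustic_feat, phoneme_feat, attr_feat], dim=-1)
print(speech_feat.shape)               # torch.Size([100, 552])
```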
Based on the relevant content in S121 to S123, after the voice sample is obtained, the acoustic feature of the voice sample may be determined according to the voice sample, so that the acoustic feature can represent the acoustic information carried in the voice sample; determining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample so that the phoneme characteristics of the voice sample can represent the phoneme information carried in the voice sample and the pronunciation attribute characteristics of the voice sample can represent the pronunciation attribute information carried in the voice sample; and finally, determining the voice characteristics of the voice sample according to the acoustic characteristics, the phoneme characteristics and the pronunciation attribute characteristics of the voice sample, so that the voice characteristics of the voice sample can represent the acoustic information, the phoneme information and the pronunciation attribute information carried by the voice sample, and the voice characteristics of the voice sample can more comprehensively represent the pronunciation information carried by the voice sample.
S13: and training a pronunciation defect recognition model according to the voice characteristics of the voice sample and the defect label of the voice sample.
In the embodiment of the application, after the voice features and the defect label of the voice sample are obtained, they can be used for training the pronunciation defect recognition model, so that the trained pronunciation defect recognition model has a high pronunciation defect recognition performance. Wherein, the voice features of the voice sample serve as the input data of the pronunciation defect recognition model in the training process; and the defect label of the voice sample serves as the label information in the training process, so that the defect label can guide the updating process of the pronunciation defect recognition model.
It should be noted that the embodiment of the present application is not limited to the implementation of S13, and may be implemented by any existing or future model training method. In addition, the embodiment of the present application is not limited to the model structure of the pronunciation defect recognition model, and may be implemented by using the model structure of any existing or future machine learning model.
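By way of illustration only, the sketch below trains a small binary classifier on utterance-level speech features with the defect labels as supervision; the pooling, architecture and sizes are assumptions rather than the patent's prescribed model.

```python
import torch
import torch.nn as nn

clf = nn.Sequential(nn.Linear(552, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

speech_feats = torch.randn(16, 552)          # pooled speech feature per sample
defect_labels = torch.randint(0, 2, (16,))   # 1 = has defect, 0 = no defect

logits = clf(speech_feats)
loss = ce(logits, defect_labels)             # the defect label guides the update
opt.zero_grad(); loss.backward(); opt.step()
```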
Based on the relevant contents of S11 to S13, in the pronunciation defect recognition model training method provided by the present application, after acquiring a voice sample and a defect label of the voice sample, a voice feature of the voice sample is generated according to the voice sample, so that the voice feature includes an acoustic feature, a phoneme feature and a pronunciation attribute feature; and training a pronunciation defect recognition model according to the voice characteristics of the voice sample and the defect label of the voice sample. Wherein, the defect label of the voice sample is used for indicating whether the provider of the voice sample has pronunciation defect.
Therefore, the voice characteristics of the voice sample comprise the acoustic characteristics of the voice sample, the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample, so that the voice characteristics of the voice sample can accurately represent pronunciation information carried by the voice sample, the voice characteristics of the voice sample can accurately represent pronunciation performance of a provider of the voice sample, and the pronunciation defect recognition model trained based on the voice characteristics of the voice sample and the defect label has good pronunciation defect recognition performance, so that the pronunciation defect can be accurately recognized by the trained pronunciation defect recognition model subsequently.
Based on the related content of the pronunciation defect recognition model training method, the trained pronunciation defect recognition model can be used for pronunciation defect recognition. Based on this, the embodiment of the present application further provides a pronunciation defect recognition method, which is explained and illustrated below with reference to the drawings.
Method embodiment two
Referring to fig. 10, it is a flowchart of a pronunciation defect recognition method provided in the embodiment of the present application.
The pronunciation defect identification method provided by the embodiment of the application comprises the following steps of S101-S104:
S101: And acquiring a target language test voice of the user to be recognized.
The user to be recognized refers to a user for whom it needs to be judged whether a pronunciation defect exists; and the user to be recognized is also the provider of the target language test voice. In addition, the user to be recognized can express himself or herself in the target language.
The target language test voice is voice data which is needed to be used when judging whether the user to be identified has pronunciation defects; and the language of the target language test voice is the target language.
In addition, the embodiment of the present application does not limit the obtaining manner of the target language test voice, for example, the target language test voice may be obtained by reading the corpus to be tested by the user to be recognized using the target language. Wherein, the linguistic data to be tested can be preset; and the corpus to be tested is similar to the "corpus to be recorded" above.
Based on the related content of S101, if it is desired to determine whether the user to be recognized has a pronunciation defect, the user to be recognized may read the preset corpus to be tested using the target language to obtain the target language test voice, so that whether the user to be recognized has a pronunciation defect can subsequently be determined according to the target language test voice.
S102: and generating the voice characteristics of the target language test voice according to the target language test voice.
The voice characteristics of the target language test voice are used for representing pronunciation information carried by the target language test voice.
In addition, the speech feature of the target language test speech may include an acoustic feature of the target language test speech, a phoneme feature of the target language test speech, and a pronunciation attribute feature of the target language test speech. The acoustic characteristics of the target language test voice are used for representing acoustic information carried by the target language test voice; the phoneme characteristics of the target language test voice are used for representing phoneme information carried by the target language test voice; and the pronunciation attribute feature of the target language test voice is used for representing pronunciation attribute information carried by the target language test voice.
It should be noted that, S102 may be implemented by any embodiment of obtaining the speech features of the speech sample in the target language (i.e., the language of the speech sample is the target language) in S12, and only the "speech sample" of any embodiment of obtaining the speech features of the speech sample in the target language in S12 needs to be replaced by the "test speech in the target language". For ease of understanding, the following description is made with reference to examples.
As an example, S102 may specifically include S1021-S1023:
S1021: And extracting the acoustic features of the target language test voice from the target language test voice.
S1021 may be implemented by any of the embodiments of S121, and only the "voice sample" in any of the embodiments of S121 needs to be replaced with the "target language test voice".
S1022: and determining the phoneme characteristics of the target language test voice and the pronunciation attribute characteristics of the target language test voice according to the acoustic characteristics of the target language test voice.
S1022 may be implemented by any embodiment of S1221 above, and only "voice sample" in any embodiment of S1221 above needs to be replaced with "target language test voice". For example, in one possible implementation, S1022 may specifically be: and obtaining the phoneme characteristic of the target language test voice and the pronunciation attribute characteristic of the target language test voice according to the acoustic characteristic of the target language test voice and the pre-trained first model.
The relevant content of the "first model" refers to the relevant content of the "first model" in S1221 above.
In addition, the phoneme characteristics of the target language test speech are not limited in the embodiments of the present application, for example, the phoneme characteristics of the target language test speech may be predicted phonemes obtained and output by the first model through phoneme recognition according to the acoustic characteristics of the target language test speech. For another example, the phoneme characteristics of the target language test speech may also be data input to the first sub-output layer when phoneme recognition is performed by the first model based on the acoustic characteristics of the target language test speech.
In addition, the pronunciation attribute feature of the target language test speech is not limited in the embodiments of the present application, for example, the pronunciation attribute feature of the target language test speech may be a predicted pronunciation attribute that is obtained and output by the first model through pronunciation attribute recognition according to the acoustic feature of the target language test speech. For another example, the pronunciation attribute feature of the target language test speech may also be data input to the second sub-output layer when the first model performs pronunciation attribute recognition based on the acoustic feature of the target language test speech.
To facilitate understanding of the phoneme features and pronunciation attribute features of the target language test voice, the following description is made with reference to an example.
As an example, when the first model includes a target language phoneme recognition layer and a target language pronunciation attribute recognition layer, the target language phoneme recognition layer includes a first sub-output layer, and the target language pronunciation attribute recognition layer includes a second sub-output layer, S1022 may specifically be: inputting the acoustic characteristics of the target language test voice into a first model to obtain second input data of a first sub-output layer and second input data of a second sub-output layer; and determining the second input data of the first sub-output layer as the phoneme characteristics of the target language test voice, and determining the second input data of the second sub-output layer as the pronunciation attribute characteristics of the target language test voice.
The second input data of the first sub-output layer refers to data input into the first sub-output layer when the first model performs phoneme recognition and pronunciation attribute recognition according to the acoustic features of the target language test speech. Similarly, the second input data of the second sub-output layer refers to data input into the second sub-output layer when the first model performs phoneme recognition and pronunciation attribute recognition according to the acoustic features of the target language test speech.
Based on the above related content of S1022, after obtaining the acoustic feature of the target language test speech, the trained first model may perform phoneme recognition and pronunciation attribute recognition according to the acoustic feature of the target language test speech, so as to obtain the phoneme feature of the target language test speech and the pronunciation attribute feature of the target language test speech.
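One hedged way to realize this extraction in practice is to capture the tensors flowing into the two sub-output layers with forward pre-hooks, as sketched below; first_model and its module names are hypothetical stand-ins for the trained first model.

```python
import torch
import torch.nn as nn

first_model = nn.ModuleDict({
    "encoder": nn.Linear(40, 256),
    "first_sub_output": nn.Linear(256, 60),    # phoneme sub-output layer
    "second_sub_output": nn.Linear(256, 20),   # attribute sub-output layer
})

captured = {}
def make_hook(name):
    def hook(module, inputs):
        captured[name] = inputs[0].detach()    # the layer's "second input data"
    return hook

first_model["first_sub_output"].register_forward_pre_hook(make_hook("phoneme"))
first_model["second_sub_output"].register_forward_pre_hook(make_hook("attr"))

h = torch.relu(first_model["encoder"](torch.randn(100, 40)))
first_model["first_sub_output"](h)             # run the two recognition branches
first_model["second_sub_output"](h)
phoneme_feature, attr_feature = captured["phoneme"], captured["attr"]
```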
S1023: and determining the voice characteristics of the target language test voice according to the acoustic characteristics of the target language test voice, the phoneme characteristics of the target language test voice and the pronunciation attribute characteristics of the target language test voice.
It should be noted that S1023 may be implemented by any embodiment of S123, and only "voice sample" in any embodiment of S123 needs to be replaced with "target language test voice".
Based on the related contents of the above S1021 to S1023, after the target language test speech is obtained, the acoustic feature of the target language test speech may be determined according to the target language test speech; determining the phoneme characteristics and pronunciation attribute characteristics of the target language test voice according to the acoustic characteristics of the target language test voice; and finally, determining the voice characteristics of the target language test voice according to the acoustic characteristics, the phoneme characteristics and the pronunciation attribute characteristics of the target language test voice, so that the voice characteristics of the target language test voice can represent the acoustic information, the phoneme information and the pronunciation attribute information of the target language test voice, and the voice characteristics of the target language test voice can more comprehensively represent the pronunciation information carried by the target language test voice.
S103: and inputting the voice characteristics of the target language test voice into a pre-trained pronunciation defect recognition model to obtain a first recognition result output by the pronunciation defect recognition model.
The pronunciation defect identification model is used for carrying out a pronunciation defect identification process; moreover, the pronunciation defect recognition model can be trained by any implementation mode of the pronunciation defect recognition model training method provided by the embodiment of the application.
The first recognition result is used for indicating whether the user to be recognized (namely the provider of the target language test voice) has pronunciation defects. In addition, the embodiment of the present application does not limit the manner of representing the first recognition result, for example, the first recognition result may be represented by a preset character, which specifically includes: if the first recognition result is '1', the fact that the user to be recognized has pronunciation defects is represented; and if the first recognition result is '0', the fact that the user to be recognized does not have pronunciation defects is shown. For another example, the first recognition result may be represented by a probability value, which is specifically: the first recognition result may include a first probability value representing a possibility that the user to be recognized has the pronunciation defect and/or a second probability value representing a possibility that the user to be recognized does not have the pronunciation defect.
Based on the related content of the above S103, after the voice feature of the target language test voice is obtained, the voice feature of the target language test voice may be directly input to the pre-trained pronunciation defect recognition model, so that the pronunciation defect recognition model performs pronunciation defect recognition according to the voice feature of the target language test voice, and obtains and outputs a first recognition result, so that the first recognition result can accurately indicate whether the user to be recognized has pronunciation defects.
S104: and determining whether the user to be recognized has pronunciation defects according to the first recognition result.
In this embodiment of the application, after the first recognition result is obtained, whether the user to be recognized has a pronunciation defect may be determined according to the first recognition result, and the determining process may specifically be: judging whether the first recognition result meets a first condition or not; if yes, determining that the user to be identified has pronunciation defects; if not, determining that the user to be recognized does not have pronunciation defects.
Wherein, the first condition may be preset; also, the embodiment of the present application does not limit the first condition. For example, when the first recognition result is represented by a preset character, the first condition may be set to "the first recognition result is '1'". For another example, when the first recognition result is represented by a probability value, the first condition may be set to "the first probability value is greater than or equal to the first probability threshold" and/or "the second probability value is less than the first probability threshold". The first probability threshold may be set in advance; for example, the first probability threshold may be 0.5.
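Under the probability representation above, the judgment of S104 reduces to a simple threshold test, for example:

```python
# A small sketch of S104, with a first probability threshold of 0.5 as in the
# example above.
def has_pronunciation_defect(first_result, threshold=0.5):
    # first_result[0] is the first probability value (defect present),
    # first_result[1] the second probability value (defect absent).
    return first_result[0] >= threshold

print(has_pronunciation_defect([0.83, 0.17]))   # True: a defect is determined
```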
Based on the related contents of the above S101 to S104, after the target language test voice of the user to be recognized is obtained, the voice feature of the target language test voice is generated according to the target language test voice; inputting the voice characteristics of the target language test voice into a pre-trained pronunciation defect recognition model to obtain a first recognition result output by the pronunciation defect recognition model; and finally, determining whether the user to be recognized has pronunciation defects or not according to the first recognition result.
The pronunciation information carried by the target language test voice can be comprehensively represented by the voice characteristics of the target language test voice, and the trained pronunciation defect recognition model has better pronunciation defect recognition performance, so that the pronunciation defect recognition model performs pronunciation defect recognition according to the voice characteristics of the target language test voice, and the obtained and output first recognition result can accurately represent whether the user to be recognized has pronunciation defects, so that the conclusion whether the user to be recognized has pronunciation defects or not obtained based on the first recognition result is more accurate.
Method embodiment three
In some cases, if the user to be recognized has mastered both the target language (e.g., Mandarin) and an auxiliary language (e.g., a Chinese dialect), a pronunciation error is likely to occur when the user to be recognized expresses himself or herself in the target language, due to the pronunciation influence of the auxiliary language. For example, in the Hefei dialect the word "east" ("dong") is pronounced the same as in Mandarin, but the word "west" is pronounced "si" (first tone) in the Hefei dialect while it is pronounced "xi" (first tone) in Mandarin; hence, a user to be recognized who has mastered both the Hefei dialect and Mandarin can easily read the word "dongxi" as "dong si" when speaking Mandarin, resulting in a pronunciation error. It can be seen that, in some cases, the pronunciation defect of the user to be recognized may occur because the auxiliary language that the user has already mastered exerts an excessive influence on the user.
In addition, the pronunciation of the user to be recognized may also be problematic due to his or her own physical problems (e.g., pharyngeal problems, tongue problems, etc.). It can be seen that, in some cases, the cause of the pronunciation defect may also be a physical problem of the user to be recognized.
Based on the analysis, the occurrence reasons of the pronunciation defects of the user to be recognized are various, so that after the pronunciation defects of the user to be recognized are determined, the occurrence reasons of the pronunciation defects of the user to be recognized can be further recognized. Based on this, the present application embodiment also provides a possible implementation manner of the pronunciation defect recognition method, in which when the user to be recognized grasps the target language and the auxiliary language, the pronunciation defect recognition method may further include steps 21 to 25 in addition to the above-mentioned S101 to S104:
Step 21: And when the first recognition result is determined to meet the first condition, determining that the user to be recognized has a pronunciation defect. The relevant content of the "first condition" refers to the relevant content of the "first condition" in S104 above.
Step 22: and acquiring the auxiliary language test voice of the user to be recognized.
The auxiliary language test voice is recorded by the user to be recognized using the auxiliary language; and the semantic information carried by the auxiliary language test voice is the same as that carried by the target language test voice.
In addition, the embodiment of the present application does not limit the auxiliary language; for example, the auxiliary language may be a less widely used language. The embodiment of the present application also does not limit the manner of obtaining the auxiliary language test voice; for example, if the target language test voice is obtained by the user to be recognized reading the corpus to be tested in the target language, the auxiliary language test voice may be obtained by the user to be recognized reading the same corpus to be tested in the auxiliary language.
Based on the above related content of step 22, if the user to be recognized can express himself or herself in both the target language and the auxiliary language, then after it is determined according to the target language test voice that the user has a pronunciation defect, the user can read the preset corpus to be tested again using the auxiliary language to obtain the auxiliary language test voice, so that the cause of the pronunciation defect of the user to be recognized can subsequently be determined according to the auxiliary language test voice.
Step 23: and generating the voice characteristics of the auxiliary language test voice according to the auxiliary language test voice.
The voice characteristics of the auxiliary language test voice are used for representing pronunciation information carried by the auxiliary language test voice.
In addition, the phonetic features of the test speech of the auxiliary language may include acoustic features of the test speech of the auxiliary language, phoneme features of the test speech of the auxiliary language, and pronunciation attribute features of the test speech of the auxiliary language. The acoustic characteristics of the auxiliary language test voice are used for representing acoustic information carried by the auxiliary language test voice; the phoneme characteristics of the auxiliary language test voice are used for representing phoneme information carried by the auxiliary language test voice; the pronunciation attribute feature of the auxiliary language test voice is used for representing pronunciation attribute information carried by the auxiliary language test voice.
It should be noted that, step 23 may be implemented by using any embodiment of obtaining the speech features of the speech sample in the auxiliary language (i.e., the language of the speech sample is the auxiliary language) in S12, and only the "speech sample" of any embodiment of obtaining the speech features of the speech sample in the auxiliary language in S12 needs to be replaced by the "test speech in the auxiliary language". For example, in a possible implementation, step 23 may specifically include steps 231-233:
step 231: and extracting acoustic features of the test speech of the auxiliary language from the test speech of the auxiliary language.
It should be noted that, step 231 may be implemented by any embodiment of S121, and only "voice sample" in any embodiment of S121 needs to be replaced with "auxiliary language test voice".
Step 232: and determining the phoneme characteristics of the test speech of the auxiliary language and the pronunciation attribute characteristics of the test speech of the auxiliary language according to the acoustic characteristics of the test speech of the auxiliary language.
It should be noted that, step 232 may be implemented by any embodiment of S1222, and only the "voice sample" in any embodiment of S1222 needs to be replaced by the "auxiliary language test voice". For example, in one possible implementation, step 232 may specifically be: and obtaining the phoneme characteristic of the auxiliary language test voice and the pronunciation attribute characteristic of the auxiliary language test voice according to the acoustic characteristic of the auxiliary language test voice and the pre-trained second model.
The content of the "second model" refers to the content of the "second model" in S1222 above.
In addition, the phoneme features of the auxiliary language test voice are not limited in the embodiments of the present application. For example, when the language of the auxiliary language test voice is the jth language, the phoneme features of the auxiliary language test voice may be the predicted phonemes obtained and output by the second model performing phoneme recognition in the jth language according to the acoustic features of the auxiliary language test voice. For another example, when the language of the auxiliary language test voice is the jth language, the phoneme features of the auxiliary language test voice may also be the data input into the P_j-th sub-output layer when the second model performs phoneme recognition in the jth language according to the acoustic features of the auxiliary language test voice. Wherein j is a positive integer and j is not greater than N.
In addition, the pronunciation attribute features of the auxiliary language test voice are not limited in the embodiments of the present application. For example, when the language of the auxiliary language test voice is the jth language, the pronunciation attribute features of the auxiliary language test voice may be the predicted pronunciation attributes obtained and output by the second model performing pronunciation attribute recognition in the jth language according to the acoustic features of the auxiliary language test voice. For another example, the pronunciation attribute features of the auxiliary language test voice may also be the data input into the A_j-th sub-output layer when the second model performs pronunciation attribute recognition in the jth language according to the acoustic features of the auxiliary language test voice. Wherein j is a positive integer and j is not greater than N.
To facilitate understanding of the phoneme features and pronunciation attribute features of the auxiliary language test voice, the following description is made with reference to an example.
As an example, when the second model includes N language phoneme recognition layers and N language pronunciation attribute recognition layers, the jth language phoneme recognition layer includes a P_j-th sub-output layer, and the jth language pronunciation attribute recognition layer includes an A_j-th sub-output layer (where j and N are positive integers and j is not greater than N), step 232 may specifically be: if the language of the auxiliary language test voice is the jth language, first input the acoustic features of the auxiliary language test voice and the language identifier of the auxiliary language test voice into the second model to obtain the input data of the P_j-th sub-output layer and the input data of the A_j-th sub-output layer; then determine the input data of the P_j-th sub-output layer as the phoneme features of the auxiliary language test voice, and determine the input data of the A_j-th sub-output layer as the pronunciation attribute features of the auxiliary language test voice.
Based on the related content of step 232, after the acoustic feature of the auxiliary language test speech is obtained, the second model may perform phoneme recognition and pronunciation attribute recognition according to the acoustic feature of the auxiliary language test speech to obtain a phoneme feature and a pronunciation attribute feature of the auxiliary language test speech.
Step 233: and determining the voice characteristics of the auxiliary language test voice according to the acoustic characteristics of the auxiliary language test voice, the phoneme characteristics of the auxiliary language test voice and the pronunciation attribute characteristics of the auxiliary language test voice.
It should be noted that, step 233 may be implemented by any embodiment of S123, and only "voice sample" in any embodiment of S123 needs to be replaced with "auxiliary language test voice".
Based on the related contents in steps 231 to 233, after the auxiliary language test voice is obtained, the acoustic feature of the auxiliary language test voice may be determined according to the auxiliary language test voice; determining the phoneme characteristics and pronunciation attribute characteristics of the auxiliary language test voice according to the acoustic characteristics of the auxiliary language test voice; and finally, determining the voice characteristics of the auxiliary language test voice according to the acoustic characteristics, the phoneme characteristics and the pronunciation attribute characteristics of the auxiliary language test voice, so that the voice characteristics of the auxiliary language test voice can represent the acoustic information, the phoneme information and the pronunciation attribute information of the auxiliary language test voice, and the voice characteristics of the auxiliary language test voice can more comprehensively represent the pronunciation information carried by the auxiliary language test voice.
Step 24: and generating the fusion characteristics of the target language test voice according to the voice characteristics of the target language test voice and the voice characteristics of the auxiliary language test voice.
The fusion characteristic of the target language test voice is used for representing voice information carried by the target language test voice and voice information carried by the auxiliary language test voice.
In addition, the embodiment of the present application is not limited to the implementation of step 24, and may be implemented by any existing or future feature fusion method. For example, in one possible implementation, step 24 may specifically include steps 241 to 244:
step 241: and obtaining the target language probability of the target language test voice, the auxiliary language probability of the target language test voice, the target language probability of the auxiliary language test voice and the auxiliary language probability of the auxiliary language test voice according to the acoustic characteristics of the target language test voice, the acoustic characteristics of the auxiliary language test voice and the pre-trained language classification model.
Wherein, the language classification model is used for language identification; moreover, the language classification model is not limited in the embodiment of the application, and any model which can be used for language identification in the prior art or in the future can be adopted for implementation.
In addition, the embodiment of the present application also does not limit the training process of the language classification model, for example, the language classification model may be obtained according to the acoustic features of the classified training speech and the actual language training of the classified training speech. Wherein, the classified training voice refers to voice data required to be used when a language classification model is trained; and the language of the classified training voice can be set according to the application scene of the language classification model. In addition, the acoustic features of the classified training speech serve as the role of the input data of the language classification model in the training process of the language classification model; and the actual language of the classified training speech serves as the role of the label information of the language classification model in the training process of the language classification model, so that the actual language of the classified training speech can guide the updating process of the language classification model.
The target language probability of the target language test voice is the possibility that the language classification model determines the language of the target language test voice as the target language.
The auxiliary language probability of the target language test voice is the possibility that the language classification model determines the language of the target language test voice as the auxiliary language.
The target language probability of the auxiliary language test voice is the possibility that the language classification model determines the language of the auxiliary language test voice as the target language.
The auxiliary language probability of the auxiliary language test voice refers to the possibility that the language classification model determines the language of the auxiliary language test voice as the auxiliary language.
In addition, the embodiment of the present application is not limited to the implementation of step 241, for example, in a possible implementation, step 241 may specifically include steps 2411 to 2412:
step 2411: and inputting the acoustic characteristics of the target language test voice into the language classification model to obtain the target language probability of the target language test voice and the auxiliary language probability of the target language test voice output by the language classification model.
Step 2412: and inputting the acoustic characteristics of the auxiliary language test voice into the language classification model to obtain the target language probability of the auxiliary language test voice and the auxiliary language probability of the auxiliary language test voice output by the language classification model.
Based on the related content of the above step 241, after the acoustic features of the target language test voice and the acoustic features of the auxiliary language test voice are obtained, the language classification model may perform language classification on each of them, so as to obtain the target language probability and the auxiliary language probability of the target language test voice, as well as the target language probability and the auxiliary language probability of the auxiliary language test voice.
Step 242: and determining the feature fusion weight of the target language test voice according to the target language probability of the target language test voice and the target language probability of the auxiliary language test voice.
The feature fusion weight of the target language test voice is an influence proportion occupied by the voice feature of the target language test voice when the voice feature of the target language test voice and the voice feature of the auxiliary language test voice are fused.
In addition, the embodiment of the present application does not limit the calculation manner of the feature fusion weight of the target language test voice, for example, the average value of the target language probability of the target language test voice and the target language probability of the auxiliary language test voice may be determined as the feature fusion weight of the target language test voice.
Step 243: and determining the feature fusion weight of the auxiliary language test voice according to the auxiliary language probability of the target language test voice and the auxiliary language probability of the auxiliary language test voice.
The feature fusion weight of the auxiliary language test voice is an influence proportion occupied by the voice feature of the auxiliary language test voice when the voice feature of the target language test voice and the voice feature of the auxiliary language test voice are fused.
In addition, the embodiment of the present application does not limit the calculation manner of the feature fusion weight of the auxiliary language test speech, for example, the average value of the auxiliary language probability of the target language test speech and the auxiliary language probability of the auxiliary language test speech may be determined as the feature fusion weight of the auxiliary language test speech.
Step 244: and generating the fusion characteristic of the target language test voice according to the voice characteristic of the target language test voice, the characteristic fusion weight of the target language test voice, the voice characteristic of the auxiliary language test voice and the characteristic fusion weight of the auxiliary language test voice.
In this embodiment of the application, after the voice feature of the target language test voice, the feature fusion weight of the target language test voice, the voice feature of the assistant language test voice, and the feature fusion weight of the assistant language test voice are obtained, the voice feature of the target language test voice and the voice feature of the assistant language test voice may be weighted and fused according to the feature fusion weight of the target language test voice and the feature fusion weight of the assistant language test voice, so as to obtain the fusion feature of the target language test voice, so that the fusion feature carries pronunciation information in the target language test voice and pronunciation information in the assistant language test voice.
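The following sketch puts steps 241 to 244 together under the averaging choice described above; the language classification probabilities are stubbed in as plain tuples and the feature dimension is illustrative.

```python
# A hedged sketch of steps 241-244: average the two target-language
# probabilities into the target-language fusion weight (step 242), average the
# two auxiliary-language probabilities into the auxiliary fusion weight
# (step 243), then weight and sum the two speech features (step 244).
import torch

def fuse(target_feat, aux_feat, probs_of_target_speech, probs_of_aux_speech):
    # each probs_* is (target_language_probability, auxiliary_language_probability)
    w_target = (probs_of_target_speech[0] + probs_of_aux_speech[0]) / 2  # step 242
    w_aux = (probs_of_target_speech[1] + probs_of_aux_speech[1]) / 2     # step 243
    return w_target * target_feat + w_aux * aux_feat                     # step 244

target_feat = torch.randn(552)   # speech feature of the target language test voice
aux_feat = torch.randn(552)      # speech feature of the auxiliary language test voice
fused = fuse(target_feat, aux_feat, (0.9, 0.1), (0.3, 0.7))
print(fused.shape)               # torch.Size([552])
```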
Step 25: and inputting the fusion characteristics of the target language test voice into the pronunciation defect recognition model to obtain a second recognition result output by the pronunciation defect recognition model.
The second recognition result can be used for representing whether the fusion feature of the target language test voice carries the pronunciation problem or not, and can also be used for representing whether the language of the auxiliary language test voice has excessive influence on the user to be recognized or not.
In addition, the embodiment of the present application does not limit the manner of representing the second recognition result. For example, the second recognition result may be represented by a preset character, specifically: if the second recognition result is '1', it indicates that the fusion feature of the target language test voice carries the pronunciation problem, so the pronunciation problem carried in the target language test voice is not caused by the excessive influence of the auxiliary language, and it is determined that the pronunciation problem is caused by a physical problem of the user to be recognized; if the second recognition result is '0', it indicates that the fusion feature of the target language test voice does not carry the pronunciation problem, so the pronunciation problem carried in the target language test voice is generated due to the excessive influence of the auxiliary language. For another example, the second recognition result may be represented by a probability value, specifically: the second recognition result may include a third probability value and/or a fourth probability value, wherein the third probability value represents the possibility that the fusion feature of the target language test voice carries a pronunciation problem (and may also represent the possibility that the auxiliary language does not have an excessive influence on the user to be recognized), and the fourth probability value represents the possibility that the fusion feature of the target language test voice does not carry a pronunciation problem (and may also represent the possibility that the auxiliary language has an excessive influence on the user to be recognized).
Step 26: and determining the pronunciation defect generation reason of the user to be recognized according to the second recognition result.
In the embodiment of the application, after the second recognition result is obtained, the cause of the pronunciation defect of the user to be recognized may be determined according to the second recognition result, which may specifically be: judging whether the second recognition result meets a second condition; if yes, determining the second reason as the pronunciation defect generation reason of the user to be recognized; if not, determining the first reason as the pronunciation defect generation reason of the user to be recognized.
Wherein, the second condition may be preset; also, the embodiment of the present application does not limit the second condition. For example, when the second recognition result is represented by a preset character, the second condition may be set to "the second recognition result is '1'". For another example, when the second recognition result is represented by a probability value, the second condition may be set to "the third probability value is greater than or equal to the second probability threshold" and/or "the fourth probability value is less than the second probability threshold". The second probability threshold may be preset; for example, the second probability threshold may be 0.5.
The first reason is that the language of the auxiliary language test voice excessively affects the user to be recognized.
The second reason refers to other reasons than the first reason; and the present embodiment does not limit the second reason. For example, the second cause may be a physical problem of the user to be identified.
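Under the preset-character representation of the second recognition result, step 26 can be sketched as a simple mapping, following the reading in step 25 above that a fused feature still carrying the problem points away from auxiliary-language influence:

```python
# A minimal sketch of step 26 under the character representation described
# above; the returned strings are illustrative.
def defect_cause(second_result):
    if second_result == "1":   # fused feature still carries the pronunciation problem
        return "second reason: e.g., a physical problem of the user to be recognized"
    return "first reason: excessive influence of the auxiliary language"

print(defect_cause("0"))   # first reason: excessive influence of the auxiliary language
```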
Based on the related contents of the above steps 21 to 26, after it is determined that the first recognition result satisfies the first condition, acquiring an auxiliary language test voice of the user to be recognized and a voice feature thereof, then fusing the voice feature of the auxiliary language test voice and the voice feature of the target language test voice to obtain a fusion feature of the target language test voice, inputting the fusion feature of the target language test voice into the pronunciation defect recognition model to obtain a second recognition result output by the pronunciation defect recognition model, so that the second recognition result can accurately represent whether the pronunciation error carried in the target language test voice is caused by the excessive influence of the auxiliary language; and finally, determining the pronunciation defect generation reason of the user to be recognized according to the second recognition result.
It should be noted that the embodiment of the present application does not limit the subsequent application of the conclusion on whether the user to be recognized has a pronunciation defect, or of the cause of the pronunciation defect of the user to be recognized. For example, they may be used to assist a doctor in diagnosing whether the user to be recognized suffers from dysarthria; and dysarthria can be determined according to the conclusion on whether the user to be recognized has a pronunciation defect, the cause of the pronunciation defect of the user to be recognized, and the examination result of the pronunciation organs (such as the pharynx, the tongue, etc.) of the user to be recognized.
In some cases, it is necessary to obtain the pronunciation errors carried in the target language test voice. To meet this requirement, the embodiment of the present application provides another possible implementation of the pronunciation defect recognition method, in which the pronunciation defect recognition method includes, in addition to all or part of the above steps, steps 31 to 34:
step 31: and generating the phonemes of the target language test voice and the pronunciation attributes of the target language test voice according to the target language test voice.
The phonemes of the target language test voice are used to describe the phoneme information carried in the target language test voice (i.e., the phonemes used by the user to be recognized to record the target language test voice).
The pronunciation attribute of the target language test voice is used to describe pronunciation attribute information carried in the target language test voice (i.e., pronunciation attribute used by the user to be recognized to record the target language test voice).
The embodiment of the present application is not limited to the implementation of step 31, for example, in a possible implementation, step 31 may specifically include steps 311 to 312:
step 311: and extracting the acoustic features of the target language test voice from the target language test voice.
It should be noted that, step 311 may be implemented by any embodiment of the foregoing S121, and only "the voice sample" in any embodiment of the foregoing S121 needs to be replaced with "the target language test voice".
Step 312: and inputting the acoustic characteristics of the target language test voice into a pre-constructed first model to obtain the phonemes of the target language test voice output by the first model and the pronunciation attributes of the target language test voice.
The relevant content of the "first model" refers to the relevant content of the "first model" in S1221 above.
Based on the related contents in the above steps 311 to 312, after the target language test voice is obtained, the acoustic feature of the target language test voice may be determined according to the target language test voice; and inputting the acoustic characteristics of the target language test voice into a trained first model so that the first model performs phoneme recognition and pronunciation attribute recognition according to the acoustic characteristics of the target language test voice to obtain and output phonemes and pronunciation attributes of the target language test voice.
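As an illustrative (not prescribed) decoding step, the frame-wise outputs of the first model could be turned into phoneme and pronunciation attribute sequences by taking the argmax per frame and collapsing consecutive repeats; the random logits below merely stand in for real model outputs.

```python
import torch

phoneme_logits = torch.randn(100, 60)   # (frames, phoneme classes), illustrative
attr_logits = torch.randn(100, 20)      # (frames, pronunciation attributes)

phoneme_ids = phoneme_logits.argmax(dim=-1)   # predicted phoneme per frame
attr_ids = attr_logits.argmax(dim=-1)         # predicted attribute per frame

def collapse(ids):
    # Merge consecutive duplicate frame labels into a sequence (an assumed,
    # CTC-style post-processing choice).
    out = []
    for i in ids.tolist():
        if not out or out[-1] != i:
            out.append(i)
    return out

print(collapse(phoneme_ids)[:10])   # first few predicted phoneme ids
```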
Step 32: and comparing the phoneme of the target language test voice with the first standard phoneme to obtain a comparison result of the target language phoneme.
The first standard phoneme comprises an actual phoneme corresponding to the target language test voice; for example, if the target language test speech is obtained by a user to be recognized reading the corpus to be tested in the target language, the first standard phoneme may be determined according to the corpus to be tested and a phoneme table in the target language, so that the first standard phoneme is used to indicate a phoneme that should be used when reading the corpus to be tested in the standard target language.
The target language phoneme comparison result is used for describing the same and/or different parts between the phonemes of the target language test speech and the first standard phoneme.
Step 33: and comparing the pronunciation attribute of the target language test voice with the first standard pronunciation attribute to obtain a comparison result of the target language pronunciation attribute.
The first standard pronunciation attribute comprises an actual pronunciation attribute corresponding to the target language test voice; for example, if the target language test voice is obtained by a user to be recognized reading a corpus to be tested in a target language, the first standard pronunciation attribute may be determined according to the corpus to be tested and a pronunciation attribute table in the target language, so that the first standard pronunciation attribute is used to indicate a pronunciation attribute that should be used when reading the corpus to be tested in the standard target language.
The comparison result of the target language pronunciation attributes is used for describing the same and/or different points between the pronunciation attributes of the target language test voice and the first standard pronunciation attributes.
Step 34: and determining the pronunciation error of the target language according to the comparison result of the phonemes of the target language and the comparison result of the pronunciation attributes of the target language.
The target language pronunciation error is used for describing the wrong pronunciation of the user to be recognized when recording the target language test voice.
In the embodiment of the present application, after the comparison result of the target language phoneme and the comparison result of the target language pronunciation attribute are obtained, the target language pronunciation error may be determined according to the comparison result of the target language phoneme and the comparison result of the target language pronunciation attribute, which specifically is: and determining the pronunciation error of the target language according to the difference recorded in the comparison result of the phonemes of the target language and the difference recorded in the comparison result of the pronunciation attributes of the target language.
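A hedged sketch of steps 32 to 34: align the recognized sequences with the standard ones and collect the mismatches as the target language pronunciation errors; the sequences are assumed to be pre-aligned and the symbols are illustrative (echoing the "dongxi" example above).

```python
# Compare recognized phonemes / pronunciation attributes with the standard
# ones position by position (sequences assumed pre-aligned) and report the
# mismatches as target language pronunciation errors.
def compare(recognized, standard):
    return [(i, r, s) for i, (r, s) in enumerate(zip(recognized, standard)) if r != s]

phonemes = ["d", "ong", "s", "i"]            # recognized from the test voice
standard_phonemes = ["d", "ong", "x", "i"]   # from the corpus + phoneme table
attrs = ["stop", "nasal", "fricative", "vowel"]            # recognized attributes
standard_attrs = ["stop", "nasal", "fricative", "vowel"]   # standard attributes

phoneme_diff = compare(phonemes, standard_phonemes)          # step 32
attr_diff = compare(attrs, standard_attrs)                   # step 33
errors = {"phoneme": phoneme_diff, "attribute": attr_diff}   # step 34
print(errors["phoneme"])   # [(2, 's', 'x')]: 'x' was mispronounced as 's'
```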
Based on the related contents in the above steps 31 to 34, in the embodiment of the present application, after the target language test speech is obtained, the first model may be used to perform phoneme recognition and pronunciation attribute recognition on the target language test speech, so as to obtain and output the phonemes of the target language test speech and the pronunciation attributes of the target language test speech; and determining a target language pronunciation error according to a comparison result between the phoneme of the target language test voice and the first standard phoneme and a comparison result between the pronunciation attribute of the target language test voice and the first standard pronunciation attribute respectively, so that the target language pronunciation error can accurately represent the wrong pronunciation of the user to be identified when the target language test voice is recorded.
In some cases, if the user to be recognized grasps the target language and the auxiliary language, the pronunciation error of the user to be recognized may include a pronunciation error of the user to be recognized in the target language and a pronunciation error of the user to be recognized in the auxiliary language. Based on this, the present application provides still another possible implementation manner of the pronunciation defect identification method, in which the pronunciation defect identification method includes, in addition to all or part of the above steps, step 41-step 43:
step 41: and generating the phonemes of the target language test voice and the pronunciation attributes of the target language test voice according to the target language test voice.
It should be noted that, the relevant content of step 41 refers to the relevant content of step 31 above.
Step 42: and generating the phonemes of the auxiliary language test voice and the pronunciation attributes of the auxiliary language test voice according to the auxiliary language test voice.
Wherein, the phoneme of the test speech of the auxiliary language is used to describe the phoneme information carried in the test speech of the auxiliary language (i.e. the phoneme used by the user to be recognized to record the test speech of the auxiliary language).
The pronunciation attributes of the test speech of the auxiliary language are used to describe pronunciation attribute information carried in the test speech of the auxiliary language (i.e., pronunciation attributes used by the user to be recognized to record the test speech of the auxiliary language).
The embodiment of the present application is not limited to the implementation of step 42, for example, in a possible implementation, step 42 may specifically include steps 421 to 422:
step 421: and extracting acoustic features of the test speech of the auxiliary language from the test speech of the auxiliary language.
It should be noted that step 421 may be implemented by any embodiment of S121; it suffices to replace "voice sample" in any embodiment of S121 with "auxiliary language test voice".
Step 422: and inputting the acoustic characteristics of the auxiliary language test voice and the language identification of the auxiliary language test voice into a pre-constructed second model to obtain the phonemes of the auxiliary language test voice output by the second model and the pronunciation attributes of the auxiliary language test voice.
The content of the "second model" refers to the content of the "second model" in S1222 above.
It should be noted that the embodiment of the present application is not limited to the implementation of step 422, for example, when the second model includes N language phoneme recognition layers and N language pronunciation attribute recognition layers, step 422 may specifically be: if the language of the auxiliary language test voice is the jth language, inputting the acoustic features of the auxiliary language test voice and the language identification of the auxiliary language test voice into the second model to obtain the phonemes of the auxiliary language test voice output by the jth language phoneme recognition layer in the second model and the pronunciation attributes of the auxiliary language test voice output by the jth language pronunciation attribute recognition layer in the second model.
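As a rough sketch of the routing just described (not the actual network of this application — the layer types and dimensions below are assumptions), a second model with N per-language recognition layers could look like this in PyTorch, where the language identification selects the j-th phoneme head and the j-th pronunciation attribute head:

    import torch
    import torch.nn as nn

    class SecondModelSketch(nn.Module):
        def __init__(self, n_langs, feat_dim=40, hidden_dim=256, n_phonemes=60, n_attrs=24):
            super().__init__()
            # Shared acoustic encoder (an assumption; the application does not fix the architecture).
            self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            # One phoneme recognition layer and one pronunciation attribute
            # recognition layer per language.
            self.phoneme_layers = nn.ModuleList([nn.Linear(hidden_dim, n_phonemes) for _ in range(n_langs)])
            self.attr_layers = nn.ModuleList([nn.Linear(hidden_dim, n_attrs) for _ in range(n_langs)])

        def forward(self, acoustic_feats, lang_id):
            hidden, _ = self.encoder(acoustic_feats)
            # The language identification routes the shared representation to
            # the lang_id-th (i.e., j-th) recognition layers.
            return self.phoneme_layers[lang_id](hidden), self.attr_layers[lang_id](hidden)

A call such as model(feats, lang_id=j) then yields the phonemes and pronunciation attributes for a j-th-language utterance.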
Based on the related contents of the above steps 421 to 422, after the auxiliary language test voice is obtained, the acoustic feature of the auxiliary language test voice may be determined according to the auxiliary language test voice; and inputting the acoustic characteristics of the auxiliary language test voice into a trained second model so that the second model carries out phoneme recognition and pronunciation attribute recognition according to the acoustic characteristics of the auxiliary language test voice to obtain and output phonemes and pronunciation attributes of the auxiliary language test voice.
Step 43: and determining the pronunciation error of the user to be recognized according to the phoneme of the target language test voice, the pronunciation attribute of the target language test voice, the phoneme of the auxiliary language test voice, the pronunciation attribute of the auxiliary language test voice, the first standard phoneme, the first standard pronunciation attribute, the second standard phoneme and the second standard pronunciation attribute.
Wherein, for the relevant content of the "first standard phoneme", refer to the relevant content of the "first standard phoneme" in step 32 above; for the relevant content of the "first standard pronunciation attribute", refer to the relevant content of the "first standard pronunciation attribute" in step 32 above.
The second standard phoneme comprises an actual phoneme corresponding to the auxiliary language test voice; for example, if the auxiliary language test speech is obtained by a user to be recognized reading a corpus to be tested using an auxiliary language, the second standard phoneme may be determined according to the corpus to be tested and a phoneme table in the auxiliary language, so that the second standard phoneme is used to indicate a phoneme that should be used when reading the corpus to be tested according to the standard auxiliary language.
The second standard pronunciation attribute comprises an actual pronunciation attribute corresponding to the auxiliary language test voice; for example, if the auxiliary language test speech is obtained by the user to be recognized reading the corpus to be tested using the auxiliary language, the second standard pronunciation attribute may be determined according to the corpus to be tested and the pronunciation attribute table in the auxiliary language, so that the second standard pronunciation attribute is used to indicate the pronunciation attribute that should be used when reading the corpus to be tested according to the standard auxiliary language.
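A toy sketch of this lookup follows; the tables below are invented examples for illustration, not the application's actual phoneme or pronunciation attribute tables.

    # Hypothetical per-language tables; a real system would cover the full lexicon.
    PHONEME_TABLE = {"ma": ["m", "a"], "ba": ["b", "a"]}
    ATTRIBUTE_TABLE = {"m": ["bilabial", "nasal"], "b": ["bilabial", "plosive"], "a": ["open", "low"]}

    def standard_targets(corpus_words):
        """Derive the second standard phonemes and pronunciation attributes
        from the corpus to be tested."""
        phonemes = [p for word in corpus_words for p in PHONEME_TABLE[word]]
        attributes = [a for p in phonemes for a in ATTRIBUTE_TABLE[p]]
        return phonemes, attributes

    # e.g., standard_targets(["ma", "ba"]) -> (["m", "a", "b", "a"], [...])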
The pronunciation error of the user to be recognized is used for describing the wrong pronunciation of the user to be recognized when recording the target language test voice and the auxiliary language test voice; and the pronunciation error of the user to be recognized may include a phoneme error of the user to be recognized and/or a pronunciation attribute error of the user to be recognized.
The phoneme errors of the user to be recognized are used for describing error phonemes which occur when the user to be recognized records the target language test voice and the auxiliary language test voice. In addition, the embodiment of the present application does not limit the determination process of the phoneme error of the user to be identified, for example, in a possible implementation manner, the determination process of the phoneme error of the user to be identified may include steps 51 to 53:
step 51: and comparing the phoneme of the target language test voice with the first standard phoneme to obtain a comparison result of the target language phoneme.
It should be noted that the relevant content of step 51 refers to the relevant content of step 32 above.
Step 52: and comparing the phoneme of the auxiliary language test voice with the second standard phoneme to obtain an auxiliary language phoneme comparison result.
The auxiliary language phoneme comparison result is used to describe the similarities and/or differences between the phoneme of the auxiliary language test speech and the second standard phoneme.
Step 53: determine the phoneme error of the user to be identified according to the target language phoneme comparison result and the auxiliary language phoneme comparison result.
In the embodiment of the application, after the target language phoneme comparison result and the auxiliary language phoneme comparison result are obtained, the phoneme error of the user to be identified can be determined according to the target language phoneme comparison result and the auxiliary language phoneme comparison result; and the determination process may specifically be: and determining the phoneme error of the user to be recognized according to the difference recorded in the target language phoneme comparison result and the difference recorded in the auxiliary language phoneme comparison result, so that the phoneme error is used for representing the error phoneme of the user to be recognized when recording the target language test voice and the auxiliary language test voice.
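Purely as an illustration of steps 51 to 53 (the merging rule below is an assumption; the embodiment only requires that the error be determined from the recorded differences), the two comparison results could be combined as follows:

    def user_phoneme_errors(target_diffs, auxiliary_diffs):
        """Merge the differences recorded in the target language and auxiliary
        language phoneme comparison results into one per-user error report."""
        errors = [dict(language="target", **d) for d in target_diffs]
        errors += [dict(language="auxiliary", **d) for d in auxiliary_diffs]
        # A phoneme mispronounced in both languages may hint at a systematic
        # deficit rather than a language-specific habit (an assumption, not a
        # rule stated by this embodiment).
        shared = {d["standard"] for d in target_diffs} & {d["standard"] for d in auxiliary_diffs}
        return {"errors": errors, "phonemes_wrong_in_both": sorted(shared)}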
The pronunciation attribute error of the user to be recognized is used for describing the wrong pronunciation attribute of the user to be recognized when recording the target language test voice and the auxiliary language test voice. In addition, the embodiment of the present application does not limit the determination process of the pronunciation attribute error of the user to be recognized, for example, in a possible implementation, the determination process of the pronunciation attribute error of the user to be recognized may include steps 61 to 63:
step 61: and comparing the pronunciation attribute of the target language test voice with the first standard pronunciation attribute to obtain a comparison result of the target language pronunciation attribute.
It should be noted that, for the relevant content of step 61, refer to the relevant content of step 33 above.
Step 62: and comparing the pronunciation attribute of the auxiliary language test voice with the second standard pronunciation attribute to obtain an auxiliary language pronunciation attribute comparison result.
The auxiliary language pronunciation attribute comparison result is used to describe the similarities and/or differences between the pronunciation attribute of the auxiliary language test voice and the second standard pronunciation attribute.
Step 63: determine the pronunciation attribute error of the user to be identified according to the target language pronunciation attribute comparison result and the auxiliary language pronunciation attribute comparison result.
In the embodiment of the application, after the comparison result of the target language pronunciation attribute and the comparison result of the auxiliary language pronunciation attribute are obtained, the pronunciation attribute error of the user to be identified can be determined according to the comparison result of the target language pronunciation attribute and the comparison result of the auxiliary language pronunciation attribute; and the determination process may specifically be: and determining the pronunciation attribute error of the user to be recognized according to the difference recorded in the comparison result of the target language pronunciation attribute and the difference recorded in the comparison result of the auxiliary language pronunciation attribute, so that the pronunciation attribute error is used for representing the wrong pronunciation attribute of the user to be recognized when recording the target language test voice and the auxiliary language test voice.
Based on the above-mentioned related contents of steps 41 to 43, if the user to be recognized can express himself or herself in both the target language and the auxiliary language, the pronunciation error of the user to be recognized can be determined according to the target language test voice and the auxiliary language test voice recorded by the user to be recognized, so that the pronunciation error accurately describes the mispronunciations (e.g., wrong phonemes and/or wrong pronunciation attributes) occurring when the two test voices were recorded, and corresponding processing can subsequently be performed according to the pronunciation error (e.g., displaying it to the user to be recognized, recommending pronunciation correction exercises, etc.).
Based on the pronunciation defect recognition model training method provided by the above method embodiment, the embodiment of the present application further provides a pronunciation defect recognition model training device, which is explained and explained below with reference to the accompanying drawings.
Apparatus embodiment one
This apparatus embodiment introduces the pronunciation defect recognition model training apparatus; for related contents, please refer to the above method embodiments.
Referring to fig. 11, the drawing is a schematic structural diagram of a pronunciation defect recognition model training device according to an embodiment of the present application.
The pronunciation defect recognition model training device 1100 provided by the embodiment of the application comprises:
A first obtaining unit 1101, configured to obtain a voice sample and a defect label of the voice sample; wherein the defect label of the voice sample is used for indicating whether the provider of the voice sample has pronunciation defect;
a first generating unit 1102, configured to generate a speech feature of the speech sample according to the speech sample; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
and the model training unit 1103 is configured to train a pronunciation defect recognition model according to the voice features of the voice samples and the defect labels of the voice samples.
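For orientation only, a minimal training loop consistent with units 1101 to 1103 might look like the following sketch, assuming the speech features are concatenated into one tensor per sample, the defect label is binary, and the model is a generic classifier (none of which is fixed by this application):

    import torch
    import torch.nn as nn

    def train_defect_model(model, features, labels, epochs=10, lr=1e-3):
        """features: (num_samples, feat_dim) speech features (acoustic +
        phoneme + pronunciation attribute, concatenated); labels: float
        tensor, 1.0 if the provider has a pronunciation defect, else 0.0."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            logits = model(features).squeeze(-1)
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()
        return model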
In a possible implementation, the first generating unit 1102 includes:
a first extraction subunit, configured to extract an acoustic feature of the voice sample from the voice sample;
the second determining subunit is used for determining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample;
and the third determining subunit is used for determining the voice characteristics of the voice sample according to the acoustic characteristics of the voice sample, the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample.
In a possible implementation, the second determining subunit includes:
the fourth determining subunit is configured to, when the language of the voice sample is the target language, obtain a phoneme feature of the voice sample and a pronunciation attribute feature of the voice sample according to the acoustic feature of the voice sample and a pre-trained first model; the first model is obtained by training according to the acoustic features of the target language training voice, the actual phonemes of the target language training voice and the actual pronunciation attributes of the target language training voice;
a fifth determining subunit, configured to, when the language of the voice sample is an auxiliary language, obtain a phoneme feature of the voice sample and a pronunciation attribute feature of the voice sample according to an acoustic feature of the voice sample and a pre-trained second model; the second model is obtained by training according to the acoustic features of the auxiliary language training voice, the actual phonemes of the auxiliary language training voice and the actual pronunciation attributes of the auxiliary language training voice.
In one possible implementation, the first model includes a target language phoneme recognition layer and a target language pronunciation attribute recognition layer, the target language phoneme recognition layer includes a first sub-output layer, and the target language pronunciation attribute recognition layer includes a second sub-output layer;
the fourth determining subunit is specifically configured to: inputting the acoustic features of the voice sample into the first model to obtain first input data of the first sub-output layer and first input data of the second sub-output layer; determining first input data of the first sub-output layer as a phoneme feature of the voice sample; and determining the first input data of the second sub-output layer as the pronunciation attribute characteristics of the voice sample.
In one possible embodiment, the second model comprises N language phoneme recognition layers and N language pronunciation attribute recognition layers; the i-th language phoneme recognition layer comprises a P_i-th sub-output layer, the i-th language pronunciation attribute recognition layer comprises an A_i-th sub-output layer, i is a positive integer, i is not greater than N, and N is a positive integer;
The fifth determining subunit is specifically configured to: when the auxiliary language is the i-th language, input the acoustic characteristics of the voice sample and the language identification of the voice sample into the second model to obtain input data of the P_i-th sub-output layer and input data of the A_i-th sub-output layer; determine the input data of the P_i-th sub-output layer as the phoneme characteristics of the voice sample; and determine the input data of the A_i-th sub-output layer as the pronunciation attribute characteristics of the voice sample.
Based on the pronunciation defect recognition method provided by the above method embodiment, the embodiment of the present application further provides a pronunciation defect recognition device, which is explained and explained below with reference to the accompanying drawings.
Apparatus embodiment two
This apparatus embodiment introduces the pronunciation defect recognition apparatus; for related contents, please refer to the above method embodiments.
Referring to fig. 12, the drawing is a schematic structural diagram of a pronunciation defect recognition apparatus according to an embodiment of the present application.
The pronunciation defect recognition apparatus 1200 provided in the embodiment of the present application includes:
a second obtaining unit 1201, configured to obtain a target language test voice of a user to be recognized;
a second generating unit 1202, configured to generate a speech feature of the target language test speech according to the target language test speech; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
a first recognition unit 1203, configured to input a speech feature of the target language test speech into a pre-trained pronunciation defect recognition model, so as to obtain a first recognition result output by the pronunciation defect recognition model; the pronunciation defect recognition model is trained by any implementation mode of the pronunciation defect recognition model training method provided by the embodiment of the application;
a first determining unit 1204, configured to determine whether the user to be identified has a pronunciation defect according to the first recognition result.
In a possible implementation, the second generating unit 1202 includes:
a sixth determining subunit, configured to extract an acoustic feature of the target language test speech from the target language test speech;
a seventh determining subunit, configured to determine, according to the acoustic feature of the target language test speech, a phoneme feature of the target language test speech and a pronunciation attribute feature of the target language test speech;
and the eighth determining subunit is configured to determine the speech feature of the target language test speech according to the acoustic feature of the target language test speech, the phoneme feature of the target language test speech, and the pronunciation attribute feature of the target language test speech.
In a possible implementation manner, the seventh determining subunit is specifically configured to:
obtaining the phoneme characteristic of the target language test voice and the pronunciation attribute characteristic of the target language test voice according to the acoustic characteristic of the target language test voice and a pre-trained first model; the first model is obtained by training according to the acoustic features of the target language training voice, the actual phonemes of the target language training voice and the actual pronunciation attributes of the target language training voice.
In one possible implementation, the first model includes a target language phoneme recognition layer and a target language pronunciation attribute recognition layer, the target language phoneme recognition layer includes a first sub-output layer, and the target language pronunciation attribute recognition layer includes a second sub-output layer;
the seventh determining subunit is specifically configured to: inputting the acoustic features of the target language test voice into the first model to obtain second input data of the first sub-output layer and second input data of the second sub-output layer; determining second input data of the first sub-output layer as phoneme characteristics of the target language test voice; and determining second input data of the second sub-output layer as pronunciation attribute characteristics of the target language test voice.
In a possible implementation, the pronunciation defect recognition apparatus 1200 further includes:
a third generating unit, configured to generate phonemes of the target language test speech and pronunciation attributes of the target language test speech according to the target language test speech;
the first comparison unit is used for comparing the phoneme of the target language test voice with the first standard phoneme to obtain a target language phoneme comparison result;
the second comparison unit is used for comparing the pronunciation attribute of the target language test voice with the first standard pronunciation attribute to obtain a comparison result of the pronunciation attribute of the target language;
and the second determining unit is used for determining the pronunciation error of the target language according to the comparison result of the target language phoneme and the comparison result of the pronunciation attributes of the target language.
In a possible implementation, the third generating unit is specifically configured to: extracting acoustic features of the target language test voice from the target language test voice; inputting the acoustic features of the target language test voice into a pre-constructed first model to obtain the phonemes of the target language test voice output by the first model and the pronunciation attributes of the target language test voice; the first model is obtained by training according to the acoustic features of the target language training voice, the actual phonemes of the target language training voice and the actual pronunciation attributes of the target language training voice.
In a possible implementation, the pronunciation defect recognition apparatus 1200 further includes:
the third determining unit is used for determining that the user to be recognized has pronunciation defects when the first recognition result is determined to meet the first condition;
the third acquisition unit is used for acquiring the auxiliary language test voice of the user to be recognized;
a fourth generating unit, configured to generate a speech feature of the auxiliary language test speech according to the auxiliary language test speech;
a fifth generating unit, configured to generate a fusion feature of the target language test voice according to the voice feature of the target language test voice and the voice feature of the auxiliary language test voice;
the second recognition unit is used for inputting the fusion characteristics of the target language test voice into the pronunciation defect recognition model to obtain a second recognition result output by the pronunciation defect recognition model;
and the fourth determining unit is used for determining the pronunciation defect generation reason of the user to be recognized according to the second recognition result.
In a possible implementation, the fourth generating unit includes:
a ninth determining subunit, configured to extract an acoustic feature of the auxiliary language test speech from the auxiliary language test speech;
a tenth determining subunit, configured to determine, according to the acoustic feature of the auxiliary language test speech, a phoneme feature of the auxiliary language test speech and a pronunciation attribute feature of the auxiliary language test speech;
an eleventh determining subunit, configured to determine a speech feature of the auxiliary language test speech according to the acoustic feature of the auxiliary language test speech, the phoneme feature of the auxiliary language test speech, and the pronunciation attribute feature of the auxiliary language test speech.
In a possible implementation manner, the tenth determining subunit is specifically configured to: obtaining the phoneme characteristic of the auxiliary language test voice and the pronunciation attribute characteristic of the auxiliary language test voice according to the acoustic characteristic of the auxiliary language test voice and a pre-trained second model; the second model is obtained by training according to the acoustic features of the auxiliary language training voice, the actual phonemes of the auxiliary language training voice and the actual pronunciation attributes of the auxiliary language training voice.
In one possible embodiment, the second model comprises N language phoneme recognition layers and N language pronunciation attribute recognition layers; the j-th language phoneme recognition layer comprises a P_j-th sub-output layer, the j-th language pronunciation attribute recognition layer comprises an A_j-th sub-output layer, j is a positive integer, j is not greater than N, and N is a positive integer;
The tenth determining subunit is specifically configured to: when the language of the auxiliary language test voice is the j-th language, input the acoustic characteristics of the auxiliary language test voice and the language identification of the auxiliary language test voice into the second model to obtain input data of the P_j-th sub-output layer and input data of the A_j-th sub-output layer; determine the input data of the P_j-th sub-output layer as the phoneme characteristics of the auxiliary language test voice; and determine the input data of the A_j-th sub-output layer as the pronunciation attribute characteristics of the auxiliary language test voice.
In a possible implementation manner, the fifth generating unit is specifically configured to: obtaining a target language probability of the target language test voice, an auxiliary language probability of the target language test voice, a target language probability of the auxiliary language test voice and an auxiliary language probability of the auxiliary language test voice according to the acoustic features of the target language test voice, the acoustic features of the auxiliary language test voice and a pre-trained language classification model; the language classification model is obtained by training according to the acoustic characteristics of the classified training voice and the actual language of the classified training voice; determining the feature fusion weight of the target language test voice according to the target language probability of the target language test voice and the target language probability of the auxiliary language test voice; determining the feature fusion weight of the auxiliary language test voice according to the auxiliary language probability of the target language test voice and the auxiliary language probability of the auxiliary language test voice; and generating the fusion feature of the target language test voice according to the voice feature of the target language test voice, the feature fusion weight of the target language test voice, the voice feature of the auxiliary language test voice and the feature fusion weight of the auxiliary language test voice.
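The weighting formula below is an assumption for illustration — the embodiment only specifies which probabilities feed which weight — but it shows the shape of such a fusion, assuming the two feature tensors have already been brought to a common shape:

    import torch.nn.functional as F

    def fuse_features(target_feats, aux_feats, lang_logits_target, lang_logits_aux):
        """Index 0 = target language probability, index 1 = auxiliary language
        probability, as produced by a two-class language classification model."""
        p_t = F.softmax(lang_logits_target, dim=-1)  # probabilities for the target language test voice
        p_a = F.softmax(lang_logits_aux, dim=-1)     # probabilities for the auxiliary language test voice
        # Weight of the target language test voice: from the two target language probabilities.
        w_target = p_t[0] / (p_t[0] + p_a[0])
        # Weight of the auxiliary language test voice: from the two auxiliary language probabilities.
        w_aux = p_a[1] / (p_t[1] + p_a[1])
        return w_target * target_feats + w_aux * aux_feats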
In a possible implementation, the pronunciation defect recognition apparatus 1200 further includes:
a sixth generating unit, configured to generate phonemes of the target language test speech and pronunciation attributes of the target language test speech according to the target language test speech;
a seventh generating unit, configured to generate phonemes of the auxiliary language test speech and pronunciation attributes of the auxiliary language test speech according to the auxiliary language test speech;
a fifth determining unit, configured to determine a pronunciation error of the user to be recognized according to the phoneme of the target language test speech, the pronunciation attribute of the target language test speech, the phoneme of the auxiliary language test speech, the pronunciation attribute of the auxiliary language test speech, the first standard phoneme, the first standard pronunciation attribute, the second standard phoneme, and the second standard pronunciation attribute.
In a possible implementation manner, the sixth generating unit is specifically configured to: extracting acoustic features of the auxiliary language test voice from the auxiliary language test voice; inputting the acoustic characteristics of the auxiliary language test voice and the language identification of the auxiliary language test voice into a pre-constructed second model to obtain the phonemes of the auxiliary language test voice output by the second model and the pronunciation attributes of the auxiliary language test voice; the second model is obtained by training according to the acoustic features of the auxiliary language training voice, the actual phonemes of the auxiliary language training voice and the actual pronunciation attributes of the auxiliary language training voice.
In a possible implementation manner, the pronunciation error of the user to be recognized comprises a phoneme error of the user to be recognized and/or a pronunciation attribute error of the user to be recognized;
the phoneme error determining process of the user to be identified comprises the following steps: comparing the phoneme of the target language test voice with the first standard phoneme to obtain a target language phoneme comparison result; comparing the phoneme of the auxiliary language test voice with the second standard phoneme to obtain an auxiliary language phoneme comparison result; determining phoneme errors of the user to be identified according to the target language phoneme comparison result and the auxiliary language phoneme comparison result;
the process for determining the pronunciation attribute error of the user to be identified comprises the following steps: comparing the pronunciation attribute of the target language test voice with the first standard pronunciation attribute to obtain a target language pronunciation attribute comparison result; comparing the pronunciation attribute of the auxiliary language test voice with the second standard pronunciation attribute to obtain an auxiliary language pronunciation attribute comparison result; and determining that the pronunciation attribute of the user to be identified is wrong according to the comparison result of the target language pronunciation attribute and the comparison result of the auxiliary language pronunciation attribute.
In a possible implementation manner, the fourth determining unit is specifically configured to: when it is determined that the second recognition result meets the second condition, determine a first reason as the pronunciation defect generation reason of the user to be recognized; the first reason is that the language of the auxiliary language test voice has an excessive influence on the user to be recognized.
Further, an embodiment of the present application further provides pronunciation defect recognition model training equipment, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions, which when executed by the processor, cause the processor to perform any one of the implementation methods of the pronunciation defect recognition model training method described above.
Further, an embodiment of the present application further provides a pronunciation defect recognition apparatus, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which when executed by the processor cause the processor to execute any implementation method of the pronunciation defect identification method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a terminal device, the terminal device is caused to execute any implementation method of the pronunciation defect recognition model training method or execute any implementation method of the pronunciation defect recognition method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the pronunciation defect recognition model training method or execute any implementation method of the pronunciation defect recognition method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (22)

1. A pronunciation defect recognition model training method is characterized by comprising the following steps:
acquiring a voice sample and a defect label of the voice sample; wherein the defect label of the voice sample is used for indicating whether the provider of the voice sample has pronunciation defect;
generating voice characteristics of the voice sample according to the voice sample; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
and training a pronunciation defect recognition model according to the voice characteristics of the voice sample and the defect label of the voice sample.
2. The method of claim 1, wherein generating the speech features of the speech sample from the speech sample comprises:
extracting acoustic features of the voice sample from the voice sample;
determining phoneme characteristics of the voice sample and pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample;
and determining the voice characteristics of the voice sample according to the acoustic characteristics of the voice sample, the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample.
3. The method according to claim 2, wherein when the language of the speech sample is the target language, the determining the phoneme characteristics of the speech sample and the pronunciation attribute characteristics of the speech sample according to the acoustic characteristics of the speech sample comprises:
obtaining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample and a pre-trained first model; the first model is obtained by training according to the acoustic features of the target language training voice, the actual phonemes of the target language training voice and the actual pronunciation attributes of the target language training voice;
when the language of the voice sample is the auxiliary language, determining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample, including:
obtaining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample and a pre-trained second model; the second model is obtained by training according to the acoustic features of the auxiliary language training voice, the actual phonemes of the auxiliary language training voice and the actual pronunciation attributes of the auxiliary language training voice.
4. The method according to claim 3, wherein said first model comprises a target language phoneme recognition layer and a target language pronunciation attribute recognition layer, said target language phoneme recognition layer comprising a first sub-output layer, said target language pronunciation attribute recognition layer comprising a second sub-output layer;
the obtaining of the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample and the pre-trained first model comprises:
inputting the acoustic features of the voice sample into the first model to obtain first input data of the first sub-output layer and first input data of the second sub-output layer;
determining first input data of the first sub-output layer as a phoneme feature of the voice sample;
and determining the first input data of the second sub-output layer as the pronunciation attribute characteristics of the voice sample.
5. The method of claim 3, wherein the second model comprises N language phoneme recognition layers and N language pronunciation attribute recognition layers, the i-th language phoneme recognition layer comprises a P_i-th sub-output layer, the i-th language pronunciation attribute recognition layer comprises an A_i-th sub-output layer, i is a positive integer, i is not greater than N, and N is a positive integer;
when the auxiliary language is the i-th language, the obtaining the phoneme characteristics of the voice sample and the pronunciation attribute characteristics of the voice sample according to the acoustic characteristics of the voice sample and a pre-trained second model comprises:
inputting the acoustic characteristics of the voice sample and the language identification of the voice sample into the second model to obtain input data of the P_i-th sub-output layer and input data of the A_i-th sub-output layer;
determining the input data of the P_i-th sub-output layer as the phoneme characteristics of the voice sample;
and determining the input data of the A_i-th sub-output layer as the pronunciation attribute characteristics of the voice sample.
6. A pronunciation defect recognition method, the method comprising:
acquiring a target language test voice of a user to be identified;
generating the voice characteristics of the target language test voice according to the target language test voice; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
inputting the voice characteristics of the target language test voice into a pre-trained pronunciation defect recognition model to obtain a first recognition result output by the pronunciation defect recognition model; wherein the pronunciation defect recognition model is trained by the pronunciation defect recognition model training method according to any one of claims 1 to 5;
and determining whether the user to be recognized has pronunciation defects according to the first recognition result.
7. The method according to claim 6, wherein said generating speech features of said target language test speech based on said target language test speech comprises:
extracting acoustic features of the target language test voice from the target language test voice;
determining the phoneme characteristic of the target language test voice and the pronunciation attribute characteristic of the target language test voice according to the acoustic characteristic of the target language test voice;
and determining the voice characteristics of the target language test voice according to the acoustic characteristics of the target language test voice, the phoneme characteristics of the target language test voice and the pronunciation attribute characteristics of the target language test voice.
8. The method according to claim 7, wherein said determining phoneme characteristics of said test speech in said target language and pronunciation attribute characteristics of said test speech in said target language based on acoustic characteristics of said test speech in said target language comprises:
obtaining the phoneme characteristic of the target language test voice and the pronunciation attribute characteristic of the target language test voice according to the acoustic characteristic of the target language test voice and a pre-trained first model; the first model is obtained by training according to the acoustic features of the target language training voice, the actual phonemes of the target language training voice and the actual pronunciation attributes of the target language training voice.
9. The method according to claim 8, wherein said first model comprises a target language phoneme recognition layer and a target language pronunciation attribute recognition layer, said target language phoneme recognition layer comprising a first sub-output layer, said target language pronunciation attribute recognition layer comprising a second sub-output layer;
the obtaining of the phoneme characteristics of the target language test voice and the pronunciation attribute characteristics of the target language test voice according to the acoustic characteristics of the target language test voice and the pre-trained first model comprises:
inputting the acoustic features of the target language test voice into the first model to obtain second input data of the first sub-output layer and second input data of the second sub-output layer;
determining second input data of the first sub-output layer as phoneme characteristics of the target language test voice;
and determining second input data of the second sub-output layer as pronunciation attribute characteristics of the target language test voice.
10. The method of claim 6, further comprising:
generating phonemes of the target language test voice and pronunciation attributes of the target language test voice according to the target language test voice;
comparing the phoneme of the target language test voice with the first standard phoneme to obtain a target language phoneme comparison result;
comparing the pronunciation attribute of the target language test voice with the first standard pronunciation attribute to obtain a target language pronunciation attribute comparison result;
and determining the pronunciation error of the target language according to the comparison result of the phonemes of the target language and the comparison result of the pronunciation attributes of the target language.
11. The method according to claim 10, wherein said generating phonemes for said target language test speech and pronunciation attributes for said target language test speech based on said target language test speech comprises:
extracting acoustic features of the target language test voice from the target language test voice;
inputting the acoustic features of the target language test voice into a pre-constructed first model to obtain the phonemes of the target language test voice output by the first model and the pronunciation attributes of the target language test voice; the first model is obtained by training according to the acoustic features of the target language training voice, the actual phonemes of the target language training voice and the actual pronunciation attributes of the target language training voice.
12. The method according to any one of claims 6-11, further comprising:
when the first recognition result is determined to meet a first condition, determining that the user to be recognized has pronunciation defects, and acquiring auxiliary language test voice of the user to be recognized;
generating the voice characteristics of the auxiliary language test voice according to the auxiliary language test voice;
generating a fusion characteristic of the target language test voice according to the voice characteristic of the target language test voice and the voice characteristic of the auxiliary language test voice;
inputting the fusion characteristics of the target language test voice into the pronunciation defect recognition model to obtain a second recognition result output by the pronunciation defect recognition model;
and determining the generation reason of the pronunciation defect of the user to be recognized according to the second recognition result.
13. The method according to claim 12, wherein said generating speech features of said test speech of an auxiliary language based on said test speech of an auxiliary language comprises:
extracting acoustic features of the auxiliary language test voice from the auxiliary language test voice;
determining the phoneme characteristic of the auxiliary language test voice and the pronunciation attribute characteristic of the auxiliary language test voice according to the acoustic characteristic of the auxiliary language test voice;
and determining the voice characteristics of the auxiliary language test voice according to the acoustic characteristics of the auxiliary language test voice, the phoneme characteristics of the auxiliary language test voice and the pronunciation attribute characteristics of the auxiliary language test voice.
14. The method according to claim 13, wherein said determining phoneme characteristics of said test speech of assistant language and pronunciation attribute characteristics of said test speech of assistant language according to acoustic characteristics of said test speech of assistant language comprises:
obtaining the phoneme characteristic of the auxiliary language test voice and the pronunciation attribute characteristic of the auxiliary language test voice according to the acoustic characteristic of the auxiliary language test voice and a pre-trained second model; the second model is obtained by training according to the acoustic features of the auxiliary language training voice, the actual phonemes of the auxiliary language training voice and the actual pronunciation attributes of the auxiliary language training voice.
15. The method of claim 14, wherein the second model comprises N language phoneme recognition layers and N language pronunciation attribute recognition layers, the j-th language phoneme recognition layer comprises a P_j-th sub-output layer, the j-th language pronunciation attribute recognition layer comprises an A_j-th sub-output layer, j is a positive integer, j is not greater than N, and N is a positive integer;
when the language of the auxiliary language test voice is the j-th language, the obtaining the phoneme characteristics of the auxiliary language test voice and the pronunciation attribute characteristics of the auxiliary language test voice according to the acoustic characteristics of the auxiliary language test voice and a pre-trained second model comprises:
inputting the acoustic characteristics of the auxiliary language test voice and the language identification of the auxiliary language test voice into the second model to obtain input data of the P_j-th sub-output layer and input data of the A_j-th sub-output layer;
determining the input data of the P_j-th sub-output layer as the phoneme characteristics of the auxiliary language test voice;
and determining the input data of the A_j-th sub-output layer as the pronunciation attribute characteristics of the auxiliary language test voice.
16. The method according to claim 12, wherein generating the fused feature of the test speech of the target language based on the speech feature of the test speech of the target language and the speech feature of the test speech of the assistant language comprises:
obtaining a target language probability of the target language test voice, an auxiliary language probability of the target language test voice, a target language probability of the auxiliary language test voice and an auxiliary language probability of the auxiliary language test voice according to the acoustic features of the target language test voice, the acoustic features of the auxiliary language test voice and a pre-trained language classification model; the language classification model is obtained by training according to the acoustic characteristics of the classified training voice and the actual language of the classified training voice;
determining the feature fusion weight of the target language test voice according to the target language probability of the target language test voice and the target language probability of the auxiliary language test voice;
determining the feature fusion weight of the auxiliary language test voice according to the auxiliary language probability of the target language test voice and the auxiliary language probability of the auxiliary language test voice;
and generating the fusion feature of the target language test voice according to the voice feature of the target language test voice, the feature fusion weight of the target language test voice, the voice feature of the auxiliary language test voice and the feature fusion weight of the auxiliary language test voice.
17. The method of claim 12, further comprising:
generating phonemes of the target language test voice and pronunciation attributes of the target language test voice according to the target language test voice;
generating phonemes of the auxiliary language test voice and pronunciation attributes of the auxiliary language test voice according to the auxiliary language test voice;
and determining the pronunciation error of the user to be recognized according to the phoneme of the target language test voice, the pronunciation attribute of the target language test voice, the phoneme of the auxiliary language test voice, the pronunciation attribute of the auxiliary language test voice, the first standard phoneme, the first standard pronunciation attribute, the second standard phoneme and the second standard pronunciation attribute.
18. The method according to claim 17, wherein said generating phonemes for said test utterance in the auxiliary language and pronunciation attributes for said test utterance in the auxiliary language based on said test utterance in the auxiliary language comprises:
extracting acoustic features of the auxiliary language test voice from the auxiliary language test voice;
inputting the acoustic characteristics of the auxiliary language test voice and the language identification of the auxiliary language test voice into a pre-constructed second model to obtain the phonemes of the auxiliary language test voice output by the second model and the pronunciation attributes of the auxiliary language test voice; the second model is obtained by training according to the acoustic features of the auxiliary language training voice, the actual phonemes of the auxiliary language training voice and the actual pronunciation attributes of the auxiliary language training voice.
19. The method according to claim 17, wherein the pronunciation error of the user to be recognized comprises a phoneme error of the user to be recognized and/or a pronunciation attribute error of the user to be recognized;
the phoneme error determining process of the user to be identified comprises the following steps: comparing the phoneme of the target language test voice with the first standard phoneme to obtain a target language phoneme comparison result; comparing the phoneme of the auxiliary language test voice with the second standard phoneme to obtain an auxiliary language phoneme comparison result; determining phoneme errors of the user to be identified according to the target language phoneme comparison result and the auxiliary language phoneme comparison result;
the process for determining the pronunciation attribute error of the user to be identified comprises the following steps: comparing the pronunciation attribute of the target language test voice with the first standard pronunciation attribute to obtain a target language pronunciation attribute comparison result; comparing the pronunciation attribute of the auxiliary language test voice with the second standard pronunciation attribute to obtain an auxiliary language pronunciation attribute comparison result; and determining that the pronunciation attribute of the user to be identified is wrong according to the comparison result of the target language pronunciation attribute and the comparison result of the auxiliary language pronunciation attribute.
20. The method according to claim 12, wherein the determining the pronunciation defect generation reason of the user to be recognized according to the second recognition result comprises:
when it is determined that the second recognition result meets the second condition, determining a first reason as the pronunciation defect generation reason of the user to be recognized; the first reason is that the language of the auxiliary language test voice has an excessive influence on the user to be recognized.
21. A pronunciation defect recognition model training apparatus, the apparatus comprising:
the device comprises a first acquisition unit, a second acquisition unit and a control unit, wherein the first acquisition unit is used for acquiring a voice sample and a defect label of the voice sample; wherein the defect label of the voice sample is used for indicating whether the provider of the voice sample has pronunciation defect;
the first generating unit is used for generating the voice characteristics of the voice sample according to the voice sample; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
and the model training unit is used for training a pronunciation defect recognition model according to the voice characteristics of the voice sample and the defect label of the voice sample.
22. A pronunciation defect recognition apparatus, the apparatus comprising:
the second acquisition unit is used for acquiring a target language test voice of the user to be recognized;
the second generating unit is used for generating the voice characteristics of the target language test voice according to the target language test voice; wherein the speech features include acoustic features, phoneme features, and pronunciation attribute features;
the first recognition unit is used for inputting the voice characteristics of the target language test voice into a pre-trained pronunciation defect recognition model to obtain a first recognition result output by the pronunciation defect recognition model; wherein the pronunciation defect recognition model is trained by the pronunciation defect recognition model training method of claims 1-5;
and the first determining unit is used for determining whether the user to be recognized has pronunciation defects according to the first recognition result.
CN202011519662.1A 2020-12-21 2020-12-21 Pronunciation defect recognition model training method and pronunciation defect recognition method Active CN112687291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011519662.1A CN112687291B (en) 2020-12-21 2020-12-21 Pronunciation defect recognition model training method and pronunciation defect recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011519662.1A CN112687291B (en) 2020-12-21 2020-12-21 Pronunciation defect recognition model training method and pronunciation defect recognition method

Publications (2)

Publication Number Publication Date
CN112687291A true CN112687291A (en) 2021-04-20
CN112687291B CN112687291B (en) 2023-12-01

Family

ID=75449968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011519662.1A Active CN112687291B (en) 2020-12-21 2020-12-21 Pronunciation defect recognition model training method and pronunciation defect recognition method

Country Status (1)

Country Link
CN (1) CN112687291B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6434521B1 (en) * 1999-06-24 2002-08-13 Speechworks International, Inc. Automatically determining words for updating in a pronunciation dictionary in a speech recognition system
US20030229497A1 (en) * 2000-04-21 2003-12-11 Lessac Technology Inc. Speech recognition method
US20090305203A1 (en) * 2005-09-29 2009-12-10 Machi Okumura Pronunciation diagnosis device, pronunciation diagnosis method, recording medium, and pronunciation diagnosis program
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN104464757A (en) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 Voice evaluation method and device
CN110415725A (en) * 2019-07-15 2019-11-05 北京语言大学 Use the method and system of first language data assessment second language pronunciation quality
CN111833859A (en) * 2020-07-22 2020-10-27 科大讯飞股份有限公司 Pronunciation error detection method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BOYU SI ET AL.: "Automatic Assessment of Putonghua Articulation and Pronunciation Disorder" *
YAN KE; WEI SI; DAI LIRONG: "Acoustic Model Optimization Algorithm for Pronunciation Quality Evaluation", Journal of Chinese Information Processing, no. 01 *
XU LINA; LI FENG; HU MINGFANG; GAO NAN; LYU ZIYUAN: "Speech Characteristics and Rehabilitation Training of Lateralized Articulation in Patients with Functional Articulation Disorder", Chinese Journal of Rehabilitation Theory and Practice, no. 05 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393830A (en) * 2021-06-11 2021-09-14 腾讯音乐娱乐科技(深圳)有限公司 Hybrid acoustic model training and lyric timestamp generation method, device and medium
CN113393830B (en) * 2021-06-11 2022-10-11 腾讯音乐娱乐科技(深圳)有限公司 Hybrid acoustic model training and lyric timestamp generation method, device and medium
CN114566256A (en) * 2022-01-18 2022-05-31 华东师范大学 Correction scheme generation system for consonant d dysarthria
CN114566259A (en) * 2022-01-18 2022-05-31 华东师范大学 Correction scheme generation system for m dysarthria of initial consonant

Also Published As

Publication number Publication date
CN112687291B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
US10665231B1 (en) Real time machine learning-based indication of whether audio quality is suitable for transcription
US11222627B1 (en) Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
CN107958673B (en) Spoken language scoring method and device
CN109192224B (en) Voice evaluation method, device and equipment and readable storage medium
CN106782603B (en) Intelligent voice evaluation method and system
WO2018192186A1 (en) Speech recognition method and apparatus
CN112687291B (en) Pronunciation defect recognition model training method and pronunciation defect recognition method
KR102186641B1 (en) Method for examining applicant through automated scoring of spoken answer based on artificial intelligence
US11615787B2 (en) Dialogue system and method of controlling the same
CN111028842A (en) Method and equipment for triggering voice interaction response
CN114420169B (en) Emotion recognition method and device and robot
Kumar et al. Machine learning based speech emotions recognition system
CN109300468B (en) Voice labeling method and device
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
CN112614514B (en) Effective voice fragment detection method, related equipment and readable storage medium
CN118173092A (en) Online customer service platform based on AI voice interaction
CN110809796B (en) Speech recognition system and method with decoupled wake phrases
CN114220419A (en) Voice evaluation method, device, medium and equipment
CN113327583A (en) Optimal mapping cross-language tone conversion method and system based on PPG consistency
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN117765932A (en) Speech recognition method, device, electronic equipment and storage medium
Banothu et al. Speech Based Access of Kisan Information System in Telugu Language
CN114141271B (en) Psychological state detection method and system
CN111078937B (en) Voice information retrieval method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant