CN111916108B - Voice evaluation method and device


Info

Publication number: CN111916108B
Application number: CN202010723408.7A
Authority: CN (China)
Prior art keywords: speech, voice, evaluated, score, phoneme
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111916108A
Inventors: 冯大航, 陈孝良
Assignee: Beijing SoundAI Technology Co Ltd
Application filed by Beijing SoundAI Technology Co Ltd

Classifications

    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L25/60: Speech or voice analysis techniques specially adapted for comparison or discrimination, for measuring the quality of voice signals
    • G10L25/78: Detection of presence or absence of voice signals


Abstract

The application relates to the technical field of speech processing, and in particular to a speech evaluation method and device. A speech to be evaluated is acquired; based on a trained evaluation model, with the speech to be evaluated as an input parameter, each phone of the speech is recognized, the phone feature similarity between each phone and a corresponding preset standard phone is determined, and the fluency feature similarity of the speech to be evaluated is determined from the speech to be evaluated and a corresponding preset standard speech, where a phone denotes the phoneme corresponding to the smallest unit of pronunciation. An evaluation result of the speech to be evaluated is then determined from the phone feature similarity and the fluency feature similarity. Because the evaluation result combines the phone feature similarity and the fluency feature similarity of the speech to be evaluated, both the efficiency and the accuracy of speech evaluation are improved.

Description

Voice evaluation method and device
Technical Field
The present application relates to the field of speech evaluation technologies, and in particular, to a speech evaluation method and apparatus.
Background
At present, after an intelligent device performs speech synthesis and generates a synthesized speech, the quality of the synthesized speech needs to be evaluated and scored. In the prior art, the speech to be evaluated is usually evaluated manually. However, manual evaluation is often subjective, so the evaluation score of the speech to be evaluated is inaccurate; and because each speech must be scored by hand one by one, this evaluation mode is also inefficient.
Disclosure of Invention
The embodiments of the application provide a speech evaluation method and device to improve the efficiency and accuracy of speech evaluation.
The embodiments of the application provide the following specific technical solutions:
a speech evaluation method comprises the following steps:
acquiring a speech to be evaluated;
based on a trained evaluation model, with the speech to be evaluated as an input parameter, recognizing each phone of the speech to be evaluated, determining the phone feature similarity between each phone and a corresponding preset standard phone, and determining the fluency feature similarity of the speech to be evaluated according to the speech to be evaluated and a corresponding preset standard speech, wherein a phone denotes the phoneme corresponding to the smallest unit of pronunciation;
and determining an evaluation result of the speech to be evaluated according to the phone feature similarity and the fluency feature similarity.
Optionally, the method further comprises: acquiring a speech text corresponding to the speech to be evaluated;
the recognizing each phone of the speech to be evaluated based on the trained evaluation model, with the speech to be evaluated as an input parameter, specifically comprises:
based on the trained evaluation model, with the speech to be evaluated and the speech text as input parameters, recognizing each phone of the speech to be evaluated based on the speech text.
Optionally, determining the phone feature similarity between each phone and the corresponding preset standard phone specifically comprises:
determining the phone features corresponding to each phone respectively;
classifying the phones according to the determined phone features, and determining the preset phone category to which each phone belongs;
and determining the phone feature similarity between each phone and the preset standard phone included in the corresponding preset phone category.
Optionally, determining the evaluation result of the speech to be evaluated specifically comprises:
determining a pronunciation score of the speech to be evaluated according to the phone feature similarity;
determining a fluency score of the speech to be evaluated according to the fluency feature similarity;
performing a weighted average of the pronunciation score and the fluency score to obtain a final evaluation score of the speech to be evaluated;
and obtaining the evaluation result of the speech to be evaluated according to the final evaluation score.
Optionally, obtaining the evaluation result of the speech to be evaluated according to the final evaluation score specifically comprises:
if the final evaluation score is greater than or equal to a preset first score threshold, determining that the grade of the speech to be evaluated is a first grade;
if the final evaluation score is smaller than the preset first score threshold and greater than or equal to a preset second score threshold, determining that the grade of the speech to be evaluated is a second grade, wherein the preset first score threshold is greater than the preset second score threshold;
and if the final evaluation score is smaller than the preset second score threshold, determining that the grade of the speech to be evaluated is a third grade, wherein the speech quality of the first grade is higher than that of the second grade, and the speech quality of the second grade is higher than that of the third grade.
Optionally, the evaluation model is trained as follows:
acquiring a standard speech sample set and the standard speech text corresponding to each standard speech sample in the set;
performing speech simulation on each standard speech sample to obtain corresponding simulated speech samples;
inputting each standard speech sample, its corresponding standard speech text, and the simulated speech samples into the evaluation model for training. The phones of each standard speech sample and of each simulated speech sample are recognized by a feature module of the evaluation model; the phone feature similarity between each phone of a standard speech sample and the corresponding phone of the simulated speech sample is determined by a phone network module of the evaluation model to obtain a simulated pronunciation score of the simulated speech sample; the fluency feature similarity of the simulated speech is determined by a convolutional network module of the evaluation model to obtain a simulated fluency score of the simulated speech sample; and the final evaluation score of each simulated speech sample is determined from the simulated pronunciation scores and simulated fluency scores, until the objective function of the evaluation model converges, yielding the trained evaluation model, wherein the objective function minimizes the cross-entropy between the simulated speech samples and the standard speech samples.
Optionally, performing speech simulation on each standard speech sample in the standard speech sample set to obtain the simulated speech samples specifically comprises:
performing sound-quality variation simulation on each standard speech sample according to a preset first variation intensity coefficient to obtain the corresponding simulated speech samples, wherein the pronunciation variation simulation comprises at least one, or any combination, of the following processing modes: speech noise addition, spectral denoising, spectral warping, and fundamental-frequency adjustment;
and/or performing fluency variation simulation on each standard speech sample according to a preset second variation intensity coefficient to obtain the corresponding simulated speech samples, wherein the fluency variation simulation comprises at least one, or any combination, of the following processing modes: speech warping, lengthening or shortening of pronunciation duration, spectral warping, and fundamental-frequency warping.
A speech evaluation apparatus comprising:
a first acquisition module, configured to acquire a speech to be evaluated;
an evaluation module, configured to recognize each phone of the speech to be evaluated based on a trained evaluation model with the speech to be evaluated as an input parameter, determine the phone feature similarity between each phone and a corresponding preset standard phone, and determine the fluency feature similarity of the speech to be evaluated according to the speech to be evaluated and a corresponding preset standard speech, wherein a phone denotes the phoneme corresponding to the smallest unit of pronunciation;
and a determining module, configured to determine the evaluation result of the speech to be evaluated according to the phone feature similarity and the fluency feature similarity.
Optionally, the first acquisition module is further configured to acquire a speech text corresponding to the speech to be evaluated;
the evaluation module is specifically configured to:
based on the trained evaluation model, with the speech to be evaluated and the speech text as input parameters, recognize each phone of the speech to be evaluated based on the speech text.
Optionally, when determining the phone feature similarity between each phone and the corresponding preset standard phone, the evaluation module is specifically configured to:
determine the phone features corresponding to each phone respectively;
classify the phones according to the determined phone features, and determine the preset phone category to which each phone belongs;
and determine the phone feature similarity between each phone and the preset standard phone included in the corresponding preset phone category.
Optionally, when determining the evaluation result of the speech to be evaluated, the determining module is specifically configured to:
determine a pronunciation score of the speech to be evaluated according to the phone feature similarity;
determine a fluency score of the speech to be evaluated according to the fluency feature similarity;
perform a weighted average of the pronunciation score and the fluency score to obtain a final evaluation score of the speech to be evaluated;
and obtain the evaluation result of the speech to be evaluated according to the final evaluation score.
Optionally, when obtaining the evaluation result of the speech to be evaluated according to the final evaluation score, the determining module is specifically configured to:
if the final evaluation score is greater than or equal to a preset first score threshold, determine that the grade of the speech to be evaluated is a first grade;
if the final evaluation score is smaller than the preset first score threshold and greater than or equal to a preset second score threshold, determine that the grade of the speech to be evaluated is a second grade, wherein the preset first score threshold is greater than the preset second score threshold;
and if the final evaluation score is smaller than the preset second score threshold, determine that the grade of the speech to be evaluated is a third grade, wherein the speech quality of the first grade is higher than that of the second grade, and the speech quality of the second grade is higher than that of the third grade.
Optionally, for training the evaluation module, the apparatus further comprises:
a second acquisition module, configured to acquire a standard speech sample set and the standard speech text corresponding to each standard speech sample in the set;
a simulation module, configured to perform speech simulation on each standard speech sample to obtain corresponding simulated speech samples;
a processing module, configured to input each standard speech sample, its corresponding standard speech text, and the simulated speech samples into the evaluation model for training; recognize the phones of each standard speech sample and of each simulated speech sample through the feature module of the evaluation model; determine, through the phone network module of the evaluation model, the phone feature similarity between each phone of a standard speech sample and the corresponding phone of the simulated speech sample to obtain a simulated pronunciation score of the simulated speech sample; determine, through the convolutional network module of the evaluation model, the fluency feature similarity of the simulated speech to obtain a simulated fluency score of the simulated speech sample; and determine the final evaluation score of each simulated speech sample from the simulated pronunciation scores and simulated fluency scores, until the objective function of the evaluation model converges, yielding the trained evaluation model, wherein the objective function minimizes the cross-entropy between the simulated speech samples and the standard speech samples.
Optionally, the simulation module is specifically configured to:
perform sound-quality variation simulation on each standard speech sample according to a preset first variation intensity coefficient to obtain the corresponding simulated speech samples, wherein the pronunciation variation simulation comprises at least one, or any combination, of the following processing modes: speech noise addition, spectral denoising, spectral warping, and fundamental-frequency adjustment;
and/or perform fluency variation simulation on each standard speech sample according to a preset second variation intensity coefficient to obtain the corresponding simulated speech samples, wherein the fluency variation simulation comprises at least one, or any combination, of the following processing modes: speech warping, lengthening or shortening of pronunciation duration, spectral warping, and fundamental-frequency warping.
An electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above speech evaluation method when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned speech evaluation method.
In the embodiments of the application, a speech to be evaluated is acquired; based on a trained evaluation model, each phone of the speech to be evaluated is recognized with the speech as an input parameter; the phone feature similarity between each phone and a corresponding preset standard phone is determined; the fluency feature similarity of the speech to be evaluated is determined from the speech to be evaluated and a corresponding preset standard speech; and the evaluation result is determined from the phone feature similarity and the fluency feature similarity. In this way, once the speech to be evaluated is acquired, its phone feature similarity and fluency feature similarity are obtained from the trained evaluation model, so the evaluation is automatic and efficiency is improved. Moreover, because the evaluation compares the phones of the speech against the corresponding preset standard phones and compares the speech as a whole against the preset standard speech, pronunciation and fluency are evaluated simultaneously, which improves the accuracy of the evaluation.
Drawings
FIG. 1 is a flowchart of a speech evaluation method in an embodiment of the present application;
FIG. 2 is a flowchart of an evaluation model training method in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an evaluation model in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a conditional GAN in an embodiment of the present application;
FIG. 5 is a schematic diagram of the optimization of a GAN model in an embodiment of the present application;
FIG. 6 is a flowchart of another speech evaluation method in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a 2-layer GRU network in an embodiment of the present application;
FIG. 8 is a flowchart of a method for evaluating speech in combination with a speech text in an embodiment of the present application;
FIG. 9 is a schematic diagram of a fluency network in an embodiment of the present application;
FIG. 10 is a diagram of an RNN scoring network architecture in an embodiment of the present application;
FIG. 11 is a structural diagram of a GRU network in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a speech evaluation device in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, speech synthesis is applied very widely. For example, a speech text to be synthesized is input into an intelligent device, which synthesizes it and outputs the synthesized speech. After the intelligent device performs speech synthesis and generates the synthesized speech, the quality of that speech needs to be evaluated and scored so that the relevant staff can adjust the performance of the intelligent device in real time.
In the embodiments of the application, a speech to be evaluated is acquired; based on a trained evaluation model, each phone of the speech is recognized with the speech as an input parameter; and the evaluation result is determined by computing the phone feature similarity between each phone and a corresponding preset standard phone and the fluency feature similarity between the speech to be evaluated and a corresponding preset standard speech. The speech to be evaluated is thus evaluated and scored by the trained evaluation model, converting the manual evaluation mode into an automatic one and improving the efficiency of the evaluation. Because the evaluation compares the phones of the speech against the preset standard phones and the speech as a whole against the preset standard speech, the accuracy of the evaluation is also improved.
Based on the foregoing embodiments, referring to FIG. 1, a flowchart of a speech evaluation method in an embodiment of the present application, the method specifically includes:
Step 100: acquiring a speech to be evaluated.
In the embodiments of the application, the intelligent device synthesizes the speech text to be synthesized into a synthesized speech, i.e., the speech to be evaluated, and a server can then obtain the speech to be evaluated generated by the intelligent device. For example, the speech text "good weather today" is input into the intelligent device; the intelligent device generates the synthesized speech to be evaluated, "good weather today", based on a trained speech synthesis model and sends it to the server; the server then obtains the speech to be evaluated and performs the subsequent evaluation. As a further example, the intelligent device may itself evaluate the speech to be evaluated through the evaluation model, which it may obtain from a server; the execution subject is not limited in the embodiments of the application.
Step 110: based on the trained evaluation model, recognizing each phone of the speech to be evaluated with the speech as an input parameter, determining the phone feature similarity between each phone and a corresponding preset standard phone, and determining the fluency feature similarity of the speech to be evaluated according to the speech to be evaluated and a corresponding preset standard speech.
Here, a phone denotes the phoneme corresponding to the smallest unit of pronunciation.
Specifically, based on the trained evaluation model, each phone of the speech to be evaluated is recognized with the speech as an input parameter, and the phone feature similarity between each phone and the corresponding preset standard phone is determined. While the phone feature similarity is being obtained, the speech to be evaluated is compared with the corresponding preset standard speech to determine its fluency feature similarity. Pronunciation and fluency are therefore evaluated simultaneously, which improves the efficiency of the evaluation.
In the embodiments of the application, when the speech to be evaluated is acquired, the speech text corresponding to it may also be acquired. In that case, recognizing each phone of the speech to be evaluated based on the trained evaluation model, with the speech as an input parameter, specifically comprises:
based on the trained evaluation model, with the speech to be evaluated and the speech text as input parameters, recognizing each phone of the speech to be evaluated based on the speech text.
Step 120: determining the evaluation result of the speech to be evaluated according to the phone feature similarity and the fluency feature similarity.
In the embodiments of the application, the evaluation result of the speech to be evaluated is determined from its phone feature similarity and fluency feature similarity. How these two similarities are obtained is explained in detail below, in two parts:
First part: in the embodiments of the application, determining the phone feature similarity between each phone and the corresponding preset standard phone specifically comprises:
S1: determining the phone features corresponding to each phone respectively.
In the embodiments of the application, after the speech to be evaluated is acquired, it is input into the trained evaluation model. Since the speech to be evaluated is composed of phones, each phone of the speech can be recognized and the phone features corresponding to each phone determined respectively.
For example, when the speech to be evaluated "the tiger cub plays with the pet dog" is acquired, each phone of the speech, "l ao h u iou z ai v ch ong u q van uan sh ua", is recognized, and the phone features corresponding to each phone are determined respectively; each phone corresponds to one phone feature, so if the speech to be evaluated consists of 15 phones, feature extraction on each phone yields 15 phone features in total.
Common speech features include: the log-Mel spectrum, the fundamental frequency F0, Linear Predictive Coding (LPC) features, filter-bank (fBank) features, Mel-Frequency Cepstral Coefficient (MFCC) features, and the like.
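As an illustration only (not part of the patent), the following sketch extracts the features listed above with the librosa library; the input file name and the frame settings are hypothetical.

```python
# A minimal sketch, assuming librosa; the file path and frame settings are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("speech_to_evaluate.wav", sr=16000)  # hypothetical input file

# log-Mel spectrum
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)                        # shape (80, n_frames)

# MFCC features
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, n_frames)

# fundamental frequency F0 via pYIN; NaN where the frame is unvoiced
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=50, fmax=500, sr=sr)
```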
S2: and classifying the phones according to the determined phone characteristics, and respectively determining the preset phone categories to which the phones belong.
In the embodiment of the application, after the characteristics of each phoneme of the speech to be evaluated are determined, the determined characteristics of each phoneme are compared with the characteristics corresponding to the preset phoneme categories, the phoneme category corresponding to the characteristics of each phoneme characteristic which are most similar to the characteristics corresponding to the preset phoneme categories is determined, the phonemes are classified into the corresponding phoneme categories, and the preset phoneme categories to which the phonemes belong are determined respectively.
The phonon characteristics are voice parameters of the phonons and are used for classifying the phonons, and the characteristics of each class of phonons are similar, so that the classification of the phonons can be realized. For example, if the obtained speech to be evaluated is a tiger and tiger, the phones corresponding to the speech to be evaluated are (1) "l", (2) "ao", (3) "l", (4) "ao", (5) "l", (6) "ao", the phone features between each phone "l" are similar, and the phone features between each phone "ao" are also similar, so that the phones can be classified, the phone "l" of (1), (3), (5) are all classified into the corresponding preset "l" category, and the phone "ao" of (2), (4), (6) are all classified into the corresponding preset "ao" category.
The number of preset phonon categories is determined according to the number of the categories of the phonons, and the phonons are divided into the categories by the number of different phonons.
S3: and respectively determining the similarity of the phonetic character between each phoneme and the preset standard phoneme included in the corresponding preset phoneme category.
In the embodiment of the application, according to the phonon voice features corresponding to each phonon, the voice features of each phonon are respectively compared with the phonon voice features of the preset standard phonons included in the corresponding preset phonon category in a summary manner, and the phonon voice feature similarity between each phonon and the preset standard phonons included in the corresponding preset phonon category is determined.
If the number of the phones in the phone category is one, the phones are compared with the preset standard phones in the corresponding preset phone category to determine the phone voice feature similarity between the phones and the corresponding preset standard phones, for example, if the phone to be evaluated is a tiger, the phone to be evaluated is a "l", "ao", "h" or "u", and at this time, the phone to be evaluated in each phone category is one.
If the number of the phones of the to-be-evaluated voice in the phone category is multiple, the voice features of each phone are respectively compared with the voice features of the preset standard phones in the corresponding preset phone category, the phone voice feature similarity between each phone voice feature and the preset standard phone voice feature in the corresponding preset phone category is determined, for example, the to-be-evaluated voice "tiger and tiger" includes 3 phones "l" and 3 phones "ao", the phone voice feature of the three phones "l" is respectively compared with the phone voice feature of the preset standard phone, the phone voice feature of the three phones "ao" is respectively compared with the phone voice feature of the preset standard phone, and the phone voice feature similarity between the phone and the corresponding preset standard phone is determined.
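The similarity measure itself is not specified by the patent; the following sketch assumes cosine similarity between fixed-length phone feature vectors, and the helper names are hypothetical.

```python
# A sketch (an assumption, not the patent's exact computation) of comparing each
# phone's feature vector with the preset standard phone of its category.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def phone_similarities(phone_feats, phone_labels, standard_phones):
    """phone_feats: list of feature vectors, one per recognized phone.
    phone_labels: list of category names, e.g. ["l", "ao", "h", "u"].
    standard_phones: dict mapping category name -> standard feature vector."""
    return [cosine_similarity(f, standard_phones[lab])
            for f, lab in zip(phone_feats, phone_labels)]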
Second part: in the embodiments of the application, determining the fluency feature similarity of the speech to be evaluated according to the speech to be evaluated and the corresponding preset standard speech specifically comprises:
comparing the speech to be evaluated with the corresponding preset standard speech to determine the fluency feature similarity between them.
Specifically, after the speech to be evaluated is input into the trained evaluation model, feature extraction is performed on it, the speech features of the speech to be evaluated are compared with those of the corresponding preset standard speech, and the fluency feature similarity between the two is determined.
In the embodiments of the application, after the speech to be evaluated is input into the trained evaluation model, feature extraction yields the speech features of the speech to be evaluated, which are then divided into frames; each frame of the speech to be evaluated is compared with the corresponding frame of the preset standard speech, each frame yields one fluency feature similarity, and the fluency feature similarity of the speech to be evaluated is obtained from them.
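Again as an assumption, a frame-by-frame cosine comparison pooled by averaging could look like the following; alignment of the two utterances to equal frame counts is presumed done beforehand.

```python
# A sketch of the frame-by-frame comparison described above, assuming the speech
# to be evaluated and the preset standard speech are aligned to the same number
# of frames; averaging the per-frame values into one similarity is an assumption.
import numpy as np

def fluency_similarity(eval_frames: np.ndarray, std_frames: np.ndarray) -> float:
    """eval_frames, std_frames: (n_frames, feat_dim) feature matrices."""
    num = np.sum(eval_frames * std_frames, axis=1)
    den = (np.linalg.norm(eval_frames, axis=1) *
           np.linalg.norm(std_frames, axis=1) + 1e-8)
    per_frame = num / den              # one similarity per frame
    return float(per_frame.mean())     # pooled fluency feature similarity
```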
It should be noted that, in the embodiments of the application, the steps of determining the phone feature similarity and the fluency feature similarity of the speech to be evaluated are executed simultaneously, so the speech can be evaluated in parallel and the efficiency of the evaluation is improved.
After the phone feature similarity and the fluency feature similarity are obtained, determining the evaluation result of the speech to be evaluated from them specifically comprises:
S1: determining the pronunciation score of the speech to be evaluated according to the phone feature similarity.
In the embodiments of the application, after the phone feature similarity of each phone is determined, the pronunciation score of each phone is determined from it, and the per-phone pronunciation scores are then combined by a weighted average to obtain the pronunciation score of the speech to be evaluated.
When deriving a phone's pronunciation score from its feature similarity, a preset mapping between phone feature similarity and pronunciation score can be used; for example, a phone feature similarity of 80% corresponds to a pronunciation score of 80.
It should be noted that the embodiments of the application do not limit how the per-phone pronunciation scores are combined. For example, the pronunciation scores of the phones may simply be added to obtain the pronunciation score of the speech to be evaluated.
As another example, the highest and lowest of the per-phone pronunciation scores may be discarded and the remaining scores combined by a weighted average to obtain the pronunciation score of the speech to be evaluated, which can further improve the accuracy of the evaluation.
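A sketch of this second aggregation strategy follows; the uniform weights are an assumption, since the text only says "weighted average".

```python
# A sketch: drop the highest and lowest phone scores, then average the rest.
# The 100x similarity-to-score mapping follows the example above; the uniform
# weights are an assumption.
def pronunciation_score(phone_sims):
    scores = [100.0 * s for s in phone_sims]   # e.g. similarity 0.8 -> score 80
    if len(scores) > 2:
        scores.remove(max(scores))
        scores.remove(min(scores))
    return sum(scores) / len(scores)
```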
S2: and determining the fluency score of the speech to be evaluated according to the fluency speech feature similarity.
In the embodiment of the application, after the fluency feature similarity of the voice to be evaluated is obtained, the fluency score of the voice to be evaluated is determined according to the fluency voice feature similarity and the corresponding relation between the preset fluency voice feature similarity and the fluency score.
S3: and carrying out weighted average on the pronunciation score and the fluency score to obtain a final evaluation score of the speech to be evaluated.
S4: and obtaining an evaluation result of the speech to be evaluated according to the final evaluation score.
In the embodiments of the application, step S4 specifically comprises:
A1: if the final evaluation score is greater than or equal to the preset first score threshold, determining that the grade of the speech to be evaluated is the first grade.
In the embodiments of the application, the preset first score threshold can be set according to actual requirements; for example, when the final evaluation score is out of 100 points, the preset first score threshold may be set to 80 points. In that case, if the final evaluation score is greater than or equal to 80 points, the grade of the speech to be evaluated is determined to be the first grade.
The first grade indicates that the quality of the speech to be evaluated is high, i.e., its pronunciation is accurate and its fluency is high; the first grade may be, for example, a high-quality grade.
A2: if the final evaluation score is smaller than the preset first score threshold and greater than or equal to the preset second score threshold, determining that the grade of the speech to be evaluated is the second grade.
Here, the preset first score threshold is greater than the preset second score threshold.
In the embodiments of the application, the preset second score threshold can be set according to actual requirements; for example, when the final evaluation score is out of 100 points and the preset first score threshold is 80 points, the preset second score threshold may be set to 60 points. In that case, if the final evaluation score is 70 points, it is smaller than the preset first score threshold and greater than the preset second score threshold, and the grade of the speech to be evaluated is determined to be the second grade.
The second grade indicates that the quality of the speech to be evaluated is average, i.e., its pronunciation quality and fluency are average; the second grade may be, for example, a good grade.
A3: if the final evaluation score is smaller than the preset second score threshold, determining that the grade of the speech to be evaluated is the third grade.
Here, the speech quality of the first grade is higher than that of the second grade, and the speech quality of the second grade is higher than that of the third grade.
In the embodiments of the application, for example, when the final evaluation score is out of 100 points and the preset second score threshold is 60 points, if the final evaluation score of the speech to be evaluated is 40 points, it is smaller than the preset second score threshold and the grade is determined to be the third grade.
The third grade indicates that the quality of the speech to be evaluated is poor, i.e., its pronunciation is inaccurate and its fluency is poor; it may be, for example, a poor grade.
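Putting the weighting and grading together, a sketch under assumed 0.6/0.4 weights (the patent does not fix them) and the example thresholds of 80 and 60 points:

```python
# A sketch of the weighted average and grading described above; the 0.6/0.4
# weights are assumptions, the 80/60 thresholds follow the example values.
def evaluate(pron_score: float, fluency_score: float,
             w_pron: float = 0.6, w_flu: float = 0.4) -> tuple[float, str]:
    final = w_pron * pron_score + w_flu * fluency_score   # weighted average
    if final >= 80:        # preset first score threshold
        grade = "first grade (high quality)"
    elif final >= 60:      # preset second score threshold
        grade = "second grade (good)"
    else:
        grade = "third grade (poor)"
    return final, grade
```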
Further, to refine the quality evaluation of the speech to be evaluated, the number of quality grades can be set according to actual requirements; this is not limited in the embodiments of the application.
In the embodiments of the application, a speech to be evaluated is acquired; based on a trained evaluation model, each phone of the speech is recognized and the phone feature similarity between each phone and a corresponding preset standard phone is determined; at the same time, the fluency feature similarity of the speech to be evaluated is determined from the speech and the corresponding preset standard speech; and the evaluation result is then determined from the phone feature similarity and the fluency feature similarity. The evaluation of the speech to be evaluated is therefore automatic, and its accuracy is improved.
Based on the above embodiments, the evaluation model training method in the embodiments of the application is described in detail below. FIG. 2 is a flowchart of the evaluation model training method in an embodiment of the present application, which specifically includes:
Step 200: acquiring a standard speech sample set and the standard speech text corresponding to each standard speech sample in the set.
In the embodiments of the application, a standard speech sample set comprising a number of standard speech samples is acquired.
Step 210: performing speech simulation on each standard speech sample to obtain the corresponding simulated speech samples.
In the embodiments of the application, the speech simulation of the standard speech samples may adopt any of the following three processing modes.
First processing mode:
Step 210 specifically comprises: performing sound-quality variation simulation on each standard speech sample according to a preset first variation intensity coefficient to obtain the corresponding simulated speech samples.
Here, the pronunciation variation simulation comprises at least one, or any combination, of the following processing modes: speech noise addition, spectral denoising, and spectral warping.
In the embodiments of the application, sound-quality variation simulation is performed on each standard speech sample by means of signal processing, yielding the corresponding simulated speech sample.
The signal processing may be a pronunciation variation simulation, which may be, for example, speech noise addition, spectral denoising, spectral warping, fundamental-frequency adjustment, and the like; this is not limited in the embodiments of the application.
Second processing mode:
Step 210 specifically comprises: performing fluency variation simulation on each standard speech sample according to a preset second variation intensity coefficient to obtain the corresponding simulated speech samples.
Here, the fluency variation simulation comprises at least one, or any combination, of the following processing modes: speech warping, lengthening or shortening of pronunciation duration, spectral warping, and fundamental-frequency warping.
In the embodiments of the application, fluency variation simulation is performed on each standard speech sample by means of signal processing, yielding the corresponding simulated speech sample.
The signal processing may be an intonation variation simulation, which may be, for example, shortening of the pronunciation duration, fundamental-frequency warping, and the like; this is not limited in the embodiments of the application.
Third processing mode:
Step 210 specifically comprises:
S1: performing sound-quality variation simulation on each standard speech sample according to the preset first variation intensity coefficient to obtain the corresponding sound-quality-simulated speech samples.
S2: performing fluency variation simulation on each sound-quality-simulated speech sample according to the preset second variation intensity coefficient to obtain the corresponding simulated speech samples.
In the embodiments of the application, sound-quality variation simulation is first performed on each standard speech sample according to the preset first variation intensity coefficient to obtain a sound-quality-simulated speech sample, and fluency variation simulation is then performed on that sample according to the preset second variation intensity coefficient to obtain the corresponding simulated speech sample.
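A sketch of this two-stage simulation using librosa signal-processing effects follows; how the variation intensity coefficients map to concrete noise, stretch, and pitch parameters is an assumption.

```python
# A sketch of the two-stage simulation above: sound-quality variation (noise
# addition) followed by fluency variation (time stretching and pitch/F0 shifting).
# The parameter mapping of the intensity coefficients c1 and c2 is an assumption.
import numpy as np
import librosa

def simulate(y: np.ndarray, sr: int, c1: float, c2: float) -> np.ndarray:
    # stage 1: sound-quality variation, intensity controlled by c1
    noise = np.random.randn(len(y)).astype(y.dtype) * np.abs(y).mean()
    noisy = y + c1 * noise
    # stage 2: fluency variation, intensity controlled by c2
    stretched = librosa.effects.time_stretch(noisy, rate=1.0 + c2)        # duration change
    shifted = librosa.effects.pitch_shift(stretched, sr=sr, n_steps=4 * c2)  # F0 change
    return shifted
```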
Step 220: inputting each standard speech sample, its corresponding standard speech text, and the simulated speech samples into the evaluation model for training. The phones of each standard speech sample and of each simulated speech sample are recognized by the feature module of the evaluation model; the phone feature similarity between each phone of a standard speech sample and the corresponding phone of the simulated speech sample is determined by the phone network module of the evaluation model to obtain the simulated pronunciation score of the simulated speech sample; the fluency feature similarity of the simulated speech is determined by the convolutional network module of the evaluation model to obtain the simulated fluency score of the simulated speech sample; and the final evaluation score of each simulated speech sample is determined from the simulated pronunciation scores and simulated fluency scores, until the objective function of the evaluation model converges and the trained evaluation model is obtained.
Here, the objective function minimizes the cross-entropy between the simulated speech samples and the standard speech samples.
In the embodiments of the application, step 220 specifically comprises:
S1: inputting each standard speech sample, its corresponding standard speech text, and the simulated speech samples into the evaluation model for training.
S2: recognizing the phones of each standard speech sample and of each simulated speech sample through the feature module of the evaluation model.
S3: determining, through the phone network module of the evaluation model, the phone feature similarity between each phone of a standard speech sample and the corresponding phone of the simulated speech sample, and obtaining the simulated pronunciation score of the simulated speech sample.
S4: determining, through the convolutional network module of the evaluation model, the fluency feature similarity of the simulated speech, and obtaining the simulated fluency score of the simulated speech sample.
S5: determining the final evaluation score of each simulated speech sample from the simulated pronunciation scores and simulated fluency scores, until the objective function of the evaluation model converges, and obtaining the trained evaluation model.
In the embodiments of the application, after the simulated pronunciation scores and simulated fluency scores are determined, the final evaluation score of each simulated speech sample is determined from them; training continues until the objective function of the evaluation model converges, and the trained evaluation model is obtained.
The evaluation model may be a Generative Adversarial Network (GAN) model. FIG. 3 is a schematic structural diagram of the evaluation model in an embodiment of the present application. In the figure, the random noise z ~ p(z) corresponds to the preset first variation intensity coefficient and the preset second variation intensity coefficient. The generator G performs sound-quality variation simulation on each standard speech sample according to the preset first variation intensity coefficient, and fluency variation simulation according to the preset second variation intensity coefficient, producing the simulated speech samples G_θ(z). The real data x ~ p(x) are the acquired speech to be evaluated. The discriminator D, whose output is true/false, takes the speech to be evaluated as an input parameter, recognizes each phone, determines the phone feature similarity between each phone and the corresponding preset standard phone, determines the fluency feature similarity of the speech to be evaluated from the corresponding preset standard speech, and determines the final evaluation score from the two similarities: if the speech to be evaluated is highly similar to a simulated speech sample, it tends toward false; if it is highly similar to a standard speech sample, it tends toward true. The discriminator is a conditional GAN; FIG. 4 is a schematic structural diagram of the conditional GAN in an embodiment of the present application. During training, the input condition is the speech text corresponding to the standard speech, and only the discriminator is trained; it is used to evaluate the truth degree of a speech sample, where the expected value for a true sample is 1 and the expected value for a false sample, derived from the preset first or second variation intensity coefficient, is 0.
For example: text: tiger #1 cub #2 and #1 pet dog #1 play #4
And (3) pinyin sequence: lao2 hu3 you4 zai3 yu2 chong3 wu4 quan3 wan2 shua3
Phonon and prosodic sequences: l ao2 h u3 iou4 z ai3 v2 ch ong3 u4 q van3 ua 2 sh ua3
The phonon sequence is processed by up-sampling (upsample), and input into a discriminator for training.
In the embodiments of the application, the standard speech samples, the simulated speech samples obtained by speech simulation, and the corresponding speech texts (phone sequences) are input into the evaluation model for training. The larger the variation intensity (variation intensity coefficient) applied to a simulated speech sample, the closer the sample is to noise, and the closer the discriminator's expected output is to 0.
Referring to FIG. 5, a schematic diagram of the optimization of the GAN model in an embodiment of the present application: the parameters of G are held fixed and the parameters of D are optimized; maximizing $V(D, G)$ is equivalent to minimizing $-V(D, G)$, so the loss function of D can be written equivalently as:

$$J^{(D)}(\theta_D, \theta_G) = -\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \mathbb{E}_{\hat{x} \sim p_g}[\log(1 - D(\hat{x}))]$$

wherein $J$ denotes the loss function of the discriminator D, $\theta_D$ the discriminator parameters, $\theta_G$ the generator parameters, $x$ the training data samples, $\hat{x} = G(z)$ the data samples produced by the generator, $\mathbb{E}$ the expected value of the function, $D$ the discriminator in the GAN, $p_g$ the distribution obtained by mapping the noise distribution into the high-dimensional data space through the parameters $\theta_g$, and $G$ the generator in the GAN model.
The GAN objective function (with condition vector $y$) can be expressed as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]$$

wherein $\mathbb{E}(*)$ denotes the expected value over the distribution, $p_{data}(x)$ denotes the distribution of the standard speech samples, $p_z(z)$ is the noise distribution defined in the lower-dimensional space, and $y$ is the condition vector.
Further, after training, the evaluation model is obtained. When it is used to evaluate a speech to be evaluated, the speech and the corresponding speech text (phone sequence) can be input, and the speech is scored by the discriminator.
In the embodiments of the application, by training the evaluation model, the pronunciation and the fluency of the speech to be evaluated are each scored, so the evaluation of the speech is automatic and its accuracy is improved.
Based on the foregoing embodiments, referring to FIG. 6, a flowchart of another speech evaluation method in an embodiment of the present application, the method specifically includes:
Step 600: acquiring a speech to be evaluated.
Step 610: performing feature extraction on the speech to be evaluated through a dimensionality-reduction network, and determining the phone features of the phones of the speech and the speech features of the speech to be evaluated.
Step 620: recognizing the speech features of the speech to be evaluated through the feature recognition module, and recognizing each phone of the speech.
In the embodiments of the application, when only the speech to be evaluated is available, the feature recognition module recognizes the speech features of the speech, identifies its phone sequence and the pronunciation duration of each phone, and thereby recognizes each phone of the speech to be evaluated.
Step 630: inputting the phone features of the speech to be evaluated into the phone network to obtain the pronunciation score of the speech.
In the embodiments of the application, after the phone features are input into the phone network, the phones are classified and the preset phone category of each phone is determined; the phone feature similarity between each phone and the preset standard phone of its category is then determined, yielding a pronunciation score per phone; and the per-phone pronunciation scores are combined by a weighted average to obtain the pronunciation score of the speech to be evaluated.
Step 640: inputting the speech features of the speech to be evaluated into the fluency network to obtain the fluency score of the speech.
In the embodiments of the application, the speech features of the speech to be evaluated are input into the fluency network, the fluency feature similarity of the speech is determined, and the fluency score is determined from the fluency feature similarity.
The fluency network is composed of a 1-dimensional convolution and 2 layers of GRUs, where the 2 GRU layers form a recurrent neural network (RNN) scoring network; fig. 7 is a structural schematic diagram of the 2-layer GRU in the embodiment of the present application.
Further, the fluency network can be formed from this 2-layer GRU network, as sketched below.
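A sketch of such a fluency network in PyTorch; the hidden size, the kernel size, and reading the score from the last frame are assumptions not fixed by the description:

```python
import torch
import torch.nn as nn

class FluencyNetwork(nn.Module):
    """1-dimensional convolution followed by a 2-layer GRU scoring network,
    as described above."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.gru1 = nn.GRU(hidden, hidden, batch_first=True)
        self.gru2 = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim); Conv1d expects (batch, channels, frames)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)
        return self.head(x[:, -1])    # fluency score read from the last frame
```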
Steps 630 and 640 may be executed simultaneously, so that the pronunciation and the fluency of the speech to be evaluated are evaluated in parallel, which improves the efficiency of the evaluation.
Step 650: and obtaining the final evaluation score of the speech to be evaluated according to the pronunciation score and the fluency score of the speech to be evaluated.
In the embodiment of the application, the evaluation model comprises the phonon network and the fluency network, the pronunciation of the speech to be evaluated is scored through the phonon network, the fluency of the speech to be evaluated is scored through the fluency network, automatic evaluation of the speech to be evaluated can be realized, and the accuracy of evaluation is improved.
Based on the above embodiments, referring to fig. 8, a flowchart of an evaluation method combining speech and speech text in the embodiment of the present application is shown, which specifically includes:
Step 800: and acquiring the speech to be evaluated and the speech text corresponding to the speech to be evaluated.
Step 810: and performing feature extraction on the voice to be evaluated through the dimensionality reduction network, and determining the voice feature of the voice to be evaluated.
Step 820: and identifying the voice characteristics of the voice to be evaluated through the characteristic alignment module, and identifying each phoneme of the voice to be evaluated.
In the embodiment of the application, when the speech to be evaluated and the speech text corresponding to the speech to be evaluated are obtained, the feature alignment module finds the start time and the end time (and hence the pronunciation duration) of each phoneme in the speech to be evaluated, and thereby identifies each phoneme of the speech to be evaluated.
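As a toy illustration of recovering the start and end time of each phoneme, the sketch below assumes the alignment is already available as one phoneme label per 10 ms frame (a hypothetical intermediate; the internals of the feature alignment module are not specified above):

```python
def phoneme_segments(frame_labels, frame_shift: float = 0.01):
    """frame_labels: one phoneme label per frame, e.g. from forced alignment
    of the speech text against the audio. Returns (phoneme, start_s, end_s)."""
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        # a segment ends where the label changes (or at the last frame)
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start * frame_shift, i * frame_shift))
            start = i
    return segments

# phoneme_segments(["sil", "s", "s", "a", "a", "a", "sil"])
# -> [('sil', 0.0, 0.01), ('s', 0.01, 0.03), ('a', 0.03, 0.06), ('sil', 0.06, 0.07)]
```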
Step 830: and inputting the phonon voice characteristics of the voice to be evaluated into a phonon network to obtain the pronunciation score of the voice to be evaluated.
In this embodiment, the phonon network also adopts a 1-dimensional convolution followed by 2 layers of GRUs. Fig. 9 is a schematic structural diagram of the fluency network in this embodiment of the present application, and fig. 10 is a structural diagram of the RNN scoring network: the input of the second GRU layer is the output of the first GRU layer plus that layer's input, and the output of the network is the output of the second GRU layer plus that layer's input.
The connection between the 2 GRU layers is thus a residual connection in the style of a ResNet residual network; fig. 11 is a structural diagram of the GRU network in the embodiment of the present application, in which the output of each block is the network output plus its input, as sketched below.
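A minimal PyTorch sketch of this residual wiring around the 2 GRU layers, matching the output-plus-input description above:

```python
import torch
import torch.nn as nn

class ResidualGRUStack(nn.Module):
    """Two GRU layers with ResNet-style residual (output + input) connections."""
    def __init__(self, dim: int):
        super().__init__()
        self.gru1 = nn.GRU(dim, dim, batch_first=True)
        self.gru2 = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1, _ = self.gru1(x)
        y1 = h1 + x        # input of the second layer = output + input of the first
        h2, _ = self.gru2(y1)
        return h2 + y1     # network output = output + input of the second layer
```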
Step 840: and inputting the voice characteristics of the voice to be evaluated into the fluency network to obtain the fluency score of the voice to be evaluated.
Step 850: and obtaining the final evaluation score of the speech to be evaluated according to the pronunciation score and the fluency score of the speech to be evaluated.
In the embodiment of the application, the speech to be evaluated and the corresponding speech text are input into the evaluation model, which includes a phonon network and a fluency network; the pronunciation of the speech to be evaluated is scored through the phonon network and the fluency through the fluency network, so that automatic evaluation of the speech to be evaluated can be realized and the accuracy of the evaluation is improved.
Based on the same inventive concept, the embodiment of the present application further provides a speech evaluating apparatus, where the speech evaluating apparatus may be, for example, the server in the foregoing embodiment, and the speech evaluating apparatus may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the above embodiment, referring to fig. 12, a schematic structural diagram of a speech evaluation device in the embodiment of the present application is shown, which specifically includes:
a first obtaining module 1200, configured to obtain a speech to be evaluated;
the evaluation module 1210 is configured to identify each phoneme of the speech to be evaluated based on a trained evaluation model, with the speech to be evaluated as an input parameter, determine a phoneme speech feature similarity between each phoneme and a corresponding preset standard phoneme, and determine a fluent speech feature similarity corresponding to the speech to be evaluated according to the speech to be evaluated and the corresponding preset standard speech, where the phoneme represents a phoneme corresponding to a speech pronunciation minimum unit;
the determining module 1220 is configured to determine an evaluation result of the speech to be evaluated according to the phoneme speech feature similarity and the fluent speech feature similarity.
Optionally, the first obtaining module 1200 is further configured to: acquiring a voice text corresponding to a voice to be evaluated;
the evaluation module 1210 is specifically configured to:
and based on a trained evaluation model, recognizing each phoneme of the speech to be evaluated based on the speech text by taking the speech to be evaluated and the speech text as input parameters.
Optionally, the evaluating module 1210 is specifically configured to:
respectively determining the corresponding phonon characteristics of each phonon;
classifying the phones according to the determined phone characteristics, and respectively determining the preset phone categories to which the phones belong;
and respectively determining the similarity of the phonetic-sound characteristics between each phoneme and a preset standard phoneme included in the corresponding preset phoneme category.
Optionally, when determining the evaluation result of the speech to be evaluated, the determining module 1220 is specifically configured to:
determining the pronunciation score of the speech to be evaluated according to the similarity of the characteristics of the phonons and the speeches;
determining the fluency score of the speech to be evaluated according to the fluency speech feature similarity;
carrying out weighted average on the pronunciation score and the fluency score to obtain a final evaluation score of the speech to be evaluated;
and obtaining an evaluation result of the speech to be evaluated according to the final evaluation score.
Optionally, when obtaining the evaluation result of the speech to be evaluated according to the final evaluation score, the determining module 1220 is specifically configured to:
if the final evaluation score is determined to be greater than or equal to a preset first score threshold value, determining the grade corresponding to the speech to be evaluated as a first grade;
if the final evaluation score is smaller than the preset first score threshold and is larger than or equal to a second preset score threshold, determining that the grade corresponding to the voice to be evaluated is a second grade, wherein the preset first score threshold is larger than the preset second score threshold;
and if the final evaluation score is smaller than the second preset score threshold, determining that the grade corresponding to the voice to be evaluated is a third grade, wherein the voice quality of the first grade is higher than that of the second grade, and the voice quality of the second grade is higher than that of the third grade (a combined sketch of the weighting and grading follows).
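Putting the weighted average and the two thresholds together, a minimal sketch; the weight and the threshold values are illustrative, not taken from the text:

```python
def evaluation_grade(pron_score: float, fluency_score: float,
                     w_pron: float = 0.5, first_t: float = 0.8,
                     second_t: float = 0.6) -> str:
    """Weighted average of pronunciation and fluency scores, then
    thresholding into three grades (first_t > second_t)."""
    final = w_pron * pron_score + (1.0 - w_pron) * fluency_score
    if final >= first_t:
        return "first grade"    # highest speech quality
    if final >= second_t:
        return "second grade"
    return "third grade"
```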
Optionally, for training the evaluation model, the apparatus further includes:
a second obtaining module 1230, configured to obtain a standard voice text corresponding to each standard voice sample in a standard voice sample set and the standard voice sample set;
the simulation module 1240 is configured to perform voice simulation on each standard voice sample, respectively, to obtain each simulated voice sample;
a processing module 1250, configured to input each standard voice sample, the corresponding standard voice text, and the simulated voice samples into the evaluation model for training; recognize the phonons corresponding to each standard voice and the phonons corresponding to each simulated voice sample through the feature module of the evaluation model; determine, through the phonon network module of the evaluation model, the phonon voice feature similarity between the pronunciation of each phonon of each standard voice and each phonon of the corresponding simulated voice sample, obtaining a simulated pronunciation score of the simulated voice sample; determine, through the convolution network module of the evaluation model, the fluent voice feature similarity corresponding to the simulated voice, obtaining a simulated fluency score of the simulated voice sample; and determine, according to each determined simulated pronunciation score and each determined simulated fluency score, the final evaluation score corresponding to each simulated voice sample, until the target function of the evaluation model converges, to obtain the trained evaluation model, wherein the target function is the minimization of the cross entropy function between the simulated voice samples and the standard voice samples.
Optionally, the simulation module 1240 is specifically configured to:
according to a preset first change intensity coefficient, respectively performing sound quality change simulation on each standard voice sample to obtain the corresponding simulated voice sample, wherein the sound quality change simulation at least comprises one or any combination of the following processing modes: voice noise adding, spectrum noise reduction, spectrum distortion and fundamental frequency adjustment;
and/or, according to a preset second change intensity coefficient, respectively performing pitch change simulation on each standard voice sample to obtain the corresponding simulated voice sample, wherein the pitch change simulation at least comprises one or any combination of the following processing modes: voice deformation, lengthening and shortening of pronunciation time, spectrum deformation and fundamental frequency deformation (two toy simulations are sketched below).
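Two toy examples of such simulations on a raw waveform held in a NumPy array; real sound-quality and pitch simulation (spectral distortion, fundamental-frequency deformation) would need more elaborate signal processing, so these only illustrate the role of the change-intensity coefficient:

```python
import numpy as np

def simulate_quality_change(wave: np.ndarray, intensity: float) -> np.ndarray:
    """Sound-quality change: add white noise scaled by the preset first
    change-intensity coefficient (one of the listed processing modes)."""
    noise = np.random.randn(len(wave)) * intensity * np.std(wave)
    return wave + noise

def simulate_duration_change(wave: np.ndarray, intensity: float) -> np.ndarray:
    """Lengthen/shorten pronunciation time by naive resampling; assumes
    |intensity| < 1, where intensity > 0 lengthens and < 0 shortens."""
    old_idx = np.arange(len(wave))
    new_len = int(len(wave) * (1.0 + intensity))
    new_idx = np.linspace(0, len(wave) - 1, new_len)
    return np.interp(new_idx, old_idx, wave)
```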
Based on the above embodiments, referring to fig. 13, a schematic structural diagram of an electronic device in an embodiment of the present application is shown.
Embodiments of the present application provide an electronic device, which may include a processor 1310 (CPU), a memory 1320, an input device 1330, an output device 1340, and the like; the input device 1330 may include a keyboard, a mouse, a touch screen, and the like, and the output device 1340 may include a display device such as a liquid crystal display (LCD) or a cathode ray tube (CRT).
Memory 1320 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides the processor 1310 with program instructions and data stored in the memory 1320. In the embodiment of the present application, the memory 1320 may be used to store the program of any one of the speech evaluation methods in the embodiment of the present application.
The processor 1310 is used for executing any speech evaluation method in the embodiment of the present application according to the obtained program instructions by calling the program instructions stored in the memory 1320.
Based on the foregoing embodiments, in the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the speech evaluation method in any of the above method embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A speech evaluation method, comprising:
acquiring a voice to be evaluated;
based on a trained evaluation model, with the speech to be evaluated as an input parameter, recognizing each phoneme of the speech to be evaluated, determining phoneme speech feature similarity between each phoneme and a corresponding preset standard phoneme, and determining fluent speech feature similarity corresponding to the speech to be evaluated according to the speech to be evaluated and the corresponding preset standard speech, wherein the determining of the phoneme speech feature similarity between each phoneme and the corresponding preset standard phoneme specifically comprises: respectively determining the corresponding phonon characteristics of each phonon; classifying the phones according to the determined phone characteristics, and respectively determining the preset phone categories to which the phones belong; respectively determining the phoneme voice feature similarity between each phoneme and a preset standard phoneme included in a corresponding preset phoneme category, wherein the phoneme represents the phoneme corresponding to the minimum unit of voice pronunciation;
and determining an evaluation result of the speech to be evaluated according to the phoneme speech feature similarity and the fluent speech feature similarity.
2. The method of claim 1, further comprising: acquiring a voice text corresponding to a voice to be evaluated;
based on the trained evaluation model, using the speech to be evaluated as an input parameter, and identifying each phoneme of the speech to be evaluated, specifically comprising:
and based on a trained evaluation model, recognizing each phoneme of the speech to be evaluated based on the speech text by taking the speech to be evaluated and the speech text as input parameters.
3. The method according to claim 1 or 2, wherein determining the evaluation result of the speech to be evaluated specifically comprises:
determining the pronunciation score of the speech to be evaluated according to the similarity of the characteristics of the phonons and the speeches;
determining the fluency score of the speech to be evaluated according to the fluency speech feature similarity;
carrying out weighted average on the pronunciation score and the fluency score to obtain a final evaluation score of the speech to be evaluated;
and obtaining an evaluation result of the speech to be evaluated according to the final evaluation score.
4. The method according to claim 3, wherein obtaining the evaluation result of the speech to be evaluated according to the final evaluation score specifically comprises:
if the final evaluation score is determined to be greater than or equal to a preset first score threshold value, determining the grade corresponding to the speech to be evaluated as a first grade;
if the final evaluation score is smaller than the preset first score threshold and is larger than or equal to a second preset score threshold, determining that the grade corresponding to the voice to be evaluated is a second grade, wherein the preset first score threshold is larger than the preset second score threshold;
and if the final evaluation score is smaller than a second preset score threshold value, determining that the grade corresponding to the voice to be evaluated is a third grade, wherein the voice quality of the first grade is larger than that of the second grade, and the voice quality of the second grade is larger than that of the third grade.
5. The method according to claim 1 or 2, wherein the evaluation model is trained in the manner of:
acquiring a standard voice sample set and a standard voice text corresponding to each standard voice sample in the standard voice sample set;
respectively carrying out voice simulation on each standard voice sample to obtain each simulated voice sample;
respectively inputting the standard voice samples, the corresponding standard voice texts and the simulated voice samples into the evaluation model for training, identifying the phones corresponding to the standard voices and the phones corresponding to the simulated voice samples through a feature module of the evaluation model, determining the phone voice feature similarity between the pronunciation of each phone of each standard voice and each phone of the corresponding simulated voice sample through a phone network module of the evaluation model to obtain the simulated pronunciation score of the simulated voice sample, determining the fluent voice feature similarity corresponding to the simulated voice through a convolution network module of the evaluation model to obtain the simulated fluency score of the simulated voice sample, and determining the final evaluation score corresponding to each simulated voice sample according to each determined simulated pronunciation score and each determined simulated fluency score, until the target function of the evaluation model converges, to obtain the trained evaluation model, wherein the target function is the minimization of a cross entropy function between the simulated voice sample and the standard voice sample, the phone network module consists of a 1-dimensional convolution and 2 layers of GRUs, and the convolution network module consists of a 1-dimensional convolution and 2 layers of GRUs.
6. The method of claim 5, wherein performing speech simulation on each standard speech sample in the set of standard speech samples to obtain each simulated speech sample comprises:
according to a preset first change intensity coefficient, respectively carrying out sound quality change simulation on each standard voice sample to obtain the corresponding simulated voice sample, wherein the sound quality change simulation at least comprises one or any combination of the following processing modes: voice noise adding, spectrum noise reduction, spectrum distortion and fundamental frequency adjustment;
and/or performing pitch change simulation on each standard voice sample according to a preset second change intensity coefficient to obtain a corresponding simulated voice sample, wherein the pitch change simulation at least comprises one or any combination of the following processing modes: voice deformation, lengthening and shortening of pronunciation time, spectrum deformation and fundamental frequency deformation.
7. A speech evaluation apparatus, comprising:
the first acquisition module is used for acquiring the voice to be evaluated;
the evaluation module is used for identifying each phoneme of the speech to be evaluated based on a trained evaluation model by taking the speech to be evaluated as an input parameter, determining phoneme speech feature similarity between each phoneme and a corresponding preset standard phoneme, and determining fluent speech feature similarity corresponding to the speech to be evaluated according to the speech to be evaluated and the corresponding preset standard speech, wherein when determining phoneme speech feature similarity between each phoneme and the corresponding preset standard phoneme, the evaluation module is specifically used for: respectively determining the corresponding phonon characteristics of each phonon; classifying the phones according to the determined phone characteristics, and respectively determining the preset phone categories to which the phones belong; respectively determining the phoneme voice feature similarity between each phoneme and a preset standard phoneme included in a corresponding preset phoneme category, wherein the phoneme represents the phoneme corresponding to the minimum unit of voice pronunciation;
and the determining module is used for determining the evaluation result of the speech to be evaluated according to the phoneme speech feature similarity and the fluent speech feature similarity.
8. The apparatus of claim 7, wherein the first obtaining module is further to: acquiring a voice text corresponding to a voice to be evaluated;
the evaluation module is specifically configured to:
and based on a trained evaluation model, recognizing each phoneme of the speech to be evaluated based on the speech text by taking the speech to be evaluated and the speech text as input parameters.
9. The apparatus according to claim 7 or 8, wherein, when determining the evaluation result of the speech to be evaluated, the determining module is specifically configured to:
determining the pronunciation score of the speech to be evaluated according to the similarity of the characteristics of the phonons and the speeches;
determining the fluency score of the speech to be evaluated according to the fluency speech feature similarity;
carrying out weighted average on the pronunciation score and the fluency score to obtain a final evaluation score of the speech to be evaluated;
and obtaining an evaluation result of the speech to be evaluated according to the final evaluation score.
10. The apparatus according to claim 9, wherein when obtaining the evaluation result of the speech to be evaluated according to the final evaluation score, the determining module is specifically configured to:
if the final evaluation score is determined to be greater than or equal to a preset first score threshold value, determining the grade corresponding to the speech to be evaluated as a first grade;
if the final evaluation score is smaller than the preset first score threshold and is larger than or equal to a second preset score threshold, determining that the grade corresponding to the voice to be evaluated is a second grade, wherein the preset first score threshold is larger than the preset second score threshold;
and if the final evaluation score is smaller than a second preset score threshold value, determining that the grade corresponding to the voice to be evaluated is a third grade, wherein the voice quality of the first grade is larger than that of the second grade, and the voice quality of the second grade is larger than that of the third grade.
11. The apparatus according to claim 7 or 8, wherein the training mode for the evaluation module further comprises:
the second acquisition module is used for acquiring a standard voice sample set and a standard voice text corresponding to each standard voice sample in the standard voice sample set;
the simulation module is used for respectively carrying out voice simulation on each standard voice sample to obtain each simulated voice sample;
a processing module, configured to input each standard voice sample, the corresponding standard voice text, and the simulated voice samples into the evaluation model for training, recognize the phones corresponding to each standard voice and the phones corresponding to each simulated voice sample through the feature module of the evaluation model, determine the phone voice feature similarity between the pronunciation of each phone of each standard voice and each phone of the corresponding simulated voice sample through the phone network module of the evaluation model to obtain a simulated pronunciation score of the simulated voice sample, determine the fluent voice feature similarity corresponding to the simulated voice through the convolution network module of the evaluation model to obtain a simulated fluency score of the simulated voice sample, and determine, according to each determined simulated pronunciation score and each determined simulated fluency score, the final evaluation score corresponding to each simulated voice sample, until the target function of the evaluation model converges, to obtain the trained evaluation model, wherein the target function is the minimization of a cross entropy function between the simulated voice sample and the standard voice sample, the phone network module consists of a 1-dimensional convolution and 2 layers of GRUs, and the convolution network module consists of a 1-dimensional convolution and 2 layers of GRUs.
12. The apparatus of claim 11, wherein the simulation module is specifically configured to:
according to a preset first change intensity coefficient, respectively carrying out sound quality change simulation on each standard voice sample to obtain the corresponding simulated voice sample, wherein the sound quality change simulation at least comprises one or any combination of the following processing modes: voice noise adding, spectrum noise reduction, spectrum distortion and fundamental frequency adjustment;
and/or performing pitch change simulation on each standard voice sample according to a preset second change intensity coefficient to obtain the corresponding simulated voice sample, wherein the pitch change simulation at least comprises one or any combination of the following processing modes: voice deformation, lengthening and shortening of pronunciation time, and deformation of the fundamental frequency.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1-6 are implemented when the program is executed by the processor.
14. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implementing the steps of the method of any one of claims 1 to 6.
CN202010723408.7A 2020-07-24 2020-07-24 Voice evaluation method and device Active CN111916108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010723408.7A CN111916108B (en) 2020-07-24 2020-07-24 Voice evaluation method and device

Publications (2)

Publication Number Publication Date
CN111916108A CN111916108A (en) 2020-11-10
CN111916108B true CN111916108B (en) 2021-04-02

Family

ID=73281724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010723408.7A Active CN111916108B (en) 2020-07-24 2020-07-24 Voice evaluation method and device

Country Status (1)

Country Link
CN (1) CN111916108B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397056B (en) * 2021-01-20 2021-04-09 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product
CN112562737B (en) * 2021-02-25 2021-06-22 北京映客芝士网络科技有限公司 Method, device, medium and electronic equipment for evaluating audio processing quality
CN112802494B (en) * 2021-04-12 2021-07-16 北京世纪好未来教育科技有限公司 Voice evaluation method, device, computer equipment and medium
CN113223559A (en) * 2021-05-07 2021-08-06 北京有竹居网络技术有限公司 Evaluation method, device and equipment for synthesized voice

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101840699A (en) * 2010-04-30 2010-09-22 中国科学院声学研究所 Voice quality evaluation method based on pronunciation model
CN102354495A (en) * 2011-08-31 2012-02-15 中国科学院自动化研究所 Testing method and system of semi-opened spoken language examination questions
CN103559894A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and system for evaluating spoken language
CN105845134A (en) * 2016-06-14 2016-08-10 科大讯飞股份有限公司 Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
US9530431B2 (en) * 2013-06-03 2016-12-27 Kabushiki Kaisha Toshiba Device method, and computer program product for calculating score representing correctness of voice
CN108766415A (en) * 2018-05-22 2018-11-06 清华大学 A kind of voice assessment method
CN109545243A (en) * 2019-01-23 2019-03-29 北京猎户星空科技有限公司 Pronunciation quality evaluating method, device, electronic equipment and storage medium
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN110223673A (en) * 2019-06-21 2019-09-10 龙马智芯(珠海横琴)科技有限公司 The processing method and processing device of voice, storage medium, electronic equipment
CN110600052A (en) * 2019-08-19 2019-12-20 天闻数媒科技(北京)有限公司 Voice evaluation method and device
CN110648690A (en) * 2019-09-26 2020-01-03 广州三人行壹佰教育科技有限公司 Audio evaluation method and server
CN110782921A (en) * 2019-09-19 2020-02-11 腾讯科技(深圳)有限公司 Voice evaluation method and device, storage medium and electronic device
US10559225B1 (en) * 2016-03-30 2020-02-11 Educational Testing Service Computer-implemented systems and methods for automatically generating an assessment of oral recitations of assessment items
CN110797044A (en) * 2019-08-22 2020-02-14 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN111326177A (en) * 2020-02-10 2020-06-23 北京声智科技有限公司 Voice evaluation method, electronic equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100514446C (en) * 2004-09-16 2009-07-15 北京中科信利技术有限公司 Pronunciation evaluating method based on voice identification and voice analysis
CN110473519B (en) * 2018-05-11 2022-05-27 北京国双科技有限公司 Voice processing method and device
CN109712643A (en) * 2019-03-13 2019-05-03 北京精鸿软件科技有限公司 The method and apparatus of Speech Assessment
CN110930988B (en) * 2019-12-13 2020-10-20 广州三人行壹佰教育科技有限公司 Method and system for determining phoneme score
CN111145782B (en) * 2019-12-20 2021-07-13 深圳追一科技有限公司 Overlapped speech recognition method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"automatic text alignment for speech system evaluation";J. Picone;《IEEE transactions on acoustics,speech,and signal processing》;19861231;第34卷(第4期);全文 *
"汉语普通话发音质量自动评测方法研究";张珑;《中国优秀博士学位论文全文数据库信息科技辑》;20141215;全文 *

Also Published As

Publication number Publication date
CN111916108A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN111916108B (en) Voice evaluation method and device
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
Song et al. Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
CN109313892A (en) Steady language identification method and system
CN111816210B (en) Voice scoring method and device
WO2009040382A1 (en) Method and system for identifying information related to a good
EP4078571A1 (en) A text-to-speech synthesis method and system, a method of training a text-to-speech synthesis system, and a method of calculating an expressivity score
Yücesoy et al. A new approach with score-level fusion for the classification of a speaker age and gender
CN116665669A (en) Voice interaction method and system based on artificial intelligence
US20050015251A1 (en) High-order entropy error functions for neural classifiers
US20230070000A1 (en) Speech recognition method and apparatus, device, storage medium, and program product
JP2002358096A (en) Method and device for training parameters of pattern recognition system exactly associated with one implemented transform example of one pattern from list
CN114708857A (en) Speech recognition model training method, speech recognition method and corresponding device
CN113450757A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN113111151A (en) Cross-modal depression detection method based on intelligent voice question answering
CN116092473A (en) Prosody annotation model, training method of prosody prediction model and related equipment
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Alshamsi et al. Automated speech emotion recognition on smart phones
JPWO2020003413A1 (en) Information processing equipment, control methods, and programs
JP2001083986A (en) Method for forming statistical model
CN110782877A (en) Speech identification method and system based on Fisher mixed feature and neural network
CN112634947B (en) Animal voice and emotion feature set sequencing and identifying method and system
Woods et al. A robust ensemble model for spoken language recognition
CN111341298A (en) Speech recognition algorithm scoring method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant