CN112802494B - Voice evaluation method, device, computer equipment and medium

Voice evaluation method, device, computer equipment and medium

Info

Publication number
CN112802494B
CN112802494B (application CN202110386211.3A)
Authority
CN
China
Prior art keywords
voice
result
similarity
evaluation
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110386211.3A
Other languages
Chinese (zh)
Other versions
CN112802494A (en)
Inventor
赵明 (Zhao Ming)
田科 (Tian Ke)
潘建伟 (Pan Jianwei)
吴中勤 (Wu Zhongqin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110386211.3A
Publication of CN112802494A
Application granted
Publication of CN112802494B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Analysis techniques using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/69 Analysis techniques for evaluating synthetic or decoded voice signals
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Abstract

The disclosure relates to a voice evaluation method, apparatus, computer device, and medium. The voice evaluation method includes: inputting a test text into a speech synthesis model, and acquiring a first voice, corresponding to the test text, output by the speech synthesis model; obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text; and determining an evaluation result of the first voice according to the first similarity and a known evaluation result of the second voice. Because the evaluation result of the second voice is known, determining the evaluation result of the first voice from the first similarity and that known result shortens evaluation time, reduces the interference of subjective factors inherent in manual evaluation, and improves the accuracy of the evaluation result, thereby improving the efficiency of voice evaluation.

Description

Voice evaluation method, device, computer equipment and medium
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a speech evaluation method, apparatus, computer device, and medium.
Background
Text-to-speech (TTS) technology can convert text into speech output. With the rapid development of the artificial intelligence industry, TTS is widely applied in scenarios such as voice assistants, map navigation, and audiobook reading, and the quality requirements on the speech output by TTS keep rising.
In the prior art, a manual evaluation mode is generally adopted: the speech output by the TTS model is evaluated and scored through human listening tests. For example, different auditors each score the speech to be evaluated to obtain a Mean Opinion Score (MOS) value; the score ranges from 0 to 5, and a larger score indicates better voice quality.
However, evaluating speech in a manual evaluation manner is inefficient.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present disclosure provides a speech evaluation method, apparatus, computer device, and medium.
In a first aspect, the present disclosure provides a speech evaluation method, including:
inputting a test text into a voice synthesis model, and acquiring a first voice corresponding to the test text output by the voice synthesis model;
obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text;
and determining the evaluation result of the first voice according to the first similarity and the known second voice evaluation result.
Optionally, the audio features include: amplitude and frequency;
the obtaining a first similarity between the first voice and the second voice according to the audio feature of the first voice and the audio feature of the second voice corresponding to the test text includes:
acquiring first sound wave waveforms respectively corresponding to all first voice segments of the first voice;
acquiring second sound wave waveforms respectively corresponding to all second voice segments of the second voice;
acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform;
and obtaining the first similarity according to the first result and the second result.
Optionally, the obtaining a first result corresponding to an amplitude and a second result corresponding to a frequency according to the first acoustic waveform and the second acoustic waveform includes:
performing cross-comparison calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude;
performing a similarity hash operation on the first sound wave waveform to obtain a first similarity hash operation result;
performing a similarity hash operation on the second sound wave waveform to obtain a second similarity hash operation result;
and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
Optionally, the acquiring first sound wave waveforms respectively corresponding to all the voice segments of the first voice includes:
performing voice segmentation on the first voice to obtain at least two first voice segments;
performing Fourier transform on the at least two first voice segments respectively to obtain first sound wave waveforms corresponding to all voice segments of the first voice respectively;
the obtaining of the second acoustic waveforms corresponding to all the voice segments of the second voice includes:
performing voice segmentation on the second voice to obtain at least two second voice segments;
and respectively carrying out Fourier transform on the at least two second voice segments to obtain second sound wave waveforms respectively corresponding to all the voice segments of the second voice.
Optionally, the obtaining the first similarity according to the first result and the second result includes:
obtaining the first similarity according to
$$Q = \frac{X + Y}{2N}$$
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments (the published formula appears only as an image; this form is reconstructed from the stated variable definitions).
Optionally, the determining an evaluation result of the first speech according to the first similarity and a known second speech evaluation result includes:
and if the first similarity is larger than a first preset threshold value, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
Optionally, the method further includes:
if the first similarity is not larger than the first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score;
determining an evaluation result of the first voice according to the evaluation score;
the speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions, the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
Optionally, the determining an evaluation result of the first speech according to the evaluation score includes:
if the evaluation score is larger than a second preset threshold value, determining that the evaluation result of the first voice is a high-quality voice;
and if the evaluation score is smaller than a third preset threshold value, determining that the evaluation result of the first voice is poor voice.
Optionally, before obtaining the first similarity between the first voice and the second voice according to the audio feature of the first voice and the audio feature of the second voice corresponding to the test text, the method further includes:
recognizing the first voice according to an automatic voice recognition algorithm to generate a first text;
comparing the test text with the first text to obtain a second similarity;
and determining that the second similarity is greater than a fourth preset threshold.
Optionally, the method further includes:
and if the second similarity is not greater than the fourth preset threshold, determining that the evaluation result of the first voice is poor voice.
In a second aspect, the present disclosure provides a speech evaluation apparatus, including:
the acquisition module is used for inputting a test text into a voice synthesis model and acquiring a first voice corresponding to the test text output by the voice synthesis model;
the processing module is used for obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text;
the processing module is further configured to determine an evaluation result of the first voice according to the first similarity and a known second voice evaluation result.
Optionally, the audio features include: amplitude and frequency;
the processing module is specifically configured to:
acquiring first sound wave waveforms respectively corresponding to all first voice segments of the first voice;
acquiring second sound wave waveforms respectively corresponding to all second voice segments of the second voice;
acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform;
and obtaining the first similarity according to the first result and the second result.
Optionally, the processing module is specifically configured to:
performing cross-comparison calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude;
performing a similarity hash operation on the first sound wave waveform to obtain a first similarity hash operation result;
performing a similarity hash operation on the second sound wave waveform to obtain a second similarity hash operation result;
and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
Optionally, the processing module is specifically configured to:
performing voice segmentation on the first voice to obtain at least two first voice segments;
performing Fourier transform on the at least two first voice segments respectively to obtain first sound wave waveforms corresponding to all voice segments of the first voice respectively;
performing voice segmentation on the second voice to obtain at least two second voice segments;
and respectively carrying out Fourier transform on the at least two second voice segments to obtain second sound wave waveforms respectively corresponding to all the voice segments of the second voice.
Optionally, the processing module is specifically configured to:
according to
$$Q = \frac{X + Y}{2N}$$
obtaining the first similarity;
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments (formula reconstructed as above).
Optionally, the processing module is specifically configured to:
and if the first similarity is larger than a first preset threshold value, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
Optionally, the processing module is further configured to:
if the first similarity is not larger than the first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score;
determining an evaluation result of the first voice according to the evaluation score;
the speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions, the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
Optionally, the processing module is specifically configured to:
if the evaluation score is larger than a second preset threshold value, determining that the evaluation result of the first voice is a high-quality voice;
and if the evaluation score is smaller than a third preset threshold value, determining that the evaluation result of the first voice is poor voice.
Optionally, the processing module is further configured to:
recognizing the first voice according to an automatic voice recognition algorithm to generate a first text;
comparing the test text with the first text to obtain a second similarity;
and determining that the second similarity is greater than a fourth preset threshold.
Optionally, the processing module is further configured to:
and if the second similarity is not greater than the fourth preset threshold, determining that the evaluation result of the first voice is poor voice.
In a third aspect, the present disclosure provides a computer device comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of the first aspect when executing the computer program.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text; and determining an evaluation result of the first voice according to the first similarity and a known evaluation result of the second voice. Because the evaluation result of the second voice is known, determining the evaluation result of the first voice from the first similarity and that known result shortens evaluation time, reduces the interference of subjective factors inherent in manual evaluation, and improves the accuracy of the evaluation result, thereby improving the efficiency of voice evaluation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a speech evaluation method according to the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a speech assessment method provided by the present disclosure;
FIG. 3 is a schematic flow chart diagram illustrating another embodiment of a speech evaluation method according to the present disclosure;
FIG. 4 is a schematic flow chart diagram illustrating another embodiment of a speech evaluation method according to the present disclosure;
FIG. 5 is a schematic flow chart diagram illustrating another embodiment of a speech evaluation method according to the present disclosure;
FIG. 6 is a schematic flow chart diagram illustrating yet another embodiment of a speech evaluation method according to the present disclosure;
fig. 7 is a schematic structural diagram of a speech evaluation device provided by the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
TTS technology can convert arbitrary text information into audible speech information. Existing TTS techniques typically use neural network models to achieve text-to-speech conversion. After a TTS model outputs speech, the quality of the synthesized speech needs to be evaluated. At present, the speech output by a TTS model is generally evaluated and scored through human listening tests; however, manual evaluation is often subjective, which makes evaluation scores inaccurate, and its efficiency is low.
The present disclosure provides a speech evaluation method, including: inputting a test text into a speech synthesis model, and acquiring a first voice, corresponding to the test text, output by the speech synthesis model; obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text; and determining an evaluation result of the first voice according to the first similarity and a known evaluation result of the second voice. Because the evaluation result of the second voice is known, determining the evaluation result of the first voice from the first similarity and that known result shortens evaluation time, reduces the interference of subjective factors inherent in manual evaluation, and improves the accuracy of the evaluation result, thereby improving the efficiency of voice evaluation.
The technical solutions of the present disclosure are described in several specific embodiments, and the same or similar concepts may be referred to one another, and are not described in detail in each place.
Fig. 1 is a schematic flow chart of an embodiment of a speech evaluation method provided in an embodiment of the present disclosure, as shown in fig. 1, the method of the embodiment includes:
s101: and inputting the test text into a voice synthesis model, and acquiring a first voice corresponding to the test text output by the voice synthesis model.
The speech synthesis model can be a model built on a neural network, including but not limited to: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory networks (LSTMs); the present disclosure does not limit this.
Taking the test text '大家好，我是来自北京的小红' ('Hello everyone, I am Xiaohong from Beijing') as an example, the test text is input into the TTS model, and the first voice output by the TTS model corresponding to the test text, with pronunciation 'da4 jia1 hao3, wo3 shi4 lai2 zi4 bei3 jing1 de0 xiao3 hong2', is acquired.
S102: and obtaining a first similarity of the first voice and the second voice according to the audio characteristics of the first voice and the audio characteristics of the second voice corresponding to the test text.
The second voice can be obtained by recording a user reading the test text aloud, or by inputting the test text into another TTS model and acquiring the speech corresponding to the test text output by that TTS model.
Optionally, the audio features include: amplitude and frequency.
One possible implementation is shown in Fig. 2, as follows.
S1021: And acquiring first sound wave waveforms respectively corresponding to all first voice segments of the first voice.
Voice segmentation is performed on the first voice to obtain at least two first voice segments. For example, the first voice is segmented by a Voice Activity Detection (VAD) algorithm to obtain at least two first voice segments. VAD is generally used to identify silence segments in audio data and extract the pronounced segments.
Fourier transform is respectively performed on the at least two first voice segments to obtain first sound wave waveforms respectively corresponding to all the voice segments of the first voice. Taking the case where the first voice is divided into N first voice segments as an example, Fourier transform is performed on each of the N voice segments to obtain N first sound wave waveforms $F_1, F_2, \dots, F_N$, where $F_i$ is the ith first sound wave waveform, N is an integer greater than or equal to 2, and i is an integer greater than or equal to 1 and less than or equal to N (the symbols $F_i$ stand in for the waveform images in the original publication).
S1022: and acquiring second acoustic waveforms respectively corresponding to all second voice segments of the second voice.
Voice segmentation is performed on the second voice to obtain at least two second voice segments; similarly, the second voice can be segmented by VAD. Because the test text corresponding to the second voice is the same as that corresponding to the first voice, VAD segmentation of the second voice yields the same number of segments as the first voice. Alternatively, the second voice is segmented according to the number of first voice segments, so that the number of second voice segments equals the number of first voice segments.
Fourier transform is respectively performed on the at least two second voice segments to obtain second sound wave waveforms respectively corresponding to all the voice segments of the second voice. Taking the case where the second voice is divided into N second voice segments as an example, Fourier transform is performed on each of the N voice segments to obtain N second sound wave waveforms $G_1, G_2, \dots, G_N$, where $G_i$ is the ith second sound wave waveform, N is an integer greater than or equal to 2, and i is an integer greater than or equal to 1 and less than or equal to N.
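To make the segmentation-and-transform step concrete, the following is a minimal Python sketch of S1021/S1022. A simple energy-based splitter stands in for the VAD algorithm mentioned above, and all function names, the frame length, and the energy threshold are assumptions rather than anything the patent specifies:

```python
import numpy as np

def split_speech(wave, sr, frame_ms=30, energy_ratio=0.1):
    """Energy-based stand-in for the VAD step: return voiced segments of `wave`."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(wave) // frame
    frames = wave[: n_frames * frame].reshape(n_frames, frame)
    energy = (frames ** 2).mean(axis=1)
    voiced = energy > energy_ratio * energy.max()
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append(wave[start * frame : i * frame])
            start = None
    if start is not None:
        segments.append(wave[start * frame : n_frames * frame])
    return segments

def segment_spectra(wave, sr):
    """Fourier-transform every voiced segment; the magnitude spectra play the
    role of the per-segment sound wave waveforms F_i / G_i."""
    return [np.abs(np.fft.rfft(seg)) for seg in split_speech(wave, sr)]
```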
S1023: and acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform.
One possible implementation is shown in Fig. 3, as follows.
S10231: And performing cross-over-ratio (intersection-over-union) calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude.
According to
$$X = \sum_{i=1}^{N} \frac{S(F_i \cap G_i)}{S(F_i \cup G_i)}$$
the similarity between the amplitude of the first sound wave waveform and the amplitude of the second sound wave waveform is determined, giving the first result corresponding to the amplitude.
Wherein $X$ is the first result, $F_i$ is the ith first sound wave waveform, $G_i$ is the ith second sound wave waveform, $S(F_i)$ is the area of the ith first sound wave waveform, $S(G_i)$ is the area of the ith second sound wave waveform, i is an integer greater than or equal to 1 and less than or equal to N, and N is the number of first voice segments. (The published formula appears only as an image; the intersection-over-union form above is reconstructed from these variable definitions.)
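A sketch of the cross-over-ratio step under the reconstruction above, treating each curve sample-wise so that point-wise minima approximate the intersection area and point-wise maxima the union; this discretisation and the function names are assumptions:

```python
import numpy as np

def amplitude_iou(f, g):
    """Per-segment intersection-over-union of two magnitude curves: point-wise
    minima approximate the intersection area, point-wise maxima the union."""
    m = min(len(f), len(g))
    f, g = np.abs(f[:m]), np.abs(g[:m])
    union = np.maximum(f, g).sum()
    return np.minimum(f, g).sum() / union if union > 0 else 1.0

def first_result(first_waveforms, second_waveforms):
    """X: per-segment amplitude IoU summed over the N segment pairs."""
    return sum(amplitude_iou(f, g)
               for f, g in zip(first_waveforms, second_waveforms))
```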
S10232: and performing similarity Hash operation on the first sound wave waveform to obtain a first similarity Hash operation result.
A similarity-hash (simhash) operation generates a 64-bit signature from the shape of the sound wave waveform, encoding an upward movement of the waveform as 1 and a downward movement as 0.
S10233: and performing similarity Hash operation on the second acoustic waveform to obtain a second similarity Hash operation result.
S10234: and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
According to
$$Y = \sum_{i=1}^{N} \left(1 - \frac{D\big(H(F_i), H(G_i)\big)}{64}\right)$$
the similarity between the frequency of the first sound wave waveform and the frequency of the second sound wave waveform is determined, giving the second result corresponding to the frequency.
Wherein $Y$ is the second result, $F_i$ is the ith first sound wave waveform, $G_i$ is the ith second sound wave waveform, $H(F_i)$ is the first simhash operation result corresponding to the ith first sound wave waveform, $H(G_i)$ is the second simhash operation result corresponding to the ith second sound wave waveform, and $D(H(F_i), H(G_i))$ is the Hamming distance between the two; i is an integer greater than or equal to 1 and less than or equal to N, and N is the number of first voice segments. (The published formula appears only as an image; the form above, normalised by the 64-bit signature length, is reconstructed from these variable definitions.) The Hamming distance refers to the number of positions at which two character strings of the same length differ.
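A sketch of the simhash-and-Hamming step, following the up = 1 / down = 0 rule described above; the resampling to 64 transitions and the normalisation by the signature length are assumptions:

```python
import numpy as np

def waveform_simhash(curve, bits=64):
    """64-bit signature: resample the curve to bits+1 points and emit 1 where
    it rises between neighbours, 0 where it falls."""
    idx = np.linspace(0, len(curve) - 1, bits + 1).astype(int)
    pts = np.asarray(curve, dtype=float)[idx]
    sig = 0
    for a, b in zip(pts[:-1], pts[1:]):
        sig = (sig << 1) | (1 if b > a else 0)
    return sig

def hamming(a, b):
    """Hamming distance between two equal-width bit signatures."""
    return bin(a ^ b).count("1")

def second_result(first_waveforms, second_waveforms, bits=64):
    """Y: per-segment frequency similarity 1 - D/bits, summed over segments."""
    return sum(
        1 - hamming(waveform_simhash(f, bits), waveform_simhash(g, bits)) / bits
        for f, g in zip(first_waveforms, second_waveforms)
    )
```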
S1024: and obtaining a first similarity according to the first result and the second result.
Optionally, according to
$$Q = \frac{X + Y}{2N}$$
the first similarity is obtained;
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments. (The published formula appears only as an image; this averaged form is a reconstruction consistent with the variable definitions and keeps $Q$ in [0, 1].)
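Combining the two results under the reconstructed formula gives a one-line sketch of S1024 (the formula itself is an assumption, as noted above):

```python
def first_similarity(x, y, n):
    """Q = (X + Y) / (2N): the mean of the summed amplitude and frequency
    similarities over the N segment pairs (reconstructed formula)."""
    return (x + y) / (2 * n)

# Usage with the sketches above (fw, gw: lists of per-segment spectra):
# q = first_similarity(first_result(fw, gw), second_result(fw, gw), len(fw))
```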
S103: and determining the evaluation result of the first voice according to the first similarity and the known second voice evaluation result.
One possible implementation is shown in Fig. 4, as follows.
S1031: And if the first similarity is greater than a first preset threshold, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice.
The evaluation result of the second voice is high-quality voice or poor-quality voice. For example: if the evaluation result of the second voice is poor-quality voice and the first similarity is greater than 0.8, the evaluation result of the first voice is determined to be poor-quality voice.
Optionally, fig. 4 may also be based on the embodiments shown in fig. 2 or fig. 3.
Another possible implementation is:
and if the first similarity is larger than or equal to a first preset threshold, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
In this embodiment, a test text is input into a speech synthesis model, and a first speech corresponding to the test text output by the speech synthesis model is obtained; obtaining a first similarity of the first voice and the second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text; and determining the evaluation result of the first voice according to the first similarity and the known second voice evaluation result. Because the evaluation result of the second voice is known, the evaluation result of the first voice is determined according to the first similarity and the evaluation result of the second voice, so that the voice evaluation time is shortened, the interference of subjective factors of manual evaluation is reduced, and the accuracy of the evaluation result is improved, thereby improving the efficiency of voice evaluation.
Optionally, the audio features include: sound length (duration);
another possible implementation manner of S102 is:
S1021': And acquiring the sound length difference ratio of the first voice and the second voice according to the sound length of the first voice and the sound length of the second voice.
Optionally, the sound length difference ratio of the first voice and the second voice is acquired according to
$$R = \frac{|T_1 - T_2|}{\max(T_1, T_2)}$$
wherein $R$ is the sound length difference ratio, $T_1$ is the total duration of the first voice, and $T_2$ is the total duration of the second voice (the published formula appears only as an image; this form is reconstructed from the variable definitions).
S1022': And obtaining the first similarity of the first voice and the second voice according to the sound length difference ratio.
S1023': Specifically, the first similarity of the first voice and the second voice is obtained according to the sound length difference ratio and a predefined mapping between the sound length difference ratio and the first similarity.
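A sketch of this duration-based alternative; both the form of the difference ratio and the linear mapping to a similarity are assumptions, since the disclosure leaves the mapping 'predefined':

```python
def duration_similarity(t1, t2):
    """Map the sound length difference ratio R to a similarity in [0, 1].
    Both the form of R and the linear mapping are assumptions."""
    r = abs(t1 - t2) / max(t1, t2)  # assumed difference-ratio formula
    return 1.0 - r                  # one possible 'predefined mapping'
```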
Fig. 5 is a schematic flow chart of another embodiment of a speech evaluation method provided by the embodiment of the present disclosure, where fig. 5 is based on the embodiment shown in fig. 4, and further, after S103, the method further includes:
S104: And if the first similarity is not greater than the first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score.
One possible implementation is: and if the first similarity is smaller than a first preset threshold value, inputting the first voice into a voice evaluation model to obtain an evaluation score.
Another possible implementation is: and if the first similarity is less than or equal to a first preset threshold value, inputting the first voice into a voice evaluation model to obtain an evaluation score.
The speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions; the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
The first voice is input into the speech evaluation model to obtain scores of at least two evaluation dimensions, and the evaluation score is obtained according to
$$P = \sum_{n} w_n \, p_n$$
wherein $P$ is the evaluation score of the first voice, $w_n$ is the weight of the nth evaluation dimension, and $p_n$ is the score of the nth evaluation dimension, each $p_n$ being an integer from 0 to 5 inclusive (the published formula appears only as an image; this weighted-sum form follows from the surrounding description). For example: the speech evaluation model scores the first voice in five evaluation dimensions (swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking), obtaining the scores for 'no swallowing', 'no sentence-break error', 'no mechanical voice', 'normal speech rate', and 'no word stacking' shown in Table 1. The score of each evaluation dimension is multiplied by the weight of that dimension to obtain the 'single final score', the single final scores are added, and the evaluation score of the first voice, 2.4, is output. The weight of each evaluation dimension can be set according to specific requirements, which the present disclosure does not limit.
TABLE 1

|                    | No swallowing | No sentence-break error | No mechanical voice | Normal speech rate | No word stacking |
|--------------------|---------------|-------------------------|---------------------|--------------------|------------------|
| Score              | 3             | 2                       | 2                   | 3                  | 3                |
| Weight             | 0.2           | 0.2                     | 0.4                 | 0.1                | 0.1              |
| Single final score | 0.6           | 0.4                     | 0.8                 | 0.3                | 0.3              |
The speech evaluation model is obtained by training a neural network in advance on a sample set, where the sample set contains speech generated by the TTS model that exhibits swallowed syllables, sentence-break errors, mechanical voice, abnormally fast or slow speech rate, and word stacking, together with the corresponding evaluation-dimension scores. The speech evaluation model comprises an acoustic model and a scoring model; the acoustic model can adopt a Convolutional Neural Network (CNN) model, and the scoring model can adopt a multi-class Logistic Regression (LR) model. The speech evaluation model identifies the degree to which each voice sample in the sample set exhibits swallowed syllables, sentence-break errors, mechanical voice, abnormal speech rate, and word stacking, classifies the sample to obtain a score for each evaluation dimension, and finally weights and sums the dimension scores to obtain an evaluation score from 0 to 5.
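A worked example reproducing the weighted sum of Table 1 (the dictionary keys are illustrative labels, not the model's actual outputs):

```python
# Dimension scores and weights from Table 1; key names are illustrative.
scores = {"no_swallowing": 3, "no_sentence_break_error": 2,
          "no_mechanical_voice": 2, "normal_speech_rate": 3,
          "no_word_stacking": 3}
weights = {"no_swallowing": 0.2, "no_sentence_break_error": 0.2,
           "no_mechanical_voice": 0.4, "normal_speech_rate": 0.1,
           "no_word_stacking": 0.1}

evaluation_score = sum(scores[d] * weights[d] for d in scores)
print(round(evaluation_score, 2))  # 2.4, matching Table 1
```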
S105: and determining an evaluation result of the first voice according to the evaluation score.
One possible implementation is: judging whether the evaluation score is larger than a second preset threshold value, if so, determining that the evaluation result of the first voice is a high-quality voice; if not, judging whether the evaluation score is smaller than a third preset threshold value, and if so, determining that the evaluation result of the first voice is poor voice; and if not, determining the evaluation result of the first voice in a manual evaluation mode.
Another possible implementation is: judging whether the evaluation score is greater than or equal to a second preset threshold, if so, determining that the evaluation result of the first voice is a high-quality voice; if not, judging whether the evaluation score is less than or equal to a third preset threshold, and if so, determining that the evaluation result of the first voice is poor voice; and if not, determining the evaluation result of the first voice in a manual evaluation mode. For example, a second preset threshold value is set to be 4, a third preset threshold value is set to be 2, and if the evaluation score is greater than or equal to 4, the evaluation result of the first voice is a high-quality voice; and if the evaluation score is less than or equal to 2, the evaluation result of the first voice is poor voice, and if the evaluation score is more than 2 and less than 4, the evaluation result of the first voice is determined through manual evaluation.
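A sketch of the threshold logic using the example thresholds of 4 and 2, in the inclusive variant described above:

```python
def classify(evaluation_score, high=4.0, low=2.0):
    """Decision rule with the example thresholds (second preset threshold 4,
    third preset threshold 2), using the inclusive comparisons."""
    if evaluation_score >= high:
        return "high-quality voice"
    if evaluation_score <= low:
        return "poor-quality voice"
    return "manual evaluation required"
```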
In this embodiment, if the first similarity is not greater than the first preset threshold, the first voice is input into the voice evaluation model to obtain an evaluation score, and an evaluation result of the first voice is determined according to the evaluation score. Because the voice evaluation model is a pre-trained neural network model for scoring the first voice based on a plurality of evaluation dimensions, the voice evaluation model is used for evaluating the first voice, the speed of voice evaluation is improved, and the accuracy and comprehensiveness of a voice evaluation result are improved, so that the efficiency of voice evaluation is improved.
Optionally, if the evaluation result of the first voice is poor-quality audio, the first voice and its evaluation result are used as a training sample and input into the TTS model to optimize the TTS model, thereby improving the robustness of the TTS model.
Fig. 6 is a schematic flow chart of another embodiment of a speech evaluation method provided by the present disclosure, where fig. 6 is based on any one of the embodiments shown in fig. 1 to fig. 5, and before S102, the method may further include the following steps:
S601: And recognizing the first voice according to an automatic speech recognition algorithm to generate a first text.
An Automatic Speech Recognition (ASR) algorithm takes speech as its object of study and converts a speech signal into corresponding text output through speech signal processing and pattern recognition.
S602: and comparing the test text with the first text to obtain a second similarity.
One possible implementation is: comparing the test text with the first text according to the Levenshtein distance to obtain the second similarity.
The Levenshtein distance refers to the minimum number of editing operations required to convert one of two character strings into the other; the editing operations mainly comprise: inserting a character, deleting a character, and replacing a character with another character. The smaller the Levenshtein distance between two strings, the more similar they are.
For example, if the content of the test text is string1, the content of the first text is string2, and string1 requires M editing operations to be converted into string2, the second similarity is obtained according to
$$S_2 = 1 - \frac{M}{\max(L_1, L_2)}$$
wherein $M$ is the Levenshtein distance between the test text and the first text, $L_1$ is the length of the content of the test text, and $L_2$ is the length of the content of the first text (the published formula appears only as an image; this normalised form is reconstructed from the variable definitions).
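A sketch of the second-similarity computation; the Levenshtein routine is standard dynamic programming, and the normalisation by the longer length is the reconstruction assumed above:

```python
def levenshtein(s1, s2):
    """Minimum number of insert/delete/substitute operations turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # delete c1
                           cur[j - 1] + 1,               # insert c2
                           prev[j - 1] + (c1 != c2)))    # substitute
        prev = cur
    return prev[-1]

def second_similarity(test_text, first_text):
    """1 - M / max(L1, L2); the normalisation is an assumption. Note: the
    disclosure's worked example reports 0.67, and the exact normalisation it
    used is not recoverable from the published text."""
    m = levenshtein(test_text, first_text)
    return 1 - m / max(len(test_text), len(first_text))
```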
S603: and judging whether the second similarity is larger than a fourth preset threshold value.
One possible implementation is: judging whether the second similarity is greater than the fourth preset threshold; if so, executing S102; if not, executing S604.
Another possible implementation is: judging whether the second similarity is greater than or equal to the fourth preset threshold; if so, executing S102; if not, executing S604.
S604: and determining that the evaluation result of the first voice is poor voice.
For example: the test text is 'big good, i.e. small red from Beijing', the first voice is recognized according to an ASR algorithm, the generated first text is 'big good, i.e. small red', the test text is compared with the first text, the second similarity is obtained to be 0.67, the fourth preset threshold is 0.9, the second similarity is not greater than the fourth preset threshold, and then the evaluation result of the first voice is determined to be poor voice.
In this embodiment, before the first similarity between the first voice and the second voice is obtained according to the audio features of the first voice and the audio features of the second voice corresponding to the test text, the first voice is recognized by an automatic speech recognition algorithm to generate a first text, and the test text is compared with the first text to obtain the second similarity. This recognizes cases where the first voice stalls or drops content, so poor-quality speech is identified more quickly, thereby improving the efficiency of speech evaluation.
Optionally, before S601, the method may further include: acquiring a tone output log of the first voice, and determining whether the tone output log of the first voice is consistent with the standard tone output log corresponding to the test text. If the tone output log of the first voice is inconsistent with the standard tone output log corresponding to the test text, the evaluation result of the first voice is determined to be poor-quality audio.
Taking the test text '大家好，我是来自北京的小红' as an example, the standard tone output log corresponding to the test text is 'da4 jia1 hao3, wo3 shi4 lai2 zi4 bei3 jing1 de0 xiao3 hong2'. If the tone output log of the first voice is 'da4 jia4 hao4, wo1 shi1 lai1 zi3 bei3 jing1 de0 xiao3 hong2', the tone output log of the first voice is inconsistent with the standard tone output log corresponding to the test text, and the evaluation result of the first voice is determined to be poor-quality audio.
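A sketch of the tone-log check, assuming the log is a whitespace-separated pinyin-with-tone string like the examples above:

```python
def tone_logs_match(candidate_log, standard_log):
    """Token-by-token comparison of pinyin-with-tone output logs."""
    return candidate_log.split() == standard_log.split()

standard = "da4 jia1 hao3 wo3 shi4 lai2 zi4 bei3 jing1 de0 xiao3 hong2"
observed = "da4 jia4 hao4 wo1 shi1 lai1 zi3 bei3 jing1 de0 xiao3 hong2"
print(tone_logs_match(observed, standard))  # False -> poor-quality audio
```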
Fig. 7 is a schematic structural diagram of a speech evaluation apparatus provided in the embodiment of the present disclosure, where the apparatus of the embodiment includes: an acquisition module 701 and a processing module 702.
The acquiring module 701 is configured to input the test text into the speech synthesis model, and acquire a first speech corresponding to the test text output by the speech synthesis model;
the processing module 702 is configured to obtain a first similarity between a first voice and a second voice according to an audio feature of the first voice and an audio feature of the second voice corresponding to the test text;
the processing module 702 is further configured to determine an evaluation result of the first speech according to the first similarity and the known second speech evaluation result.
Optionally, the audio features include: amplitude and frequency;
the processing module 702 is specifically configured to:
acquiring first sound wave waveforms respectively corresponding to all first voice segments of a first voice;
acquiring second sound waveforms respectively corresponding to all second voice segments of the second voice;
acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform;
and obtaining a first similarity according to the first result and the second result.
Optionally, the processing module 702 is specifically configured to:
performing cross-comparison calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude;
performing a similarity hash operation on the first sound wave waveform to obtain a first similarity hash operation result;
performing a similarity hash operation on the second sound wave waveform to obtain a second similarity hash operation result;
and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
Optionally, the processing module 702 is specifically configured to:
performing voice segmentation on the first voice to obtain at least two first voice segments;
fourier transformation is respectively carried out on at least two first voice segments to obtain first sound wave waveforms respectively corresponding to all voice segments of the first voice;
performing voice segmentation on the second voice to obtain at least two second voice segments;
and respectively carrying out Fourier transform on the at least two second voice segments to obtain second sound waveforms respectively corresponding to all the voice segments of the second voice.
Optionally, the processing module 702 is specifically configured to:
according to
$$Q = \frac{X + Y}{2N}$$
obtaining the first similarity;
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments (formula reconstructed as above).
Optionally, the processing module 702 is specifically configured to:
and if the first similarity is larger than a first preset threshold value, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
Optionally, the processing module 702 is further configured to:
if the first similarity is not larger than a first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score;
determining an evaluation result of the first voice according to the evaluation score;
the speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions, the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
Optionally, the processing module 702 is specifically configured to:
if the evaluation score is larger than a second preset threshold, determining that the evaluation result of the first voice is a high-quality voice;
and if the evaluation score is smaller than a third preset threshold, determining that the evaluation result of the first voice is poor voice.
Optionally, the processing module 702 is further configured to:
recognizing the first voice according to an automatic voice recognition algorithm to generate a first text;
comparing the test text with the first text to obtain a second similarity;
and determining that the second similarity is greater than a fourth preset threshold.
Optionally, the processing module 702 is further configured to:
and if the second similarity is not greater than the fourth preset threshold, determining that the evaluation result of the first voice is poor voice.
The apparatus of this embodiment may be used to implement the technical solution of any one of the method embodiments shown in fig. 1 to fig. 6, and the implementation principle and the technical effect are similar, which are not described herein again.
The disclosed embodiment provides a computer device, including: the memory, the processor, and the computer program stored in the memory and capable of running on the processor, where the processor executes the computer program to implement the technical solution of any one of the method embodiments shown in fig. 1 to 6, and the implementation principle and the technical effect are similar, and are not described herein again.
The present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the solution of the method embodiment shown in any one of fig. 1 to 6.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A speech evaluation method, comprising:
inputting a test text into a voice synthesis model, and acquiring a first voice corresponding to the test text output by the voice synthesis model;
obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text;
determining an evaluation result of the first voice according to the first similarity and a known second voice evaluation result;
wherein the audio features include: amplitude and frequency;
the obtaining a first similarity between the first voice and the second voice according to the audio feature of the first voice and the audio feature of the second voice corresponding to the test text includes:
acquiring first sound wave waveforms respectively corresponding to all first voice segments of the first voice;
acquiring second sound wave waveforms respectively corresponding to all second voice segments of the second voice;
acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform;
obtaining the first similarity according to the first result and the second result;
the obtaining a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first acoustic waveform and the second acoustic waveform includes:
performing cross-comparison calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude;
performing a similarity hash operation on the first sound wave waveform to obtain a first similarity hash operation result;
performing a similarity hash operation on the second sound wave waveform to obtain a second similarity hash operation result;
and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
2. The method according to claim 1, wherein the obtaining of the first acoustic waveforms corresponding to all the voice segments of the first voice comprises:
performing voice segmentation on the first voice to obtain at least two first voice segments;
performing Fourier transform on the at least two first voice segments respectively to obtain first sound wave waveforms corresponding to all voice segments of the first voice respectively;
the obtaining of the second acoustic waveforms corresponding to all the voice segments of the second voice includes:
performing voice segmentation on the second voice to obtain at least two second voice segments;
and respectively carrying out Fourier transform on the at least two second voice segments to obtain second sound wave waveforms respectively corresponding to all the voice segments of the second voice.
3. The method according to claim 1 or 2, wherein the deriving the first similarity from the first result and the second result comprises:
according to
$$Q = \frac{X + Y}{2N}$$
obtaining the first similarity;
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments (formula reconstructed as above).
4. The method according to claim 1 or 2, wherein determining the evaluation result of the first speech according to the first similarity and the known second speech evaluation result comprises:
and if the first similarity is larger than a first preset threshold value, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
5. The method of claim 4, further comprising:
if the first similarity is not larger than the first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score;
determining an evaluation result of the first voice according to the evaluation score;
the speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions, the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
6. The method according to claim 5, wherein the determining an evaluation result of the first speech according to the evaluation score comprises:
if the evaluation score is larger than a second preset threshold value, determining that the evaluation result of the first voice is a high-quality voice;
and if the evaluation score is smaller than a third preset threshold value, determining that the evaluation result of the first voice is poor voice.
7. The method according to claim 1 or 2, wherein before obtaining the first similarity between the first speech and the second speech according to the audio feature of the first speech and the audio feature of the second speech corresponding to the test text, the method further comprises:
recognizing the first voice according to an automatic voice recognition algorithm to generate a first text;
comparing the test text with the first text to obtain a second similarity;
and determining that the second similarity is greater than a fourth preset threshold.
8. The method of claim 7, further comprising:
and if the second similarity is not greater than the fourth preset threshold, determining that the evaluation result of the first voice is poor-quality voice.
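Claims 7 and 8 gate the audio comparison on a text-level check: the synthesized audio is transcribed and compared with the test text. The sketch below stands in for a real ASR system with a hypothetical placeholder function and uses difflib for the text similarity; both choices are assumptions, since the patent does not name a specific recognition algorithm or text metric.

```python
from difflib import SequenceMatcher

def transcribe(audio) -> str:
    """Placeholder for an automatic speech recognition backend (hypothetical)."""
    raise NotImplementedError("plug in a real ASR system here")

def text_gate(test_text: str, first_text: str, threshold: float = 0.9) -> bool:
    """Claim 7/8 gate: compute the second similarity between the test text and
    the recognized text; at or below the (assumed) fourth preset threshold the
    first voice is immediately rated poor-quality."""
    second_similarity = SequenceMatcher(None, test_text, first_text).ratio()
    return second_similarity > threshold

# Usage (claim 8 fallback):
# if not text_gate(test_text, transcribe(first_voice)):
#     evaluation_result = "poor-quality voice"
```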
9. A speech evaluation apparatus, comprising:
an acquisition module, configured to input a test text into a speech synthesis model and acquire a first voice, output by the speech synthesis model, corresponding to the test text;
a processing module, configured to obtain a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text;
the processing module being further configured to determine an evaluation result of the first voice according to the first similarity and a known second voice evaluation result;
wherein the audio features include: amplitude and frequency;
the processing module being specifically configured to:
acquire first acoustic waveforms respectively corresponding to all first voice segments of the first voice;
acquire second acoustic waveforms respectively corresponding to all second voice segments of the second voice;
acquire a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first acoustic waveforms and the second acoustic waveforms;
and obtain the first similarity according to the first result and the second result;
the processing module being further specifically configured to:
perform cross-comparison calculation on the first acoustic waveforms and the second acoustic waveforms to obtain the first result corresponding to the amplitude;
perform a similarity hash operation on the first acoustic waveforms to obtain a first similarity hash operation result;
perform a similarity hash operation on the second acoustic waveforms to obtain a second similarity hash operation result;
and acquire the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain the second result corresponding to the frequency.
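As an illustration of the processing module's two intermediate results, the sketch below counts per-segment amplitude matches for the cross-comparison and uses a coarse quantize-and-hash signature in place of the similarity hash. The tolerance, the signature scheme, and the bin count are all assumptions; the patent does not specify the hash function.

```python
import numpy as np

def amplitude_cross_compare(first_spectra, second_spectra, tol=0.1) -> int:
    """Assumed form of the cross-comparison: count segment pairs whose
    amplitude envelopes agree within a relative tolerance (first result)."""
    matches = 0
    for a, b in zip(first_spectra, second_spectra):
        amp_a, amp_b = np.abs(a), np.abs(b)
        if np.linalg.norm(amp_a - amp_b) <= tol * np.linalg.norm(amp_b):
            matches += 1
    return matches

def similarity_hashes(spectra, n_bins=16) -> set:
    """Stand-in for the similarity hash: reduce each segment to its dominant
    frequency bins and hash that signature (hypothetical scheme)."""
    sigs = set()
    for spec in spectra:
        top = tuple(np.argsort(np.abs(spec))[-n_bins:])
        sigs.add(hash(top))
    return sigs

def frequency_intersection(first_spectra, second_spectra) -> int:
    """Second result: size of the intersection of the two hash result sets."""
    return len(similarity_hashes(first_spectra) & similarity_hashes(second_spectra))
```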
10. A computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202110386211.3A 2021-04-12 2021-04-12 Voice evaluation method, device, computer equipment and medium Active CN112802494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110386211.3A CN112802494B (en) 2021-04-12 2021-04-12 Voice evaluation method, device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN112802494A CN112802494A (en) 2021-05-14
CN112802494B CN112802494B (en) 2021-07-16

Family

ID=75817383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110386211.3A Active CN112802494B (en) 2021-04-12 2021-04-12 Voice evaluation method, device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN112802494B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450768A (en) * 2021-06-25 2021-09-28 平安科技(深圳)有限公司 Speech synthesis system evaluation method and device, readable storage medium and terminal equipment
CN113763918A (en) * 2021-08-18 2021-12-07 单百通 Text-to-speech conversion method and device, electronic equipment and readable storage medium
CN114898733A (en) * 2022-05-06 2022-08-12 深圳妙月科技有限公司 AI voice data analysis processing method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8050918B2 (en) * 2003-12-11 2011-11-01 Nuance Communications, Inc. Quality evaluation tool for dynamic voice portals
US8560318B2 (en) * 2010-05-14 2013-10-15 Sony Computer Entertainment Inc. Methods and system for evaluating potential confusion within grammar structure for set of statements to be used in speech recognition during computing event
KR101402805B1 (en) * 2012-03-27 2014-06-03 광주과학기술원 Voice analysis apparatus, voice synthesis apparatus, voice analysis synthesis system
US9384728B2 (en) * 2014-09-30 2016-07-05 International Business Machines Corporation Synthesizing an aggregate voice

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5772054B2 (en) * 2011-02-23 2015-09-02 ヤマハ株式会社 Singing evaluation device
JP5805474B2 (en) * 2011-09-09 2015-11-04 ブラザー工業株式会社 Voice evaluation apparatus, voice evaluation method, and program
CN102592589A (en) * 2012-02-23 2012-07-18 华南理工大学 Speech scoring method and device implemented through dynamically normalizing digital characteristics
CN103871426A (en) * 2012-12-13 2014-06-18 上海八方视界网络科技有限公司 Method and system for comparing similarity between user audio frequency and original audio frequency
CN108597538A (en) * 2018-03-05 2018-09-28 标贝(北京)科技有限公司 The evaluating method and system of speech synthesis system
CN108922563A (en) * 2018-06-17 2018-11-30 海南大学 Based on the visual verbal learning antidote of deviation organ morphology behavior
CN110726898A (en) * 2018-07-16 2020-01-24 北京映翰通网络技术股份有限公司 Power distribution network fault type identification method
CN109344388A (en) * 2018-08-02 2019-02-15 中央电视台 A kind of comment spam recognition methods, device and computer readable storage medium
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN109920431A (en) * 2019-03-05 2019-06-21 百度在线网络技术(北京)有限公司 Method and apparatus for output information
US20200286470A1 (en) * 2019-03-05 2020-09-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for outputting information
CN110400578A (en) * 2019-07-19 2019-11-01 广州市百果园信息技术有限公司 The generation of Hash codes and its matching process, device, electronic equipment and storage medium
CN110660383A (en) * 2019-09-20 2020-01-07 华南理工大学 Singing scoring method based on lyric and singing alignment
CN110853679A (en) * 2019-10-23 2020-02-28 百度在线网络技术(北京)有限公司 Speech synthesis evaluation method and device, electronic equipment and readable storage medium
CN111091816A (en) * 2020-03-19 2020-05-01 北京五岳鑫信息技术股份有限公司 Data processing system and method based on voice evaluation
CN111477251A (en) * 2020-05-21 2020-07-31 北京百度网讯科技有限公司 Model evaluation method and device and electronic equipment
CN111916108A (en) * 2020-07-24 2020-11-10 北京声智科技有限公司 Voice evaluation method and device
CN112397056A (en) * 2021-01-20 2021-02-23 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium

Also Published As

Publication number Publication date
CN112802494A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112802494B (en) Voice evaluation method, device, computer equipment and medium
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN101136199B (en) Voice data processing method and equipment
CN112397091B (en) Chinese speech comprehensive scoring and diagnosing system and method
US11158322B2 (en) Human resolution of repeated phrases in a hybrid transcription system
US9984677B2 (en) Bettering scores of spoken phrase spotting
US8494853B1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
US10490182B1 (en) Initializing and learning rate adjustment for rectifier linear unit based artificial neural networks
CN109036471B (en) Voice endpoint detection method and device
CN101887725A (en) Phoneme confusion network-based phoneme posterior probability calculation method
CN109461441B (en) Self-adaptive unsupervised intelligent sensing method for classroom teaching activities
CN110390948B (en) Method and system for rapid speech recognition
CN113920986A (en) Conference record generation method, device, equipment and storage medium
KR100682909B1 (en) Method and apparatus for recognizing speech
JP5050698B2 (en) Voice processing apparatus and program
JP5376341B2 (en) Model adaptation apparatus, method and program thereof
WO2018163279A1 (en) Voice processing device, voice processing method and voice processing program
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
JP2008176202A (en) Voice recognition device and voice recognition program
CN113823326B (en) Method for using training sample of high-efficiency voice keyword detector
CN112397048B (en) Speech synthesis pronunciation stability evaluation method, device and system and storage medium
CN112767961B (en) Accent correction method based on cloud computing
JPWO2020003413A1 (en) Information processing equipment, control methods, and programs
Saputri et al. Identifying Indonesian local languages on spontaneous speech data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant