CN112802494B - Voice evaluation method, device, computer equipment and medium

Voice evaluation method, device, computer equipment and medium

Info

Publication number
CN112802494B
CN112802494B (application CN202110386211.3A)
Authority
CN
China
Prior art keywords
voice
result
similarity
evaluation
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110386211.3A
Other languages
Chinese (zh)
Other versions
CN112802494A (en)
Inventor
赵明 (Zhao Ming)
田科 (Tian Ke)
潘建伟 (Pan Jianwei)
吴中勤 (Wu Zhongqin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110386211.3A
Publication of CN112802494A
Application granted
Publication of CN112802494B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Analysis techniques using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/69 Analysis techniques for evaluating synthetic or decoded voice signals
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Abstract

The disclosure relates to a voice evaluation method, apparatus, computer device, and medium. The voice evaluation method includes: inputting a test text into a speech synthesis model, and acquiring a first voice, corresponding to the test text, output by the speech synthesis model; obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text; and determining an evaluation result of the first voice according to the first similarity and a known evaluation result of the second voice. Because the evaluation result of the second voice is known, determining the evaluation result of the first voice from the first similarity and that known result shortens evaluation time, reduces the interference of subjective factors inherent in manual evaluation, and improves the accuracy of the evaluation result, thereby improving the efficiency of voice evaluation.

Description

Voice evaluation method, device, computer equipment and medium
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a speech evaluation method, apparatus, computer device, and medium.
Background
Text-to-speech (TTS) technology can convert text into speech output. With the rapid development of the artificial intelligence industry, TTS is widely applied in scenarios such as voice assistants, map navigation, and audiobook reading, and the quality requirements on the speech output by TTS keep rising.
In the prior art, a manual evaluation mode is generally adopted: the speech output by the TTS model is evaluated and scored through human listening tests. For example, different auditors each score the speech to be evaluated to obtain a Mean Opinion Score (MOS) value; the score ranges from 0 to 5, and a larger score indicates better voice quality.
However, evaluating speech in a manual evaluation manner is inefficient.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present disclosure provides a speech evaluation method, apparatus, computer device, and medium.
In a first aspect, the present disclosure provides a speech evaluation method, including:
inputting a test text into a voice synthesis model, and acquiring a first voice corresponding to the test text output by the voice synthesis model;
obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text;
and determining the evaluation result of the first voice according to the first similarity and the known second voice evaluation result.
Optionally, the audio features include: amplitude and frequency;
the obtaining a first similarity between the first voice and the second voice according to the audio feature of the first voice and the audio feature of the second voice corresponding to the test text includes:
acquiring first sound wave waveforms respectively corresponding to all first voice segments of the first voice;
acquiring second sound wave waveforms respectively corresponding to all second voice segments of the second voice;
acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform;
and obtaining the first similarity according to the first result and the second result.
Optionally, the obtaining a first result corresponding to an amplitude and a second result corresponding to a frequency according to the first acoustic waveform and the second acoustic waveform includes:
performing cross-comparison calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude;
performing a similarity hash operation on the first sound wave waveform to obtain a first similarity hash operation result;
performing a similarity hash operation on the second sound wave waveform to obtain a second similarity hash operation result;
and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
Optionally, the acquiring first sound wave waveforms respectively corresponding to all the voice segments of the first voice includes:
performing voice segmentation on the first voice to obtain at least two first voice segments;
performing Fourier transform on the at least two first voice segments respectively to obtain first sound wave waveforms corresponding to all voice segments of the first voice respectively;
the obtaining of the second acoustic waveforms corresponding to all the voice segments of the second voice includes:
performing voice segmentation on the second voice to obtain at least two second voice segments;
and respectively carrying out Fourier transform on the at least two second voice segments to obtain second sound wave waveforms respectively corresponding to all the voice segments of the second voice.
Optionally, the obtaining the first similarity according to the first result and the second result includes:
obtaining the first similarity according to
$$Q = \frac{X + Y}{2N}$$
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments (the published formula appears only as an image; this form is reconstructed from the stated variable definitions).
Optionally, the determining an evaluation result of the first speech according to the first similarity and a known second speech evaluation result includes:
and if the first similarity is larger than a first preset threshold value, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
Optionally, the method further includes:
if the first similarity is not larger than the first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score;
determining an evaluation result of the first voice according to the evaluation score;
the speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions, the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
Optionally, the determining an evaluation result of the first speech according to the evaluation score includes:
if the evaluation score is larger than a second preset threshold value, determining that the evaluation result of the first voice is a high-quality voice;
and if the evaluation score is smaller than a third preset threshold value, determining that the evaluation result of the first voice is poor voice.
Optionally, before obtaining the first similarity between the first voice and the second voice according to the audio feature of the first voice and the audio feature of the second voice corresponding to the test text, the method further includes:
recognizing the first voice according to an automatic voice recognition algorithm to generate a first text;
comparing the test text with the first text to obtain a second similarity;
and determining that the second similarity is greater than a fourth preset threshold.
Optionally, the method further includes:
and if the second similarity is not greater than the fourth preset threshold, determining that the evaluation result of the first voice is poor voice.
In a second aspect, the present disclosure provides a speech evaluation apparatus, including:
the acquisition module is used for inputting a test text into a voice synthesis model and acquiring a first voice corresponding to the test text output by the voice synthesis model;
the processing module is used for obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text;
the processing module is further configured to determine an evaluation result of the first voice according to the first similarity and a known second voice evaluation result.
Optionally, the audio features include: amplitude and frequency;
the processing module is specifically configured to:
acquiring first sound wave waveforms respectively corresponding to all first voice segments of the first voice;
acquiring second sound wave waveforms respectively corresponding to all second voice segments of the second voice;
acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform;
and obtaining the first similarity according to the first result and the second result.
Optionally, the processing module is specifically configured to:
performing cross-comparison calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude;
performing a similarity hash operation on the first sound wave waveform to obtain a first similarity hash operation result;
performing a similarity hash operation on the second sound wave waveform to obtain a second similarity hash operation result;
and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
Optionally, the processing module is specifically configured to:
performing voice segmentation on the first voice to obtain at least two first voice segments;
performing Fourier transform on the at least two first voice segments respectively to obtain first sound wave waveforms corresponding to all voice segments of the first voice respectively;
performing voice segmentation on the second voice to obtain at least two second voice segments;
and respectively carrying out Fourier transform on the at least two second voice segments to obtain second sound wave waveforms respectively corresponding to all the voice segments of the second voice.
Optionally, the processing module is specifically configured to:
according to
$$Q = \frac{X + Y}{2N}$$
obtaining the first similarity;
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments (formula reconstructed as above).
Optionally, the processing module is specifically configured to:
and if the first similarity is larger than a first preset threshold value, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
Optionally, the processing module is further configured to:
if the first similarity is not larger than the first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score;
determining an evaluation result of the first voice according to the evaluation score;
the speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions, the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
Optionally, the processing module is specifically configured to:
if the evaluation score is larger than a second preset threshold value, determining that the evaluation result of the first voice is a high-quality voice;
and if the evaluation score is smaller than a third preset threshold value, determining that the evaluation result of the first voice is poor voice.
Optionally, the processing module is further configured to:
recognizing the first voice according to an automatic voice recognition algorithm to generate a first text;
comparing the test text with the first text to obtain a second similarity;
and determining that the second similarity is greater than a fourth preset threshold.
Optionally, the processing module is further configured to:
and if the second similarity is not greater than the fourth preset threshold, determining that the evaluation result of the first voice is poor voice.
In a third aspect, the present disclosure provides a computer device comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of the first aspect when executing the computer program.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text; and determining an evaluation result of the first voice according to the first similarity and a known evaluation result of the second voice. Because the evaluation result of the second voice is known, determining the evaluation result of the first voice from the first similarity and that known result shortens evaluation time, reduces the interference of subjective factors inherent in manual evaluation, and improves the accuracy of the evaluation result, thereby improving the efficiency of voice evaluation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a speech evaluation method according to the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a speech assessment method provided by the present disclosure;
FIG. 3 is a schematic flow chart diagram illustrating another embodiment of a speech evaluation method according to the present disclosure;
FIG. 4 is a schematic flow chart diagram illustrating another embodiment of a speech evaluation method according to the present disclosure;
FIG. 5 is a schematic flow chart diagram illustrating another embodiment of a speech evaluation method according to the present disclosure;
FIG. 6 is a schematic flow chart diagram illustrating yet another embodiment of a speech evaluation method according to the present disclosure;
fig. 7 is a schematic structural diagram of a speech evaluation device provided by the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
TTS technology can convert arbitrary text information into audible speech information. Existing TTS techniques typically use neural network models to achieve text-to-speech conversion. After a TTS model outputs speech, the quality of the synthesized speech needs to be evaluated. At present, the speech output by a TTS model is generally evaluated and scored through human listening tests; however, manual evaluation is often subjective, which makes evaluation scores inaccurate, and its efficiency is low.
The present disclosure provides a speech evaluation method, including: inputting a test text into a speech synthesis model, and acquiring a first voice, corresponding to the test text, output by the speech synthesis model; obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text; and determining an evaluation result of the first voice according to the first similarity and a known evaluation result of the second voice. Because the evaluation result of the second voice is known, determining the evaluation result of the first voice from the first similarity and that known result shortens evaluation time, reduces the interference of subjective factors inherent in manual evaluation, and improves the accuracy of the evaluation result, thereby improving the efficiency of voice evaluation.
The technical solutions of the present disclosure are described in several specific embodiments, and the same or similar concepts may be referred to one another, and are not described in detail in each place.
Fig. 1 is a schematic flow chart of an embodiment of a speech evaluation method provided in an embodiment of the present disclosure, as shown in fig. 1, the method of the embodiment includes:
s101: and inputting the test text into a voice synthesis model, and acquiring a first voice corresponding to the test text output by the voice synthesis model.
The speech synthesis model can be a model built on a neural network, including but not limited to: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory networks (LSTMs); the present disclosure does not limit this.
Taking the test text '大家好，我是来自北京的小红' ('Hello everyone, I am Xiaohong from Beijing') as an example, the test text is input into the TTS model, and the first voice output by the TTS model corresponding to the test text, with pronunciation 'da4 jia1 hao3, wo3 shi4 lai2 zi4 bei3 jing1 de0 xiao3 hong2', is acquired.
S102: and obtaining a first similarity of the first voice and the second voice according to the audio characteristics of the first voice and the audio characteristics of the second voice corresponding to the test text.
The second voice can be obtained by recording a user reading the test text aloud, or by inputting the test text into another TTS model and acquiring the speech corresponding to the test text output by that TTS model.
Optionally, the audio features include: amplitude and frequency.
One possible implementation is shown in Fig. 2, as follows.
S1021: And acquiring first sound wave waveforms respectively corresponding to all first voice segments of the first voice.
Voice segmentation is performed on the first voice to obtain at least two first voice segments. For example, the first voice is segmented by a Voice Activity Detection (VAD) algorithm to obtain at least two first voice segments. VAD is generally used to identify silence segments in audio data and extract the pronounced segments.
Fourier transform is respectively performed on the at least two first voice segments to obtain first sound wave waveforms respectively corresponding to all the voice segments of the first voice. Taking the case where the first voice is divided into N first voice segments as an example, Fourier transform is performed on each of the N voice segments to obtain N first sound wave waveforms $F_1, F_2, \dots, F_N$, where $F_i$ is the ith first sound wave waveform, N is an integer greater than or equal to 2, and i is an integer greater than or equal to 1 and less than or equal to N (the symbols $F_i$ stand in for the waveform images in the original publication).
S1022: and acquiring second acoustic waveforms respectively corresponding to all second voice segments of the second voice.
Voice segmentation is performed on the second voice to obtain at least two second voice segments; similarly, the second voice can be segmented by VAD. Because the test text corresponding to the second voice is the same as that corresponding to the first voice, VAD segmentation of the second voice yields the same number of segments as the first voice. Alternatively, the second voice is segmented according to the number of first voice segments, so that the number of second voice segments equals the number of first voice segments.
Fourier transform is respectively performed on the at least two second voice segments to obtain second sound wave waveforms respectively corresponding to all the voice segments of the second voice. Taking the case where the second voice is divided into N second voice segments as an example, Fourier transform is performed on each of the N voice segments to obtain N second sound wave waveforms $G_1, G_2, \dots, G_N$, where $G_i$ is the ith second sound wave waveform, N is an integer greater than or equal to 2, and i is an integer greater than or equal to 1 and less than or equal to N.
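To make the segmentation-and-transform step concrete, the following is a minimal Python sketch of S1021/S1022. A simple energy-based splitter stands in for the VAD algorithm mentioned above, and all function names, the frame length, and the energy threshold are assumptions rather than anything the patent specifies:

```python
import numpy as np

def split_speech(wave, sr, frame_ms=30, energy_ratio=0.1):
    """Energy-based stand-in for the VAD step: return voiced segments of `wave`."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(wave) // frame
    frames = wave[: n_frames * frame].reshape(n_frames, frame)
    energy = (frames ** 2).mean(axis=1)
    voiced = energy > energy_ratio * energy.max()
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append(wave[start * frame : i * frame])
            start = None
    if start is not None:
        segments.append(wave[start * frame : n_frames * frame])
    return segments

def segment_spectra(wave, sr):
    """Fourier-transform every voiced segment; the magnitude spectra play the
    role of the per-segment sound wave waveforms F_i / G_i."""
    return [np.abs(np.fft.rfft(seg)) for seg in split_speech(wave, sr)]
```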
S1023: and acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform.
One possible implementation is shown in Fig. 3, as follows.
S10231: And performing cross-over-ratio (intersection-over-union) calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude.
According to
$$X = \sum_{i=1}^{N} \frac{S(F_i \cap G_i)}{S(F_i \cup G_i)}$$
the similarity between the amplitude of the first sound wave waveform and the amplitude of the second sound wave waveform is determined, giving the first result corresponding to the amplitude.
Wherein $X$ is the first result, $F_i$ is the ith first sound wave waveform, $G_i$ is the ith second sound wave waveform, $S(F_i)$ is the area of the ith first sound wave waveform, $S(G_i)$ is the area of the ith second sound wave waveform, i is an integer greater than or equal to 1 and less than or equal to N, and N is the number of first voice segments. (The published formula appears only as an image; the intersection-over-union form above is reconstructed from these variable definitions.)
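A sketch of the cross-over-ratio step under the reconstruction above, treating each curve sample-wise so that point-wise minima approximate the intersection area and point-wise maxima the union; this discretisation and the function names are assumptions:

```python
import numpy as np

def amplitude_iou(f, g):
    """Per-segment intersection-over-union of two magnitude curves: point-wise
    minima approximate the intersection area, point-wise maxima the union."""
    m = min(len(f), len(g))
    f, g = np.abs(f[:m]), np.abs(g[:m])
    union = np.maximum(f, g).sum()
    return np.minimum(f, g).sum() / union if union > 0 else 1.0

def first_result(first_waveforms, second_waveforms):
    """X: per-segment amplitude IoU summed over the N segment pairs."""
    return sum(amplitude_iou(f, g)
               for f, g in zip(first_waveforms, second_waveforms))
```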
S10232: and performing similarity Hash operation on the first sound wave waveform to obtain a first similarity Hash operation result.
A similarity-hash (simhash) operation generates a 64-bit signature from the shape of the sound wave waveform, encoding an upward movement of the waveform as 1 and a downward movement as 0.
S10233: and performing similarity Hash operation on the second acoustic waveform to obtain a second similarity Hash operation result.
S10234: and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
According to
$$Y = \sum_{i=1}^{N} \left(1 - \frac{D\big(H(F_i), H(G_i)\big)}{64}\right)$$
the similarity between the frequency of the first sound wave waveform and the frequency of the second sound wave waveform is determined, giving the second result corresponding to the frequency.
Wherein $Y$ is the second result, $F_i$ is the ith first sound wave waveform, $G_i$ is the ith second sound wave waveform, $H(F_i)$ is the first simhash operation result corresponding to the ith first sound wave waveform, $H(G_i)$ is the second simhash operation result corresponding to the ith second sound wave waveform, and $D(H(F_i), H(G_i))$ is the Hamming distance between the two; i is an integer greater than or equal to 1 and less than or equal to N, and N is the number of first voice segments. (The published formula appears only as an image; the form above, normalised by the 64-bit signature length, is reconstructed from these variable definitions.) The Hamming distance refers to the number of positions at which two character strings of the same length differ.
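A sketch of the simhash-and-Hamming step, following the up = 1 / down = 0 rule described above; the resampling to 64 transitions and the normalisation by the signature length are assumptions:

```python
import numpy as np

def waveform_simhash(curve, bits=64):
    """64-bit signature: resample the curve to bits+1 points and emit 1 where
    it rises between neighbours, 0 where it falls."""
    idx = np.linspace(0, len(curve) - 1, bits + 1).astype(int)
    pts = np.asarray(curve, dtype=float)[idx]
    sig = 0
    for a, b in zip(pts[:-1], pts[1:]):
        sig = (sig << 1) | (1 if b > a else 0)
    return sig

def hamming(a, b):
    """Hamming distance between two equal-width bit signatures."""
    return bin(a ^ b).count("1")

def second_result(first_waveforms, second_waveforms, bits=64):
    """Y: per-segment frequency similarity 1 - D/bits, summed over segments."""
    return sum(
        1 - hamming(waveform_simhash(f, bits), waveform_simhash(g, bits)) / bits
        for f, g in zip(first_waveforms, second_waveforms)
    )
```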
S1024: and obtaining a first similarity according to the first result and the second result.
Optionally, according to
$$Q = \frac{X + Y}{2N}$$
the first similarity is obtained;
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments. (The published formula appears only as an image; this averaged form is a reconstruction consistent with the variable definitions and keeps $Q$ in [0, 1].)
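Combining the two results under the reconstructed formula gives a one-line sketch of S1024 (the formula itself is an assumption, as noted above):

```python
def first_similarity(x, y, n):
    """Q = (X + Y) / (2N): the mean of the summed amplitude and frequency
    similarities over the N segment pairs (reconstructed formula)."""
    return (x + y) / (2 * n)

# Usage with the sketches above (fw, gw: lists of per-segment spectra):
# q = first_similarity(first_result(fw, gw), second_result(fw, gw), len(fw))
```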
S103: and determining the evaluation result of the first voice according to the first similarity and the known second voice evaluation result.
One possible implementation is shown in Fig. 4, as follows.
S1031: And if the first similarity is greater than a first preset threshold, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice.
The evaluation result of the second voice is high-quality voice or poor-quality voice. For example: if the evaluation result of the second voice is poor-quality voice and the first similarity is greater than 0.8, the evaluation result of the first voice is determined to be poor-quality voice.
Optionally, fig. 4 may also be based on the embodiments shown in fig. 2 or fig. 3.
Another possible implementation is:
and if the first similarity is larger than or equal to a first preset threshold, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
In this embodiment, a test text is input into a speech synthesis model, and a first speech corresponding to the test text output by the speech synthesis model is obtained; obtaining a first similarity of the first voice and the second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text; and determining the evaluation result of the first voice according to the first similarity and the known second voice evaluation result. Because the evaluation result of the second voice is known, the evaluation result of the first voice is determined according to the first similarity and the evaluation result of the second voice, so that the voice evaluation time is shortened, the interference of subjective factors of manual evaluation is reduced, and the accuracy of the evaluation result is improved, thereby improving the efficiency of voice evaluation.
Optionally, the audio features include: sound length (duration);
another possible implementation manner of S102 is:
S1021': And acquiring the sound length difference ratio of the first voice and the second voice according to the sound length of the first voice and the sound length of the second voice.
Optionally, the sound length difference ratio of the first voice and the second voice is acquired according to
$$R = \frac{|T_1 - T_2|}{\max(T_1, T_2)}$$
wherein $R$ is the sound length difference ratio, $T_1$ is the total duration of the first voice, and $T_2$ is the total duration of the second voice (the published formula appears only as an image; this form is reconstructed from the variable definitions).
S1022': And obtaining the first similarity of the first voice and the second voice according to the sound length difference ratio.
S1023': Specifically, the first similarity of the first voice and the second voice is obtained according to the sound length difference ratio and a predefined mapping between the sound length difference ratio and the first similarity.
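A sketch of this duration-based alternative; both the form of the difference ratio and the linear mapping to a similarity are assumptions, since the disclosure leaves the mapping 'predefined':

```python
def duration_similarity(t1, t2):
    """Map the sound length difference ratio R to a similarity in [0, 1].
    Both the form of R and the linear mapping are assumptions."""
    r = abs(t1 - t2) / max(t1, t2)  # assumed difference-ratio formula
    return 1.0 - r                  # one possible 'predefined mapping'
```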
Fig. 5 is a schematic flow chart of another embodiment of a speech evaluation method provided by the embodiment of the present disclosure, where fig. 5 is based on the embodiment shown in fig. 4, and further, after S103, the method further includes:
S104: And if the first similarity is not greater than the first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score.
One possible implementation is: and if the first similarity is smaller than a first preset threshold value, inputting the first voice into a voice evaluation model to obtain an evaluation score.
Another possible implementation is: and if the first similarity is less than or equal to a first preset threshold value, inputting the first voice into a voice evaluation model to obtain an evaluation score.
The speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions; the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
The first voice is input into the speech evaluation model to obtain scores of at least two evaluation dimensions, and the evaluation score is obtained according to
$$P = \sum_{n} w_n \, p_n$$
wherein $P$ is the evaluation score of the first voice, $w_n$ is the weight of the nth evaluation dimension, and $p_n$ is the score of the nth evaluation dimension, each $p_n$ being an integer from 0 to 5 inclusive (the published formula appears only as an image; this weighted-sum form follows from the surrounding description). For example: the speech evaluation model scores the first voice in five evaluation dimensions (swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking), obtaining the scores for 'no swallowing', 'no sentence-break error', 'no mechanical voice', 'normal speech rate', and 'no word stacking' shown in Table 1. The score of each evaluation dimension is multiplied by the weight of that dimension to obtain the 'single final score', the single final scores are added, and the evaluation score of the first voice, 2.4, is output. The weight of each evaluation dimension can be set according to specific requirements, which the present disclosure does not limit.
TABLE 1

|                    | No swallowing | No sentence-break error | No mechanical voice | Normal speech rate | No word stacking |
|--------------------|---------------|-------------------------|---------------------|--------------------|------------------|
| Score              | 3             | 2                       | 2                   | 3                  | 3                |
| Weight             | 0.2           | 0.2                     | 0.4                 | 0.1                | 0.1              |
| Single final score | 0.6           | 0.4                     | 0.8                 | 0.3                | 0.3              |
The speech evaluation model is obtained by training a neural network in advance on a sample set, where the sample set contains speech generated by the TTS model that exhibits swallowed syllables, sentence-break errors, mechanical voice, abnormally fast or slow speech rate, and word stacking, together with the corresponding evaluation-dimension scores. The speech evaluation model comprises an acoustic model and a scoring model; the acoustic model can adopt a Convolutional Neural Network (CNN) model, and the scoring model can adopt a multi-class Logistic Regression (LR) model. The speech evaluation model identifies the degree to which each voice sample in the sample set exhibits swallowed syllables, sentence-break errors, mechanical voice, abnormal speech rate, and word stacking, classifies the sample to obtain a score for each evaluation dimension, and finally weights and sums the dimension scores to obtain an evaluation score from 0 to 5.
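A worked example reproducing the weighted sum of Table 1 (the dictionary keys are illustrative labels, not the model's actual outputs):

```python
# Dimension scores and weights from Table 1; key names are illustrative.
scores = {"no_swallowing": 3, "no_sentence_break_error": 2,
          "no_mechanical_voice": 2, "normal_speech_rate": 3,
          "no_word_stacking": 3}
weights = {"no_swallowing": 0.2, "no_sentence_break_error": 0.2,
           "no_mechanical_voice": 0.4, "normal_speech_rate": 0.1,
           "no_word_stacking": 0.1}

evaluation_score = sum(scores[d] * weights[d] for d in scores)
print(round(evaluation_score, 2))  # 2.4, matching Table 1
```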
S105: and determining an evaluation result of the first voice according to the evaluation score.
One possible implementation is: judging whether the evaluation score is larger than a second preset threshold value, if so, determining that the evaluation result of the first voice is a high-quality voice; if not, judging whether the evaluation score is smaller than a third preset threshold value, and if so, determining that the evaluation result of the first voice is poor voice; and if not, determining the evaluation result of the first voice in a manual evaluation mode.
Another possible implementation is: judging whether the evaluation score is greater than or equal to a second preset threshold, if so, determining that the evaluation result of the first voice is a high-quality voice; if not, judging whether the evaluation score is less than or equal to a third preset threshold, and if so, determining that the evaluation result of the first voice is poor voice; and if not, determining the evaluation result of the first voice in a manual evaluation mode. For example, a second preset threshold value is set to be 4, a third preset threshold value is set to be 2, and if the evaluation score is greater than or equal to 4, the evaluation result of the first voice is a high-quality voice; and if the evaluation score is less than or equal to 2, the evaluation result of the first voice is poor voice, and if the evaluation score is more than 2 and less than 4, the evaluation result of the first voice is determined through manual evaluation.
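A sketch of the threshold logic using the example thresholds of 4 and 2, in the inclusive variant described above:

```python
def classify(evaluation_score, high=4.0, low=2.0):
    """Decision rule with the example thresholds (second preset threshold 4,
    third preset threshold 2), using the inclusive comparisons."""
    if evaluation_score >= high:
        return "high-quality voice"
    if evaluation_score <= low:
        return "poor-quality voice"
    return "manual evaluation required"
```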
In this embodiment, if the first similarity is not greater than the first preset threshold, the first voice is input into the voice evaluation model to obtain an evaluation score, and an evaluation result of the first voice is determined according to the evaluation score. Because the voice evaluation model is a pre-trained neural network model for scoring the first voice based on a plurality of evaluation dimensions, the voice evaluation model is used for evaluating the first voice, the speed of voice evaluation is improved, and the accuracy and comprehensiveness of a voice evaluation result are improved, so that the efficiency of voice evaluation is improved.
Optionally, if the evaluation result of the first voice is poor-quality audio, the first voice and its evaluation result are used as a training sample and input into the TTS model to optimize the TTS model, thereby improving the robustness of the TTS model.
Fig. 6 is a schematic flow chart of another embodiment of a speech evaluation method provided by the present disclosure, where fig. 6 is based on any one of the embodiments shown in fig. 1 to fig. 5, and before S102, the method may further include the following steps:
S601: And recognizing the first voice according to an automatic speech recognition algorithm to generate a first text.
An Automatic Speech Recognition (ASR) algorithm takes speech as its object of study and converts a speech signal into corresponding text output through speech signal processing and pattern recognition.
S602: and comparing the test text with the first text to obtain a second similarity.
One possible implementation is: comparing the test text with the first text according to the Levenshtein distance to obtain the second similarity.
The Levenshtein distance refers to the minimum number of editing operations required to convert one of two character strings into the other; the editing operations mainly comprise: inserting a character, deleting a character, and replacing a character with another character. The smaller the Levenshtein distance between two strings, the more similar they are.
For example, if the content of the test text is string1, the content of the first text is string2, and string1 requires M editing operations to be converted into string2, the second similarity is obtained according to
$$S_2 = 1 - \frac{M}{\max(L_1, L_2)}$$
wherein $M$ is the Levenshtein distance between the test text and the first text, $L_1$ is the length of the content of the test text, and $L_2$ is the length of the content of the first text (the published formula appears only as an image; this normalised form is reconstructed from the variable definitions).
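A sketch of the second-similarity computation; the Levenshtein routine is standard dynamic programming, and the normalisation by the longer length is the reconstruction assumed above:

```python
def levenshtein(s1, s2):
    """Minimum number of insert/delete/substitute operations turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # delete c1
                           cur[j - 1] + 1,               # insert c2
                           prev[j - 1] + (c1 != c2)))    # substitute
        prev = cur
    return prev[-1]

def second_similarity(test_text, first_text):
    """1 - M / max(L1, L2); the normalisation is an assumption. Note: the
    disclosure's worked example reports 0.67, and the exact normalisation it
    used is not recoverable from the published text."""
    m = levenshtein(test_text, first_text)
    return 1 - m / max(len(test_text), len(first_text))
```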
S603: and judging whether the second similarity is larger than a fourth preset threshold value.
One possible implementation is: judging whether the second similarity is greater than the fourth preset threshold; if so, executing S102; if not, executing S604.
Another possible implementation is: judging whether the second similarity is greater than or equal to the fourth preset threshold; if so, executing S102; if not, executing S604.
S604: and determining that the evaluation result of the first voice is poor voice.
For example: the test text is 'big good, i.e. small red from Beijing', the first voice is recognized according to an ASR algorithm, the generated first text is 'big good, i.e. small red', the test text is compared with the first text, the second similarity is obtained to be 0.67, the fourth preset threshold is 0.9, the second similarity is not greater than the fourth preset threshold, and then the evaluation result of the first voice is determined to be poor voice.
In this embodiment, before the first similarity between the first voice and the second voice is obtained according to the audio features of the first voice and the audio features of the second voice corresponding to the test text, the first voice is recognized by an automatic speech recognition algorithm to generate a first text, and the test text is compared with the first text to obtain the second similarity. This recognizes cases where the first voice stalls or drops content, so poor-quality speech is identified more quickly, thereby improving the efficiency of speech evaluation.
Optionally, before S601, the method may further include: acquiring a tone output log of the first voice, and determining whether the tone output log of the first voice is consistent with the standard tone output log corresponding to the test text. If the tone output log of the first voice is inconsistent with the standard tone output log corresponding to the test text, the evaluation result of the first voice is determined to be poor-quality audio.
Taking the test text '大家好，我是来自北京的小红' as an example, the standard tone output log corresponding to the test text is 'da4 jia1 hao3, wo3 shi4 lai2 zi4 bei3 jing1 de0 xiao3 hong2'. If the tone output log of the first voice is 'da4 jia4 hao4, wo1 shi1 lai1 zi3 bei3 jing1 de0 xiao3 hong2', the tone output log of the first voice is inconsistent with the standard tone output log corresponding to the test text, and the evaluation result of the first voice is determined to be poor-quality audio.
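A sketch of the tone-log check, assuming the log is a whitespace-separated pinyin-with-tone string like the examples above:

```python
def tone_logs_match(candidate_log, standard_log):
    """Token-by-token comparison of pinyin-with-tone output logs."""
    return candidate_log.split() == standard_log.split()

standard = "da4 jia1 hao3 wo3 shi4 lai2 zi4 bei3 jing1 de0 xiao3 hong2"
observed = "da4 jia4 hao4 wo1 shi1 lai1 zi3 bei3 jing1 de0 xiao3 hong2"
print(tone_logs_match(observed, standard))  # False -> poor-quality audio
```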
Fig. 7 is a schematic structural diagram of a speech evaluation apparatus provided in the embodiment of the present disclosure, where the apparatus of the embodiment includes: an acquisition module 701 and a processing module 702.
The acquiring module 701 is configured to input the test text into the speech synthesis model, and acquire a first speech corresponding to the test text output by the speech synthesis model;
the processing module 702 is configured to obtain a first similarity between a first voice and a second voice according to an audio feature of the first voice and an audio feature of the second voice corresponding to the test text;
the processing module 702 is further configured to determine an evaluation result of the first speech according to the first similarity and the known second speech evaluation result.
Optionally, the audio features include: amplitude and frequency;
the processing module 702 is specifically configured to:
acquiring first sound wave waveforms respectively corresponding to all first voice segments of a first voice;
acquiring second sound waveforms respectively corresponding to all second voice segments of the second voice;
acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform;
and obtaining a first similarity according to the first result and the second result.
Optionally, the processing module 702 is specifically configured to:
performing cross-comparison calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude;
performing a similarity hash operation on the first sound wave waveform to obtain a first similarity hash operation result;
performing a similarity hash operation on the second sound wave waveform to obtain a second similarity hash operation result;
and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
Optionally, the processing module 702 is specifically configured to:
performing voice segmentation on the first voice to obtain at least two first voice segments;
fourier transformation is respectively carried out on at least two first voice segments to obtain first sound wave waveforms respectively corresponding to all voice segments of the first voice;
performing voice segmentation on the second voice to obtain at least two second voice segments;
and respectively carrying out Fourier transform on the at least two second voice segments to obtain second sound waveforms respectively corresponding to all the voice segments of the second voice.
Optionally, the processing module 702 is specifically configured to:
according to
$$Q = \frac{X + Y}{2N}$$
obtaining the first similarity;
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments (formula reconstructed as above).
Optionally, the processing module 702 is specifically configured to:
and if the first similarity is larger than a first preset threshold value, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
Optionally, the processing module 702 is further configured to:
if the first similarity is not larger than a first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score;
determining an evaluation result of the first voice according to the evaluation score;
the speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions, the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
Optionally, the processing module 702 is specifically configured to:
if the evaluation score is larger than a second preset threshold, determining that the evaluation result of the first voice is a high-quality voice;
and if the evaluation score is smaller than a third preset threshold, determining that the evaluation result of the first voice is poor voice.
Optionally, the processing module 702 is further configured to:
recognizing the first voice according to an automatic voice recognition algorithm to generate a first text;
comparing the test text with the first text to obtain a second similarity;
and determining that the second similarity is greater than a fourth preset threshold.
Optionally, the processing module 702 is further configured to:
and if the second similarity is not greater than the fourth preset threshold, determining that the evaluation result of the first voice is poor voice.
The apparatus of this embodiment may be used to implement the technical solution of any one of the method embodiments shown in fig. 1 to fig. 6, and the implementation principle and the technical effect are similar, which are not described herein again.
The disclosed embodiment provides a computer device, including: the memory, the processor, and the computer program stored in the memory and capable of running on the processor, where the processor executes the computer program to implement the technical solution of any one of the method embodiments shown in fig. 1 to 6, and the implementation principle and the technical effect are similar, and are not described herein again.
The present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the solution of the method embodiment shown in any one of fig. 1 to 6.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A speech evaluation method, comprising:
inputting a test text into a voice synthesis model, and acquiring a first voice corresponding to the test text output by the voice synthesis model;
obtaining a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text;
determining an evaluation result of the first voice according to the first similarity and a known second voice evaluation result;
wherein the audio features include: amplitude and frequency;
the obtaining a first similarity between the first voice and the second voice according to the audio feature of the first voice and the audio feature of the second voice corresponding to the test text includes:
acquiring first sound wave waveforms respectively corresponding to all first voice segments of the first voice;
acquiring second sound wave waveforms respectively corresponding to all second voice segments of the second voice;
acquiring a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first sound wave waveform and the second sound wave waveform;
obtaining the first similarity according to the first result and the second result;
the obtaining a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first acoustic waveform and the second acoustic waveform includes:
performing cross-comparison calculation on the first sound wave waveform and the second sound wave waveform to obtain a first result corresponding to the amplitude;
performing a similarity hash operation on the first sound wave waveform to obtain a first similarity hash operation result;
performing a similarity hash operation on the second sound wave waveform to obtain a second similarity hash operation result;
and acquiring the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain a second result corresponding to the frequency.
2. The method according to claim 1, wherein the obtaining of the first acoustic waveforms corresponding to all the voice segments of the first voice comprises:
performing voice segmentation on the first voice to obtain at least two first voice segments;
performing Fourier transform on the at least two first voice segments respectively to obtain first sound wave waveforms corresponding to all voice segments of the first voice respectively;
the obtaining of the second acoustic waveforms corresponding to all the voice segments of the second voice includes:
performing voice segmentation on the second voice to obtain at least two second voice segments;
and respectively carrying out Fourier transform on the at least two second voice segments to obtain second sound wave waveforms respectively corresponding to all the voice segments of the second voice.
3. The method according to claim 1 or 2, wherein the deriving the first similarity from the first result and the second result comprises:
according to
$$Q = \frac{X + Y}{2N}$$
obtaining the first similarity;
wherein $Q$ is the first similarity, $X$ is the first result, $Y$ is the second result, and $N$ is the number of first voice segments (formula reconstructed as above).
4. The method according to claim 1 or 2, wherein determining the evaluation result of the first speech according to the first similarity and the known second speech evaluation result comprises:
and if the first similarity is larger than a first preset threshold value, determining that the evaluation result of the first voice is consistent with the evaluation result of the second voice, wherein the evaluation result of the second voice is high-quality voice or poor-quality voice.
5. The method of claim 4, further comprising:
if the first similarity is not larger than the first preset threshold, inputting the first voice into a voice evaluation model to obtain an evaluation score;
determining an evaluation result of the first voice according to the evaluation score;
the speech evaluation model outputs the evaluation score according to scores of at least two evaluation dimensions, the evaluation score corresponding to the first voice is a weighted sum of the scores of the at least two evaluation dimensions, and the evaluation dimensions include at least two of the following: swallowed syllables, sentence-break errors, mechanical voice, speech rate, and word stacking.
6. The method according to claim 5, wherein the determining an evaluation result of the first speech according to the evaluation score comprises:
if the evaluation score is larger than a second preset threshold value, determining that the evaluation result of the first voice is a high-quality voice;
and if the evaluation score is smaller than a third preset threshold value, determining that the evaluation result of the first voice is poor voice.
7. The method according to claim 1 or 2, wherein before obtaining the first similarity between the first speech and the second speech according to the audio feature of the first speech and the audio feature of the second speech corresponding to the test text, the method further comprises:
recognizing the first voice according to an automatic voice recognition algorithm to generate a first text;
comparing the test text with the first text to obtain a second similarity;
and determining that the second similarity is greater than a fourth preset threshold.
8. The method of claim 7, further comprising:
and if the second similarity is not greater than the fourth preset threshold, determining that the evaluation result of the first voice is poor-quality voice.
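Claims 7 and 8 gate the audio comparison on a text-level check: the synthesized audio is transcribed and compared with the test text. The sketch below stands in for a real ASR system with a hypothetical placeholder function and uses difflib for the text similarity; both choices are assumptions, since the patent does not name a specific recognition algorithm or text metric.

```python
from difflib import SequenceMatcher

def transcribe(audio) -> str:
    """Placeholder for an automatic speech recognition backend (hypothetical)."""
    raise NotImplementedError("plug in a real ASR system here")

def text_gate(test_text: str, first_text: str, threshold: float = 0.9) -> bool:
    """Claim 7/8 gate: compute the second similarity between the test text and
    the recognized text; at or below the (assumed) fourth preset threshold the
    first voice is immediately rated poor-quality."""
    second_similarity = SequenceMatcher(None, test_text, first_text).ratio()
    return second_similarity > threshold

# Usage (claim 8 fallback):
# if not text_gate(test_text, transcribe(first_voice)):
#     evaluation_result = "poor-quality voice"
```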
9. A speech evaluation apparatus, comprising:
an acquisition module, configured to input a test text into a speech synthesis model and acquire a first voice, output by the speech synthesis model, corresponding to the test text;
a processing module, configured to obtain a first similarity between the first voice and a second voice according to the audio features of the first voice and the audio features of the second voice corresponding to the test text;
the processing module being further configured to determine an evaluation result of the first voice according to the first similarity and a known second voice evaluation result;
wherein the audio features include: amplitude and frequency;
the processing module being specifically configured to:
acquire first acoustic waveforms respectively corresponding to all first voice segments of the first voice;
acquire second acoustic waveforms respectively corresponding to all second voice segments of the second voice;
acquire a first result corresponding to the amplitude and a second result corresponding to the frequency according to the first acoustic waveforms and the second acoustic waveforms;
and obtain the first similarity according to the first result and the second result;
the processing module being further specifically configured to:
perform cross-comparison calculation on the first acoustic waveforms and the second acoustic waveforms to obtain the first result corresponding to the amplitude;
perform a similarity hash operation on the first acoustic waveforms to obtain a first similarity hash operation result;
perform a similarity hash operation on the second acoustic waveforms to obtain a second similarity hash operation result;
and acquire the intersection of the first similarity hash operation result and the second similarity hash operation result to obtain the second result corresponding to the frequency.
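As an illustration of the processing module's two intermediate results, the sketch below counts per-segment amplitude matches for the cross-comparison and uses a coarse quantize-and-hash signature in place of the similarity hash. The tolerance, the signature scheme, and the bin count are all assumptions; the patent does not specify the hash function.

```python
import numpy as np

def amplitude_cross_compare(first_spectra, second_spectra, tol=0.1) -> int:
    """Assumed form of the cross-comparison: count segment pairs whose
    amplitude envelopes agree within a relative tolerance (first result)."""
    matches = 0
    for a, b in zip(first_spectra, second_spectra):
        amp_a, amp_b = np.abs(a), np.abs(b)
        if np.linalg.norm(amp_a - amp_b) <= tol * np.linalg.norm(amp_b):
            matches += 1
    return matches

def similarity_hashes(spectra, n_bins=16) -> set:
    """Stand-in for the similarity hash: reduce each segment to its dominant
    frequency bins and hash that signature (hypothetical scheme)."""
    sigs = set()
    for spec in spectra:
        top = tuple(np.argsort(np.abs(spec))[-n_bins:])
        sigs.add(hash(top))
    return sigs

def frequency_intersection(first_spectra, second_spectra) -> int:
    """Second result: size of the intersection of the two hash result sets."""
    return len(similarity_hashes(first_spectra) & similarity_hashes(second_spectra))
```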
10. A computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202110386211.3A 2021-04-12 2021-04-12 Voice evaluation method, device, computer equipment and medium Active CN112802494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110386211.3A CN112802494B (en) 2021-04-12 2021-04-12 Voice evaluation method, device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN112802494A CN112802494A (en) 2021-05-14
CN112802494B CN112802494B (en) 2021-07-16

Family

ID=75817383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110386211.3A Active CN112802494B (en) 2021-04-12 2021-04-12 Voice evaluation method, device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN112802494B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450768A (en) * 2021-06-25 2021-09-28 平安科技(深圳)有限公司 Speech synthesis system evaluation method and device, readable storage medium and terminal equipment
CN113763918A (en) * 2021-08-18 2021-12-07 单百通 Text-to-speech conversion method and device, electronic equipment and readable storage medium
CN114898733A (en) * 2022-05-06 2022-08-12 深圳妙月科技有限公司 AI voice data analysis processing method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8050918B2 (en) * 2003-12-11 2011-11-01 Nuance Communications, Inc. Quality evaluation tool for dynamic voice portals
US8560318B2 (en) * 2010-05-14 2013-10-15 Sony Computer Entertainment Inc. Methods and system for evaluating potential confusion within grammar structure for set of statements to be used in speech recognition during computing event
KR101402805B1 (en) * 2012-03-27 2014-06-03 광주과학기술원 Voice analysis apparatus, voice synthesis apparatus, voice analysis synthesis system
US9384728B2 (en) * 2014-09-30 2016-07-05 International Business Machines Corporation Synthesizing an aggregate voice

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5772054B2 (en) * 2011-02-23 2015-09-02 ヤマハ株式会社 Singing evaluation device
JP5805474B2 (en) * 2011-09-09 2015-11-04 ブラザー工業株式会社 Voice evaluation apparatus, voice evaluation method, and program
CN102592589A (en) * 2012-02-23 2012-07-18 华南理工大学 Speech scoring method and device implemented through dynamically normalizing digital characteristics
CN103871426A (en) * 2012-12-13 2014-06-18 上海八方视界网络科技有限公司 Method and system for comparing similarity between user audio frequency and original audio frequency
CN108597538A (en) * 2018-03-05 2018-09-28 标贝(北京)科技有限公司 The evaluating method and system of speech synthesis system
CN108922563A (en) * 2018-06-17 2018-11-30 海南大学 Based on the visual verbal learning antidote of deviation organ morphology behavior
CN110726898A (en) * 2018-07-16 2020-01-24 北京映翰通网络技术股份有限公司 Power distribution network fault type identification method
CN109344388A (en) * 2018-08-02 2019-02-15 中央电视台 A kind of comment spam recognition methods, device and computer readable storage medium
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN109920431A (en) * 2019-03-05 2019-06-21 百度在线网络技术(北京)有限公司 Method and apparatus for output information
US20200286470A1 (en) * 2019-03-05 2020-09-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for outputting information
CN110400578A (en) * 2019-07-19 2019-11-01 广州市百果园信息技术有限公司 The generation of Hash codes and its matching process, device, electronic equipment and storage medium
CN110660383A (en) * 2019-09-20 2020-01-07 华南理工大学 Singing scoring method based on lyric and singing alignment
CN110853679A (en) * 2019-10-23 2020-02-28 百度在线网络技术(北京)有限公司 Speech synthesis evaluation method and device, electronic equipment and readable storage medium
CN111091816A (en) * 2020-03-19 2020-05-01 北京五岳鑫信息技术股份有限公司 Data processing system and method based on voice evaluation
CN111477251A (en) * 2020-05-21 2020-07-31 北京百度网讯科技有限公司 Model evaluation method and device and electronic equipment
CN111916108A (en) * 2020-07-24 2020-11-10 北京声智科技有限公司 Voice evaluation method and device
CN112397056A (en) * 2021-01-20 2021-02-23 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium

Also Published As

Publication number Publication date
CN112802494A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112802494B (en) Voice evaluation method, device, computer equipment and medium
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN101136199B (en) Voice data processing method and equipment
CN112397091B (en) Chinese speech comprehensive scoring and diagnosing system and method
US11158322B2 (en) Human resolution of repeated phrases in a hybrid transcription system
US9984677B2 (en) Bettering scores of spoken phrase spotting
US8494853B1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
US10490182B1 (en) Initializing and learning rate adjustment for rectifier linear unit based artificial neural networks
CN109036471B (en) Voice endpoint detection method and device
CN101887725A (en) Phoneme confusion network-based phoneme posterior probability calculation method
CN109461441B (en) Self-adaptive unsupervised intelligent sensing method for classroom teaching activities
CN110390948B (en) Method and system for rapid speech recognition
CN113920986A (en) Conference record generation method, device, equipment and storage medium
KR100682909B1 (en) Method and apparatus for recognizing speech
JP5050698B2 (en) Voice processing apparatus and program
JP5376341B2 (en) Model adaptation apparatus, method and program thereof
WO2018163279A1 (en) Voice processing device, voice processing method and voice processing program
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
JP2008176202A (en) Voice recognition device and voice recognition program
CN113823326B (en) Method for using training sample of high-efficiency voice keyword detector
CN112397048B (en) Speech synthesis pronunciation stability evaluation method, device and system and storage medium
CN112767961B (en) Accent correction method based on cloud computing
JPWO2020003413A1 (en) Information processing equipment, control methods, and programs
Saputri et al. Identifying Indonesian local languages on spontaneous speech data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant