CN113327615B - Voice evaluation method, device, equipment and storage medium - Google Patents

Voice evaluation method, device, equipment and storage medium

Info

Publication number
CN113327615B
CN113327615B (application CN202110882055.XA)
Authority
CN
China
Prior art keywords: pause, prosody, level, audio, evaluation
Prior art date
Legal status
Active
Application number
CN202110882055.XA
Other languages
Chinese (zh)
Other versions
CN113327615A (en)
Inventor
郭立钊
白锦峰
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110882055.XA
Publication of CN113327615A
Application granted
Publication of CN113327615B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state

Abstract

The present disclosure provides a speech evaluation method, apparatus, device and storage medium. The method includes: acquiring audio to be evaluated and a reference text corresponding to that audio; obtaining the evaluation pause duration of each prosody level of the audio according to the audio and the reference text; obtaining a level pause evaluation result for each prosody level according to the evaluation pause duration and a preset reference pause duration corresponding to the same prosody level; and obtaining an audio prosody evaluation result for the audio according to the level pause evaluation results of all prosody levels. The speech evaluation method provided by the present disclosure enables emotion evaluation of speech.

Description

Voice evaluation method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the field of computers, and in particular to a speech evaluation method, apparatus, device and storage medium.
Background
With the development of computer technology and deep learning, computer-aided pronunciation training has become a current research hotspot. From word recitation in English learning to Chinese reading training, computer-aided pronunciation training systems can help students learn spoken language more conveniently and efficiently.
Speech evaluation makes a machine understand a person's pronunciation and give timely feedback or a score, so as to assess pronunciation quality. Speech evaluation in the prior art is mostly based on whether the pronunciation is correct: a score is fed back according to pronunciation correctness, and pronunciation quality is evaluated in order to correct pronunciation. However, recitation evaluation centered on the degree of emotional expression is also a type of speech evaluation: a reciter needs to convey the thoughts and emotions of a work accurately, which requires not only correct pronunciation but also an understanding of the work's inherent meaning.
Therefore, how to implement emotion assessment for voice becomes a technical problem to be solved urgently.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present disclosure is to provide a speech evaluation method, apparatus, device and storage medium, so as to implement emotion evaluation of speech.
In order to solve the above problem, an embodiment of the present disclosure provides a speech evaluation method, including:
acquiring audio to be evaluated and a reference text corresponding to the audio;
obtaining the evaluation pause duration of each prosody level of the audio according to the audio and the reference text;
obtaining a level pause evaluation result for each prosody level according to the evaluation pause duration and a preset reference pause duration corresponding to the same prosody level;
and obtaining an audio prosody evaluation result for the audio according to the level pause evaluation results of all prosody levels.
In order to solve the above problem, an embodiment of the present disclosure further provides a speech evaluation apparatus, including:
an audio acquisition unit, adapted to acquire audio to be evaluated and a reference text corresponding to the audio;
an evaluation pause duration acquisition unit, adapted to obtain the evaluation pause duration of each prosody level of the audio according to the audio and the reference text;
a level pause evaluation result acquisition unit, adapted to obtain a level pause evaluation result for each prosody level according to the evaluation pause duration and a preset reference pause duration corresponding to the same prosody level;
and an audio prosody evaluation result acquisition unit, adapted to obtain an audio prosody evaluation result for the audio according to the level pause evaluation results of all prosody levels.
In order to solve the above problem, an embodiment of the present disclosure further provides an electronic device, including: a processor, and a memory storing a program, wherein the program comprises instructions which, when executed by the processor, cause the processor to perform a speech evaluation method as previously described.
To solve the above problem, embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause the computer to execute the speech evaluation method as described above.
Compared with the prior art, the technical scheme of the disclosure has the following advantages:
when performing speech evaluation, the audio to be evaluated and a reference text corresponding to that audio are acquired first; the evaluation pause duration of each prosody level of the audio is then obtained according to the audio and the reference text; a level pause evaluation result for each prosody level is obtained according to the evaluation pause duration and a preset reference pause duration corresponding to the same prosody level; and finally, an audio prosody evaluation result for the audio is obtained according to the level pause evaluation results of all prosody levels. It can be seen that the speech evaluation method provided by the embodiments of the present disclosure uses the evaluation pause durations of the prosody levels of the audio, combined with the reference pause durations corresponding to the same prosody levels, to obtain per-level pause evaluation results, and from these obtains the prosody evaluation result for the whole audio to be evaluated.
Drawings
FIG. 1 is a schematic flow chart diagram of a speech evaluation method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart for obtaining a reference pause duration of a speech evaluation method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a first pause duration obtaining method of a speech evaluation method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a training procedure of a prosody hierarchy classification model of a speech evaluation method according to an embodiment of the disclosure;
FIG. 5 is a block diagram of a speech evaluation device provided by an embodiment of the present disclosure;
FIG. 6 is another block diagram of a speech evaluation device provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an alternative hardware architecture of a device provided by an embodiment of the present disclosure.
Detailed Description
In the prior art, speech evaluation is mostly based on whether pronunciation is correct: a score is fed back according to pronunciation correctness, and pronunciation quality is evaluated in order to correct pronunciation.
In order to implement emotion evaluation of speech, embodiments of the present disclosure provide a speech evaluation method, apparatus, device and storage medium, wherein the speech evaluation method includes:
acquiring audio to be evaluated and a reference text corresponding to the audio;
obtaining the evaluation pause duration of each prosody level of the audio according to the audio and the reference text;
obtaining a level pause evaluation result for each prosody level according to the evaluation pause duration and a preset reference pause duration corresponding to the same prosody level;
and obtaining an audio prosody evaluation result for the audio according to the level pause evaluation results of all prosody levels.
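The four steps above can be sketched end to end as a minimal pipeline. This is an illustrative outline only: the function names, the per-level scoring callback, and the simple averaging of level results are assumptions, not the patent's implementation.

```python
def evaluate_prosody(audio, reference_text,
                     get_pause_durations, reference_durations, score_level):
    """Sketch of the four-step flow: per-level evaluation pause durations
    are compared against reference durations and combined into one result.

    get_pause_durations(audio, reference_text) -> {level_marker: seconds}
    reference_durations: preset {level_marker: seconds}
    score_level(eval_duration, ref_duration) -> per-level result
    """
    # Step 2: evaluation pause duration for each prosody level (#1..#4)
    eval_durations = get_pause_durations(audio, reference_text)
    # Step 3: level pause evaluation result, level by level
    level_results = {
        level: score_level(duration, reference_durations[level])
        for level, duration in eval_durations.items()
    }
    # Step 4: overall audio prosody evaluation result (plain mean here,
    # as one possible aggregation)
    overall = sum(level_results.values()) / len(level_results)
    return level_results, overall
```

Any concrete alignment front end and scoring rule can be plugged in through the three callbacks.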
It can be seen that, in the speech evaluation method provided by the embodiments of the present disclosure, when performing speech evaluation, the audio to be evaluated and its corresponding reference text are acquired first; the evaluation pause duration of each prosody level of the audio is then obtained from the audio and the reference text; a level pause evaluation result for each prosody level is obtained from the evaluation pause duration and the preset reference pause duration corresponding to the same prosody level; and finally, the audio prosody evaluation result is obtained from the level pause evaluation results of all prosody levels.
In this way, the method uses the evaluation pause duration of each prosody level, combined with the reference pause duration of the same prosody level, to obtain per-level pause evaluation results, and from these the prosody evaluation result of the whole audio to be evaluated.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Referring to fig. 1, fig. 1 is a flow chart of a speech evaluation method according to an embodiment of the present disclosure.
As shown in the figure, the speech evaluation method provided by the embodiment of the present disclosure includes the following steps:
step S10: and acquiring the audio to be tested and a reference text corresponding to the audio to be tested.
It is easy to understand that, in order to implement emotion evaluation of speech, the audio to be evaluated and its corresponding text are obtained first. The audio and text may be a student's reading audio and the corresponding text, a teacher's instructional reading audio and the corresponding text, or the reading audio of any other reader and its corresponding text.
It can be seen that the speech evaluation method provided by the present disclosure places no special limitation on the source of the audio to be evaluated or of its corresponding text. This means the method can conveniently perform speech evaluation on any audio and obtain an evaluation result, so it has wide application scenarios and is not limited to evaluating speech read aloud by students.
Step S11: and acquiring the evaluation pause duration of each rhythm level of the audio to be detected according to the audio to be detected and the reference text.
After the audio to be evaluated and its corresponding reference text are obtained, the evaluation pause duration of each prosody level of the audio is obtained, in preparation for the subsequent emotion evaluation.
It is easy to understand that the evaluation pause durations of the prosody levels are the pause durations obtained separately for each prosody level of the audio to be evaluated.
Specifically, the prosody hierarchy may include four levels: the prosodic word layer #1, the prosodic phrase layer #2, the intonation phrase layer #3, and the sentence layer #4, wherein:
Prosodic word layer: the basic prosodic unit. There is no pause within a prosodic word, and a pause at a prosodic word boundary is optional. In the unmarked case a prosodic word coincides with a lexical word, and in some cases it may be larger than a word.
Prosodic phrase layer: a prosodic phrase corresponds to one complete prosodic expression produced without taking a breath, and consists of one or more prosodic words. Its length is generally about 7 syllables (one character is one syllable), varying by about 2 syllables, which is roughly the length of one breath group; each prosodic phrase has a relatively stable short intonation pattern and phrase stress configuration.
Intonation phrase layer: the intonation phrase is the longest phonetic constituent, generally longer than a prosodic phrase, grammatically corresponding to a longer phrase or a shorter sentence; it has a particular intonation pattern that may be related to the syntactic or discourse structure.
Sentence layer: typically delimited by punctuation.
For convenience of description in the examples below, #1 denotes the prosodic word layer, #2 the prosodic phrase layer, #3 the intonation phrase layer, and #4 the sentence layer.
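The four level markers used throughout the examples can be captured in a small lookup table. This is a convenience sketch; the names simply follow the definitions above.

```python
# Prosody level markers as used in the annotated examples (#1..#4).
PROSODY_LEVELS = {
    "#1": "prosodic word",      # basic prosodic unit; boundary pause optional
    "#2": "prosodic phrase",    # ~7 syllables, roughly one breath group
    "#3": "intonation phrase",  # longest constituent below the sentence
    "#4": "sentence",           # typically delimited by punctuation
}
```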
When obtaining the evaluation pause durations, each prosody level of the audio to be evaluated is obtained first, and the pause duration of each prosody level is then obtained.
In a specific implementation, in order to improve the accuracy of the obtained evaluation pause durations of the prosody levels of the audio, the speech evaluation method provided by an embodiment of the present disclosure may include:
first, obtaining each prosody level of the reference text and the pause position of each prosody level;
then, determining the pause duration at the pause position of each prosody level according to the audio to be evaluated, to obtain each evaluation pause duration.
The rhythm of speech is part of how speech expresses emotion and cadence, and it is realized mainly through the pause durations at different prosody levels. Accurately dividing the prosody levels and marking their pause positions therefore makes it possible to judge the emotional quality of the audio from the dimension of reading rhythm.
On the basis of the prosody levels and their pause positions, the pause duration at each pause position is determined from the audio to be evaluated, yielding the evaluation pause durations. Since the prosody levels and their pause positions reflect the reader's reading rhythm well, the obtained evaluation pause durations reflect it well too, so that emotion evaluation of speech can be realized from the dimension of reading rhythm in recitation evaluation.
Specifically, each prosody level of the reference text and each pause position may be acquired as follows:
first, determine whether the prosody levels and pause positions of the reference text have already been obtained; if so, use them directly; otherwise, perform prosody level division on the reference text to obtain its prosody levels and the pause position of each level.
It is easy to understand that the reference text corresponding to the audio may already have been divided into prosody levels with pause positions, in which case they can be used directly; if it is determined that the reference text has not previously been divided, prosody level division is performed first, after which the prosody levels and pause positions can be obtained.
Specifically, one may directly search a reference text library for a copy of the reference text whose prosody levels and pause positions have already been obtained.
Thus, through this judgment, previously obtained prosody levels and pause positions of the reference text can be reused, avoiding repeated prosody level division of the reference text and improving efficiency.
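The look-up-else-divide logic can be sketched as a simple cache keyed by the reference text. This is a hypothetical sketch; the patent does not specify how the reference text library is stored or queried.

```python
# Hypothetical reference text library: text -> previously obtained
# (prosody levels, pause positions).
_prosody_cache = {}

def get_prosody_annotation(reference_text, divide_fn):
    """Return cached prosody levels/pause positions for reference_text,
    running the prosody level division model (divide_fn) only on a miss."""
    if reference_text not in _prosody_cache:
        _prosody_cache[reference_text] = divide_fn(reference_text)
    return _prosody_cache[reference_text]
```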
In one embodiment, when it is determined that the prosody levels and pause positions of the reference text have not been obtained, prosody level division may be performed on the reference text using a pre-trained prosody level division model to obtain the prosody levels of the reference text and the pause position of each level.
Specifically, the reference text corresponding to the audio to be evaluated is input into the trained prosody level division model, and the model outputs each prosody level and the pause position of each prosody level.
For ease of understanding the foregoing scheme, the following describes a specific case:
For the sentence "Children's hospital, beep and dad together.", the example sentence is first input into the prosody level division model for prosody level division:
Input: Children's hospital, beep and dad together.
The prosody level division model performs the division:
Output: Child #1 hospital #3, beep #1 and dad #2 together #4.
After each prosody level of the reference text and the pause position of each level are obtained, the evaluation pause duration of each prosody level is determined from the audio to be evaluated.
Continuing the case above: the pause duration between "child" and "hospital" is determined from the audio, giving the evaluation pause duration of the first #1 level; the pause duration between "hospital" and "beep" then gives the evaluation pause duration of the #3 level; and so on, until the pause durations of all prosody levels of the audio are obtained.
In this way, each prosody level of the audio can be acquired accurately and conveniently on the one hand, and the determination of the evaluation pause durations is facilitated on the other.
Specifically, to obtain the evaluation pause durations conveniently, a forced alignment model may determine which word each speech frame of the audio belongs to, i.e., the number of speech frames corresponding to each word. Because speech frames have a fixed window length and frame shift (typically a 25 ms window and a 10 ms shift), the time point information of each word follows, including the time points of the silent pauses in the audio. For example:
Continuing the case above, after forced alignment by the forced alignment model, in "child #1 hospital #3":
frames 1 to 4 correspond to the character 儿, frames 5 to 7 to 童, the silent frames between them and the next word constitute the #1 pause, frames 8 and 9 correspond to 医, and frames 10 to 15 to 院. That is, forced alignment determines the durations of "child", "hospital", "beep", "dad" and so on in the audio, together with the silence durations at the pause positions. From the silence durations at the pause positions of the audio, the pause durations of all levels are known, and the evaluation pause durations are then obtained from them.
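With the typical 25 ms window and 10 ms frame shift mentioned above, frame indices from forced alignment convert to time points, and the number of silent frames between two aligned words gives a pause duration. The arithmetic below is a sketch; the exact accounting of window overlap at segment edges is an assumption.

```python
FRAME_SHIFT_S = 0.010  # 10 ms frame shift (typical value from the text)
WINDOW_S = 0.025       # 25 ms window length (typical value from the text)

def frame_span_to_time(start_frame, end_frame):
    """Start/end times in seconds for 1-indexed, inclusive frame indices."""
    start = (start_frame - 1) * FRAME_SHIFT_S
    end = (end_frame - 1) * FRAME_SHIFT_S + WINDOW_S
    return start, end

def pause_duration(prev_word_end_frame, next_word_start_frame):
    """Silence between two aligned words, counted in frame shifts:
    the frames strictly between the end of one word and the start
    of the next are treated as the pause."""
    silent_frames = next_word_start_frame - prev_word_end_frame - 1
    return silent_frames * FRAME_SHIFT_S
```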
In a specific implementation, the step of determining the pause duration at the pause position of each prosody level according to the audio to be evaluated, to obtain each evaluation pause duration, may include:
using a pre-trained forced alignment model, acquiring the pause durations of the audio at the different prosody levels according to the mutually corresponding audio and text;
obtaining the evaluation pause duration corresponding to each prosody level according to the pause durations of the same prosody level, to obtain the evaluation pause durations of the different prosody levels.
That is, the pause duration of each prosody level of the audio is obtained first, where "each prosody level" covers all prosody level occurrences, including repeated occurrences of the same level as well as the different levels.
Returning to the case above: "Child #1 hospital #3, beep #1 and dad #2 together #4." It can be seen that in this case there are two occurrences of prosody level #1, and five pause durations in total.
In this case, #1 pauses in two places: the pause durations after "child #1" and "beep #1" are denoted A1 and A2 respectively. Correspondingly, the pause durations at the #2, #3 and #4 pause positions in the text are taken as the evaluation pause durations of #2, #3 and #4, denoted B, C and D respectively.
The evaluation pause duration of each prosody level is thus obtained:
the evaluation pause duration of the first #1 occurrence is: A1;
the evaluation pause duration of the second #1 occurrence is: A2;
the evaluation pause duration of level #2 is: B;
the evaluation pause duration of level #3 is: C;
the evaluation pause duration of level #4 is: D.
Of course, in another embodiment, when calculating the evaluation pause durations, the average of the pause durations corresponding to the same prosody level may be taken as that level's evaluation pause duration. In the case above, #1 pauses in two places, with durations A1 and A2; the evaluation pause duration of level #1 is then the average of A1 and A2, denoted A. Correspondingly, the averages of the pause durations at the #2, #3 and #4 pause positions in the text give the evaluation pause durations of #2, #3 and #4, denoted B, C and D respectively.
The evaluation pause duration of each prosody level is thus obtained:
the evaluation pause duration of level #1 is: A;
the evaluation pause duration of level #2 is: B;
the evaluation pause duration of level #3 is: C;
the evaluation pause duration of level #4 is: D.
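The averaging variant above can be sketched as grouping the observed pause durations by level marker and taking the mean (the two #1 pauses A1 and A2 collapse into one value A):

```python
from collections import defaultdict

def evaluation_pause_durations(pauses):
    """Average the pause durations observed at each prosody level.

    `pauses` is a list of (level_marker, duration_seconds) pairs, one
    per observed pause, e.g. [("#1", A1), ("#1", A2), ("#2", B), ...].
    """
    grouped = defaultdict(list)
    for level, duration in pauses:
        grouped[level].append(duration)
    # One mean evaluation pause duration per prosody level.
    return {level: sum(ds) / len(ds) for level, ds in grouped.items()}
```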
Therefore, by first dividing the prosody levels and marking their pause positions, and then determining the pause duration at each pause position from the audio to be evaluated, the evaluation pause durations can be obtained conveniently, facilitating the subsequent emotion evaluation of the audio.
Step S12: and obtaining the level pause evaluation result of each prosody level according to the evaluation pause duration corresponding to the same prosody level and a preset reference pause duration.
After the evaluation pause durations are obtained, they are combined with the reference pause durations to obtain the level pause evaluation result of each prosody level.
It is easy to understand that the reference pause durations are obtained from reference audio meeting the reading quality requirement; they therefore satisfy the predetermined quality requirement and reflect the reading rhythm of the reference audio well, so the resulting level pause evaluation results evaluate the audio to be evaluated more accurately.
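This passage does not give the comparison formula itself. Purely for illustration, a hypothetical per-level score could fall off with the relative deviation of the evaluated pause duration from the reference pause duration; this is an assumption, not the patent's formula.

```python
def level_pause_score(eval_duration, ref_duration):
    """Hypothetical per-level score in [0, 1]: 1.0 when the evaluated
    pause matches the reference exactly, decreasing with the relative
    deviation. NOT the patent's formula, which is not specified here."""
    if ref_duration <= 0:
        return 0.0
    deviation = abs(eval_duration - ref_duration) / ref_duration
    return max(0.0, 1.0 - deviation)
```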
In a specific embodiment, to facilitate obtaining a predetermined reference pause duration, please refer to fig. 2, where fig. 2 is a flowchart of obtaining the reference pause duration of the speech evaluation method according to the embodiment of the present disclosure.
As shown in the figure, the reference pause duration of the speech evaluation method provided by the embodiment of the present disclosure can be obtained by the following steps:
Step S120: acquiring each reference audio and the reference text corresponding to each reference audio.
in order to more accurately perform voice evaluation on the audio to be tested, the acquisition of the reference pause duration is particularly important, and the high quality of the voice evaluation on the audio to be tested can be ensured because the reference audio meets the preset quality requirement.
It is easy to understand that the reference audio refers to reference reading audio meeting quality requirements, and the reference text refers to text corresponding to the reference audio, and may be obtained through a speech recognition method or other methods. And in order to ensure the accuracy and applicability of the acquired reference pause duration, a large number of reference audios and reference texts need to be based on, so that the number of the acquired reference audios and reference texts is not only one.
Step S121: and obtaining first pause duration of each different prosody level of each reference audio according to the reference audio and the reference text which correspond to each other.
The first pause duration is obtained, so that the reference pause duration can be conveniently obtained subsequently.
After the reference audio and the reference text are obtained, the reference pause duration is obtained based on information contained in the reference audio and the reference text, the number of the prosody hierarchies of the reference text corresponding to a large number of reference audios is large due to the large number of the reference audio, and in order to obtain the reference pause duration finally used for the hierarchy pause evaluation result, first pause durations of a large number of prosody hierarchies are obtained.
In one embodiment, referring to fig. 3, the step of obtaining the first pause duration of each different prosody level of each reference audio provided by the speech evaluation method according to the embodiment of the present disclosure may include:
step 1210: and carrying out prosody hierarchy division on the reference text to obtain each prosody hierarchy of the reference text and a pause position of each prosody hierarchy.
First, the reference text corresponding to the reference audio is divided into its different prosody levels. The rhythm of speech is part of how speech expresses emotional cadence, realized mainly through the pause durations at the different prosody levels; by obtaining the prosody levels of the reference text and their pause positions, they can be compared with those of the audio to be evaluated, and the emotional quality of the audio can be better assessed from the dimension of reading rhythm.
In a specific embodiment, in the speech evaluation method provided by the present disclosure, the step of performing prosody level division on the reference text to obtain each prosody level of the reference text and the pause position of each prosody level includes:
performing prosody level division on the reference text using a prosody level division model to obtain each prosody level of the reference text and the pause position of each level, where the prosody hierarchy is the same as that of the audio to be evaluated and includes a prosodic word layer, a prosodic phrase layer, an intonation phrase layer and a sentence layer.
For ease of understanding, the cases are described again below:
The following example sentences are input into the prosody level division model, which performs prosody level division on them; the output results are as follows:
Input:
Example 1: Li's website was quickly sealed by the police.
Example 2: Children's hospital, Beep and dad together.
Example 3: Upon examination, the elderly had mild cerebral thrombosis.
Output:
Example 1: Li's #1 website #3 was quickly #2 sealed by the police #1 #4.
Example 2: Children's #1 hospital #3, Beep #1 and dad #2 together #4.
Example 3: Upon examination #3, the elderly had #1 mild #2 cerebral thrombosis #4.
In this way, dividing the prosody hierarchy of a sentence through the prosody hierarchy division model yields the prosody levels of the sentence and their pause positions. For example, in example 1: "Li's #1" indicates a pause after "Li's" at prosody level #1, the prosodic word layer; "website #3" indicates a pause after "website" at level #3, the intonation phrase layer; "quickly #2" indicates a pause after "quickly" at level #2, the prosodic phrase layer; "by the police #1" indicates a pause after "police" at level #1, the prosodic word layer; and the final "#4" indicates a pause at the end of the sentence, at level #4, the sentence layer.
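The "#n" marker convention above lends itself to simple programmatic parsing. The following sketch is illustrative only (the function name and the English example sentence are assumptions, not part of the disclosed method); it recovers each pause position and its prosody level from a marked sentence:

```python
import re

# Prosody levels used in the disclosure:
# #1 prosodic word, #2 prosodic phrase, #3 intonation phrase, #4 sentence.
LEVEL_NAMES = {1: "prosodic word", 2: "prosodic phrase",
               3: "intonation phrase", 4: "sentence"}

def parse_prosody_markers(marked):
    """Split a '#n'-annotated sentence into (segment, pause_level) pairs.
    An empty final segment marks the sentence-end pause (#4)."""
    parts = re.split(r'\s*#([1-4])\s*', marked.strip())
    # parts alternates between text segments and the captured level digits
    return [(seg, int(lvl)) for seg, lvl in zip(parts[0::2], parts[1::2])]

pairs = parse_prosody_markers(
    "Li's #1 website #3 was quickly #2 sealed by the police #1 #4")
for seg, lvl in pairs:
    print(f"pause after {seg!r}: level #{lvl} ({LEVEL_NAMES[lvl]})")
```

Each pair carries the text segment preceding a marker, so the routine reports pauses after "Li's" (#1), "website" (#3), "was quickly" (#2) and "sealed by the police" (#1), plus the sentence-end pause (#4).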
Therefore, performing prosody hierarchy division on the reference text with the prosody hierarchy division model yields each prosody hierarchy of the reference text and the pause positions of each hierarchy, and thereby the precise rhythm variation of the reference text. Since the prosody hierarchy division model must meet a predetermined quality requirement during training, high-quality prosody hierarchies and pause positions for the reference text can be guaranteed, so that the rhythm variation of the reference text can be captured with high quality, providing a basis for evaluating the emotion of speech from the dimension of reading rhythm.
In a specific implementation manner, the prosody hierarchy classification model in the speech evaluation method provided by the embodiment of the disclosure may be trained in the following manner, please refer to fig. 4, where fig. 4 is a schematic diagram of a training step of the prosody hierarchy classification model of the speech evaluation method provided by the embodiment of the disclosure.
As shown in the figure, the prosodic hierarchy classification model may be trained by the following steps, including:
Step S40: acquiring a sample text training set, wherein the sample text training set comprises mutually corresponding to-be-tested sample texts and reference sample texts, and each reference sample text is labeled with reference prosody levels.
To train the prosody hierarchy division model, a sample text training set needs to be acquired. The training set comprises mutually corresponding to-be-tested sample texts and reference sample texts; each reference sample text is labeled with reference prosody levels and meets the predetermined quality requirement, so that the predicted prosody levels of the to-be-tested sample texts can be learned against the reference prosody levels of reference sample texts that meet the quality requirement.
Step S41: acquiring the predicted prosody level of the to-be-tested sample text from the to-be-tested sample text by using the prosody hierarchy division model.
From the to-be-tested sample text, the prosody hierarchy division model obtains a preliminary predicted prosody level of the to-be-tested sample text.
Step S42: determining a first loss of the to-be-tested sample text according to the predicted prosody level and the reference prosody level.
The first loss of the to-be-tested sample text is calculated from its predicted prosody level and the reference prosody level. At this stage the predicted prosody level is not yet accurate, and its quality is judged according to the first loss.
Step S43: judging whether the first loss satisfies a predetermined threshold; if yes, step S45 is performed, and if not, step S44 is performed.
The first loss is used to judge the quality of the predicted prosody level: if the first loss does not satisfy the predetermined threshold, step S44 is performed, and the process repeats until the first loss satisfies the predetermined threshold, at which point step S45 is performed.
Step S44: adjusting parameters of the prosody hierarchy division model.
According to the first loss, the parameters of the prosody hierarchy division model are adjusted, guided by the reference prosody levels of the reference sample texts that meet the predetermined quality requirement, so that the model obtains a new predicted prosody level of the to-be-tested sample text under the new parameters, and the new prediction is more accurate.
Step S45: obtaining the trained prosody hierarchy division model.
Through the training of the above steps, the trained prosody hierarchy division model is obtained.
Therefore, during training, the prosody hierarchy division model can learn the characteristics of the reference prosody hierarchy division from the reference prosody levels of the reference sample texts together with the to-be-tested sample texts, so that the predicted prosody levels obtained for the to-be-tested sample texts carry the characteristics of the reference prosody hierarchy.
Step S1211: determining the pause duration at the pause position of each prosody hierarchy according to the reference audio corresponding to the reference text, to obtain each first pause duration.
After the pause positions of the prosody hierarchies of the reference text are obtained, the pause duration at each of those positions is determined from the reference audio corresponding to the reference text, yielding the first pause durations.
The prosody hierarchy of the reference text is first divided accurately, yielding accurate prosody levels and pause positions; on this basis the first pause durations are obtained from the reference audio, so that the accurate prosody levels and pause positions can well reflect the reading rhythm of the reader.
In a specific embodiment, the step of determining the pause duration of the pause position of each prosody hierarchy according to the reference audio corresponding to the reference text to obtain each first pause duration provided by the speech evaluation method of the present disclosure includes:
determining, by using a forced alignment model, the pause duration at the pause position of each prosody hierarchy in the reference audio corresponding to the reference text, to obtain each first pause duration.
The forced alignment model has strict quality requirements in the training process, so that each first pause duration meeting the quality requirements can be obtained through the forced alignment model, and the high quality of emotion assessment of the voice is further ensured from the dimension of reading rhythm.
In a specific embodiment, prosody hierarchy division is performed on the reference text by the prosody hierarchy division model, and each first pause duration is then obtained from the reference audio by the forced alignment model. For example:
First, the example sentence is input into the prosody hierarchy division model for prosody hierarchy division:
Input: Upon examination, the elderly had mild cerebral thrombosis.
The prosody hierarchy division model performs the division:
Output: Upon examination #3, the elderly had #1 mild #2 cerebral thrombosis #4.
Then, through the forced alignment model and the reference audio, it can be determined which word each speech frame belongs to, i.e. how many speech frames correspond to each word. Because speech frames have a fixed window length and frame shift (generally a 25 ms window length and a 10 ms frame shift), the time point information of each word can then be obtained, including the silence time points of the pauses in the audio. For example:
As a result of forced alignment by the forced alignment model, in "Upon examination #3, the elderly":
speech frames 1 to 4 belong to the first character of "upon examination", frames 5 to 7 to the second, and frames 8 to 9 to the third; between the phrases lies the pause of #3; frames 10 to 15 belong to the first character of "the elderly" and frames 16 to 18 to the second. That is, forced alignment determines the frames occupied in the audio by "upon examination", "the elderly", "mild" and "cerebral thrombosis" respectively, including the silence durations at the silence time points of the pause positions. From the silence duration at each pause position of the reference audio, the pause duration of each level of the reference audio is known; these pause durations are the first pause durations of each prosody level.
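As a rough illustration of how silence at pause positions converts to durations, the sketch below assumes a 10 ms frame shift and a toy alignment in the shape described above; the word boundaries and frame numbers are invented for illustration, and a real forced alignment model would supply them:

```python
FRAME_SHIFT_MS = 10   # typical frame shift; the window length is typically 25 ms

def frame_start_ms(frame_index):
    """Start time of a 1-indexed speech frame."""
    return (frame_index - 1) * FRAME_SHIFT_MS

def pause_durations_ms(alignment):
    """alignment: list of (word, first_frame, last_frame), 1-indexed and in
    time order. Frames lying between two consecutive words are silence, and
    each silence frame contributes one frame shift of pause duration."""
    pauses = []
    for (w1, _, end1), (w2, start2, _) in zip(alignment, alignment[1:]):
        silent_frames = start2 - end1 - 1
        pauses.append((w1, w2, silent_frames * FRAME_SHIFT_MS))
    return pauses

# Toy alignment: a #3 pause of 20 silent frames between the two phrases.
align = [("upon", 1, 4), ("examination", 5, 9), ("the elderly", 30, 45)]
print(pause_durations_ms(align))
```

Here the 20 silent frames between frame 9 and frame 30 yield a 200 ms pause, which would be recorded as a first pause duration for the #3 level.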
In a specific implementation manner, the forced alignment model of the speech evaluation method provided by the embodiment of the present disclosure may be trained in the following manner:
the training step of the forced alignment model comprises the following steps:
1) acquiring a training text and a training audio corresponding to the training text, wherein each character of the training text is labeled with the reference audio frames corresponding to that character.
To train the forced alignment model, a training text and its corresponding training audio need to be acquired, with each character of the training text labeled with its corresponding reference audio frames; specifically, the correspondence between reference audio frames and characters can be obtained by manual labeling.
2) acquiring, through the forced alignment model, the predicted audio frames corresponding to each character according to the training text and its corresponding training audio.
The forced alignment model obtains the predicted audio frames corresponding to each character from the training text and the corresponding training audio; at this stage, however, the predicted audio frames may differ from the reference audio frames, and further adjustment is required.
3) acquiring a second loss according to the predicted audio frames and the reference audio frames corresponding to the same character, and adjusting the parameters of the forced alignment model according to the second loss until the second loss satisfies a predetermined threshold, to obtain the trained forced alignment model.
The second loss is acquired from the predicted and reference audio frames corresponding to the same character. If the second loss does not satisfy the predetermined threshold, the parameters of the forced alignment model are adjusted according to the second loss and the predicted audio frames are acquired again, until the second loss satisfies the predetermined threshold, i.e. until the similarity between the predicted and reference audio frames of the same character meets the requirement; the trained forced alignment model is thereby obtained.
Therefore, during training, the forced alignment model can learn the correspondence between audio frames and characters from the reference audio frames labeled on each character of the training text, so that when aligning audio with its text, audio frames corresponding to each character can be obtained with the required accuracy.
As can be seen, in the process of obtaining the first pause durations, prosody hierarchy division is performed on the reference text by the trained prosody hierarchy division model to obtain each prosody hierarchy of the reference text and the pause positions of each hierarchy; then the trained forced alignment model determines the pause durations at those pause positions in the reference audio corresponding to the reference text, yielding the first pause durations of each prosody hierarchy. Both models are held to strict quality requirements during training, which ensures the high quality of the obtained prosody hierarchies and pause positions, and in turn of the first pause durations. Therefore, in recitation evaluation centered on emotion, the high quality of the emotional-expression evaluation of the audio to be tested can be ensured from the dimension of reading rhythm.
Step S122: acquiring, according to the first pause durations of the same prosody level, the reference pause duration corresponding to that prosody level, to obtain the reference pause durations of the different prosody levels.
After the first pause durations are obtained, the reference pause duration of each prosody level is obtained. It is easy to see that the number of reference pause durations equals the number of prosody level categories; in one embodiment there are four prosody levels, so there are also four reference pause durations, i.e. one reference pause duration per prosody level.
Since the reference audio meets the predetermined quality requirement, the first pause durations obtained for the different prosody levels also meet the quality requirement, and therefore the reference pause durations of the different prosody levels, obtained from the first pause durations of the same prosody level, meet the quality requirement as well.
In another specific implementation, the reference pause duration in the speech evaluation method provided by the embodiment of the present disclosure may be:
the average, the most frequently occurring value, or a weighted average of the first pause durations of the same prosody level.
Continuing with the above case, the reference pause duration is obtained by using the average value of the first pause durations of the same prosody level:
Example 1: Li's #1 website #3 was quickly #2 sealed by the police #1 #4.
Example 2: Children's #1 hospital #3, Beep #1 and dad #2 together #4.
Example 3: Upon examination #3, the elderly had #1 mild #2 cerebral thrombosis #4.
The first pause duration at each pause position in the example sentences is obtained from the forced alignment model and the reference audio. In the three examples, #1 pauses occur at five places; if, in one specific embodiment, the reference pause duration of a prosody level is the average of its first pause durations, the reference pause duration of #1 is the average of the five #1 first pause durations, denoted a. Correspondingly, averaging the first pause durations at #2, #3 and #4 in the reference text gives the reference pause durations of #2, #3 and #4, denoted b, c and d respectively.
The reference pause duration of each prosody level is thus obtained:
The reference pause duration of level #1 is: a.
The reference pause duration of level #2 is: b.
The reference pause duration of level #3 is: c.
The reference pause duration of level #4 is: d.
Of course, to ensure accuracy, the first pause durations are obtained over a large number of reference texts and reference audios: the reference pause duration of #1 is the average over a large number of #1 first pause durations, and the reference pause durations of #2, #3 and #4 are likewise the averages over large numbers of their respective first pause durations.
In other embodiments, the reference pause duration may also be the most frequently occurring value or a weighted average of the first pause durations.
Specifically, when the reference pause duration is the most frequently occurring of the first pause durations, for example: among the first pause durations counted for the #1 prosody level, the specific values include t1, t2 and t3, and t1 occurs most often; then t1 is taken as the reference pause duration of the #1 prosody level.
When the reference pause duration is a weighted average of the first pause durations, for example: among the first pause durations counted for the #1 prosody level, the specific values include t1, t2 and t3, t1 occurs most often and t3 least often, so t1 is given the largest weight and t3 the smallest; the weighted average of the three is then calculated and taken as the reference pause duration of the #1 prosody level.
Taking the mean of the first pause durations of the same prosody level makes the reference pause duration easy to obtain, with a simple calculation; taking the most frequently occurring value fits the actual situation more closely; taking a weighted average combines all the pause durations while letting each influence the result to a different degree, which ensures the accuracy of the obtained reference pause duration.
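The three aggregation choices above can be sketched as follows; the function name, the sample durations, and the weights are illustrative assumptions, not values from the disclosure:

```python
from collections import Counter
from statistics import mean

def reference_pause(durations, method="mean", weights=None):
    """Aggregate the first pause durations (ms) of one prosody level
    into a reference pause duration."""
    if method == "mean":
        return mean(durations)
    if method == "mode":       # most frequently occurring value
        return Counter(durations).most_common(1)[0][0]
    if method == "weighted":   # caller-chosen weights, e.g. by frequency
        return sum(d * w for d, w in zip(durations, weights)) / sum(weights)
    raise ValueError(method)

level1 = [100, 100, 100, 120, 150]        # hypothetical #1 durations (ms)
print(reference_pause(level1, "mean"))    # 114
print(reference_pause(level1, "mode"))    # 100
print(reference_pause(level1, "weighted", weights=[3, 3, 3, 1, 1]))
```

With frequency-based weights the weighted variant leans toward the most common value, as described above, while still letting every observed duration contribute.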
Therefore, to evaluate the audio to be tested more accurately, the acquisition of the reference pause durations is particularly important. Since the reference audio meets the predetermined quality requirement, the first pause durations obtained for the different prosody levels also meet the quality requirement, and so the reference pause durations of the different prosody levels, obtained from the first pause durations of the same prosody level, meet the quality requirement as well; this ensures the high quality of the speech evaluation of the audio to be tested. Meanwhile, the first pause durations and the reference pause durations of the different prosody levels can well reflect the reading rhythm of the reader, so that in recitation evaluation centered on emotion, the emotion of the speech can be evaluated from the dimension of reading rhythm.
The evaluation pause duration corresponding to the same prosody level and the predetermined reference pause duration are obtained, and the level pause evaluation result of each prosody level is obtained based on the evaluation pause duration and the predetermined reference pause duration.
In one embodiment, the level pause evaluation result of each prosody level is obtained as follows:
obtaining the maximum and minimum of the evaluation pause duration and the reference pause duration, and taking the ratio of the minimum to the maximum as the level pause evaluation result;
or obtaining the absolute value of the difference between the evaluation pause duration and the reference pause duration, obtaining the maximum of the two, and taking the ratio of the absolute value to the maximum as the level pause evaluation result.
For convenience of understanding, the following will be described in conjunction with the foregoing cases, respectively:
The evaluation pause durations of prosody levels #1, #2, #3 and #4 of the audio to be tested are denoted A, B, C and D respectively, and the predetermined reference pause durations of prosody levels #1, #2, #3 and #4 are denoted a, b, c and d respectively.
When the level pause evaluation result is obtained as the ratio of the minimum to the maximum of the evaluation pause duration and the reference pause duration, the formula is as follows:
A1 = min(A, a) / max(A, a)
B2 = min(B, b) / max(B, b)
C3 = min(C, c) / max(C, c)
D4 = min(D, d) / max(D, d)
wherein: a1 is the result of the level pause evaluation of level #1, B2 is the result of the level pause evaluation of level #2, C3 is the result of the level pause evaluation of level #3, and D4 is the result of the level pause evaluation of level # 4.
The values of A1, B2, C3 and D4 lie in [0,1]. The closer a score is to 1, the closer the evaluation pause duration is to the reference pause duration and the more standard the pronunciation; the closer it is to 0, the greater the difference between the two and the less standard the pronunciation.
In another specific implementation, when the absolute value of the difference between the evaluation pause duration and the reference pause duration is obtained, the maximum of the two is obtained, and the ratio of the absolute value to the maximum is taken as the level pause evaluation result, the formula is as follows:
A1 = |A - a| / max(A, a)
B2 = |B - b| / max(B, b)
C3 = |C - c| / max(C, c)
D4 = |D - d| / max(D, d)
wherein: a1 is the result of the level pause evaluation of level #1, B2 is the result of the level pause evaluation of level #2, C3 is the result of the level pause evaluation of level #3, and D4 is the result of the level pause evaluation of level # 4.
The values of A1, B2, C3 and D4 lie in [0,1]. The closer a score is to 0, the closer the evaluation pause duration is to the reference pause duration and the more standard the pronunciation; the closer it is to 1, the greater the difference between the two and the less standard the pronunciation.
When the ratio of the minimum to the maximum of the evaluation pause duration and the reference pause duration is used as the level pause evaluation result, the formula is simple: the result is obtained directly from the ratio of the two durations.
When the absolute value of the difference between the evaluation pause duration and the reference pause duration is divided by the maximum of the two, the difference is taken first, which makes the level pause evaluation result more intuitive: the greater the difference between the evaluation pause duration and the reference pause duration, the less standard the pronunciation.
It can be seen that the speech evaluation method provided by the present disclosure may adopt different methods for obtaining the level pause evaluation result in specific embodiments, which provides diversity and flexibility of the method for obtaining the level pause evaluation result.
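Both scoring variants are easy to state in code. The sketch below is illustrative (the function names and the 90 ms / 120 ms figures are assumptions):

```python
def pause_score_ratio(eval_dur, ref_dur):
    """min/max ratio: 1.0 means the evaluated pause matches the reference."""
    return min(eval_dur, ref_dur) / max(eval_dur, ref_dur)

def pause_score_diff(eval_dur, ref_dur):
    """|difference| / max: 0.0 means the evaluated pause matches the reference."""
    return abs(eval_dur - ref_dur) / max(eval_dur, ref_dur)

# Level #1: evaluated pause A = 90 ms against reference pause a = 120 ms.
print(pause_score_ratio(90, 120))   # 0.75 (closer to 1 = more standard)
print(pause_score_diff(90, 120))    # 0.25 (closer to 0 = more standard)
```

Note that for positive durations the two scores are complementary, since |A - a| / max(A, a) = 1 - min(A, a) / max(A, a); the variants differ only in orientation, not in the information they carry.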
Step S13: obtaining the audio prosody evaluation result of the audio to be tested according to the level pause evaluation result of each prosody level.
After the level pause evaluation results of the audio to be tested are obtained, the audio prosody evaluation result of the whole audio to be tested is further obtained.
In a specific embodiment, in order to evaluate the speech to be tested, the step of obtaining the audio prosody evaluation result of the audio to be tested according to the level pause evaluation result of each prosody level includes:
and averaging the pause evaluation results of the same prosody level of the audio to be tested to obtain the average scores of different prosody levels of the audio to be tested.
Obtaining weights for each of the different prosodic levels.
And obtaining a prosody evaluation result of the audio to be tested according to the average score of different prosody levels of the audio to be tested, the weight of each different prosody level and the number of categories of different prosody levels.
According to the evaluation pause duration of each rhythm level of the audio to be tested: levels #1, #2, #3, #4 are denoted as A, B, C, D and the reference pause duration of the predetermined prosodic level, respectively: the levels #1, #2, #3 and #4 are respectively marked as a, b, c and d, and the obtained level pause evaluation results are as follows: a1 is the result of the level pause evaluation of level #1, B2 is the result of the level pause evaluation of level #2, C3 is the result of the level pause evaluation of level #3, and D4 is the result of the level pause evaluation of level # 4.
It is easily understood that a plurality of levels #1, #2, #3, and #4 may be included in the audio to be tested.
For example:
example 1: li certain #1 website #3 was soon #2 verified by police #1 # 4.
Two levels #1 are included in example 1.
Assuming the audio to be tested includes n1 occurrences of level #1, n2 of level #2, n3 of level #3 and n4 of level #4, the pause evaluation results of the same prosody level are first averaged to obtain the average score of each prosody level: the n1 values of A1 for level #1 are averaged to obtain A11, the n2 values of B2 to obtain B22, the n3 values of C3 to obtain C33, and the n4 values of D4 to obtain D44.
Then the weights of the prosody levels #1, #2, #3 and #4 are obtained, denoted r1, r2, r3 and r4. In a specific embodiment, the weight of each prosody level can be adjusted according to different requirements to emphasize a certain level; for example, since the prosodic phrase layer and the intonation phrase layer better reflect the rhythm of pronunciation, the weights of these two levels can be raised appropriately to obtain a more accurate evaluation result.
Finally, the prosody evaluation result of the audio to be tested is obtained from the average scores of the different prosody levels, the weight of each level, and the number of prosody level categories.
The calculation formula is as follows:
Score = (r1 × A11 + r2 × B22 + r3 × C33 + r4 × D44) / 4
if there is no prosody level #3 in a reference text, the corresponding formula is:
Score = (r1 × A11 + r2 × B22 + r4 × D44) / 3
Of course, in another specific embodiment, if the average of the multiple pause durations of the same prosody level is taken when obtaining the evaluation pause durations, each prosody level corresponds to only one level pause evaluation result; in that case, evaluating the speech to be tested specifically includes:
obtaining weights of different prosody levels;
and obtaining a rhythm evaluation result of the audio to be tested according to the level pause evaluation results of different rhythm levels of the audio to be tested, the weight of each different rhythm level and the number of categories of different rhythm levels.
It can be seen that, in obtaining the prosody evaluation result of the audio to be tested, the pause evaluation results of the same prosody level are averaged first, which ensures that every pause evaluation result of that level contributes to the final result and none is ignored as unimportant. Then the different weights of the different prosody levels are obtained; in a specific embodiment these weights can be adjusted as required, ensuring the accuracy of the final prosody evaluation result. Finally, the prosody evaluation result of the audio to be tested is obtained from the average scores of the different prosody levels, their weights, and the number of prosody level categories; in the calculation formula, the number of prosody level categories found in the text to be tested does not affect the accuracy of the result, so the accuracy of the prosody evaluation result is ensured for a variety of different audios to be tested.
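The aggregation just described can be sketched as follows, assuming the division-by-category-count form of the formula; the per-level scores and weights are invented for illustration:

```python
from statistics import mean

def prosody_score(level_results, weights):
    """level_results: {level: [level pause evaluation results for each
    occurrence of that level in the audio under test]}.
    weights: {level: weight}. Levels absent from the text are simply
    not counted, so the category count adapts automatically."""
    levels = sorted(level_results)
    return sum(weights[l] * mean(level_results[l]) for l in levels) / len(levels)

# min/max-style scores, so higher is better
results = {1: [0.8, 0.9], 2: [0.7], 3: [0.95], 4: [1.0]}
weights = {1: 1.0, 2: 1.2, 3: 1.2, 4: 1.0}
print(prosody_score(results, weights))

# A text without any #3 pause is handled the same way, divided by 3 levels:
print(prosody_score({1: [0.8, 0.9], 2: [0.7], 4: [1.0]}, weights))
```

Dropping a missing level from both the weighted sum and the divisor mirrors the variant formula for a text with no #3 pause.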
In summary, the speech evaluation method provided by the embodiment of the present disclosure obtains the level pause evaluation result of each prosody level of the audio to be tested from the evaluation pause duration of each prosody level combined with the reference pause duration of the same prosody level, and then obtains the audio prosody evaluation result of the whole audio to be tested. By comparing the tempo represented by the pause durations of the different prosody levels with that of the reference audio, the rhythm of the audio to be tested can be evaluated, and thereby its emotion; and since the comparison distinguishes the different prosody levels and evaluates each separately, the accuracy of the emotion evaluation of the audio to be tested is improved.
In the following, the speech evaluation apparatus provided by the embodiment of the present disclosure is introduced, and the speech evaluation apparatus described below may be regarded as a functional module architecture that is required to be set by an electronic device (e.g., a PC) to respectively implement the speech evaluation method provided by the embodiment of the present disclosure. The contents of the speech evaluation device described below can be referred to in correspondence with the contents of the speech evaluation method described above, respectively.
Fig. 5 is a block diagram of a speech evaluation device provided in an embodiment of the present disclosure, where the speech evaluation device is applicable to both a client and a server, and referring to fig. 5, the speech evaluation device may include:
the to-be-tested audio acquiring unit 50, adapted to acquire the audio to be tested and the reference text corresponding to the audio to be tested;
An evaluation pause duration obtaining unit 51 adapted to obtain an evaluation pause duration of each rhythm level of the audio to be detected according to the audio to be detected and the reference text;
a hierarchy pause evaluation result obtaining unit 52 adapted to obtain a hierarchy pause evaluation result of each prosody hierarchy according to an evaluation pause duration corresponding to the same prosody hierarchy and a predetermined reference pause duration;
and the audio prosody evaluation result obtaining unit 53 is adapted to obtain the audio prosody evaluation result of the audio to be tested according to the level pause evaluation result of each prosody level.
It can be seen that, in the speech evaluation device provided in the embodiment of the present disclosure, during speech evaluation, the to-be-tested audio acquiring unit 50 first acquires the audio to be tested and its corresponding reference text; the evaluation pause duration obtaining unit 51 then obtains the evaluation pause duration of each prosody level of the audio to be tested from the audio to be tested and the reference text; the level pause evaluation result obtaining unit 52 obtains the level pause evaluation result of each prosody level from the evaluation pause duration and the predetermined reference pause duration of the same prosody level; and finally the audio prosody evaluation result obtaining unit 53 obtains the audio prosody evaluation result of the audio to be tested from the level pause evaluation results of the prosody levels.
Therefore, the speech evaluation device provided by the embodiment of the present disclosure obtains the level pause evaluation result of each prosody level of the audio to be tested from the evaluation pause duration of that level combined with the reference pause duration of the same level, and from these obtains the audio prosody evaluation result of the whole audio to be tested.
For speech evaluation, the to-be-tested audio acquiring unit 50 acquires the audio to be tested and the reference text corresponding to the audio to be tested.
It is easy to understand that, in order to implement emotion evaluation on speech, the audio to be tested and its corresponding text are obtained first. They may be a student's reading audio and its text, a teacher's reading audio for teaching and its text, or the reading audio and text of any other reader.
It can be seen that the speech evaluation method provided by the present disclosure places no special limitation on the source of the audio to be tested and its corresponding text. This means the method can conveniently perform speech evaluation on any audio to be tested and obtain an evaluation result, so it applies to a wider range of scenarios and is not limited to evaluating speech read by students.
Evaluation pause duration acquiring unit 51: acquires the evaluation pause duration of each prosody level of the audio to be tested according to the audio to be tested and the reference text.
After the audio to be tested and its corresponding reference text are obtained, the evaluation pause duration of each prosody level of the audio to be tested is obtained next, in preparation for the subsequent emotion evaluation.
It is easy to understand that the evaluation pause duration of each prosody level of the audio to be tested refers to pause durations respectively obtained for each prosody level of the audio to be tested.
Specifically, the prosodic hierarchy may include four different types, prosodic word layer #1, prosodic phrase layer #2, intonation phrase layer #3, and sentence layer #4, wherein:
Prosodic word layer: the basic prosodic unit. There is no pause inside a prosodic word, and a pause at a prosodic word boundary is optional. In the unmarked case a prosodic word coincides with a lexical word, though in some cases it may be larger than a word.
Prosodic phrase layer: prosodic words combine into a prosodic phrase, which corresponds to one complete prosodic expression produced without taking a breath. Each prosodic phrase consists of one or more prosodic words; its length is generally about 7 syllables (one character is one syllable), varying by about 2 syllables, which is roughly the length of a breath group. Each prosodic phrase has a relatively stable short intonation pattern and phrase-stress configuration.
Intonation phrase layer: intonation phrases are the longest phonetic constituents, generally longer than prosodic phrases, and grammatically correspond to longer phrases or shorter sentences. They have a particular intonation pattern that may relate in some way to the syntactic or discourse structure.
Sentence layer: typically divided by punctuation.
For convenience of description in the cases below, #1 denotes the prosodic word layer, #2 the prosodic phrase layer, #3 the intonation phrase layer, and #4 the sentence layer.
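The four prosody levels and their #1–#4 boundary markers described above can be represented minimally as follows. This is an illustrative sketch; the names are not from the patent.

```python
from enum import IntEnum

class ProsodyLevel(IntEnum):
    PROSODIC_WORD = 1      # "#1": no pause required at the boundary
    PROSODIC_PHRASE = 2    # "#2": ~7 syllables, roughly one breath group
    INTONATION_PHRASE = 3  # "#3": longer phrase / shorter sentence
    SENTENCE = 4           # "#4": typically split by punctuation

# map each boundary marker string to its prosody level
MARKERS = {f"#{lvl.value}": lvl for lvl in ProsodyLevel}
print(MARKERS["#3"].name)  # INTONATION_PHRASE
```

The integer values deliberately match the marker digits, so a parsed marker like `#3` can be converted to its level with a single dictionary lookup.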
When the evaluation pause duration is obtained, each prosody level of the audio to be tested is obtained first, and then the pause duration of each prosody level is obtained.
The to-be-tested text processing unit 54 acquires each prosody level of the reference text and the pause position of each prosody level; determines the pause duration at the pause position of each prosody level according to the audio to be tested, to obtain each evaluation pause duration; and, when it is determined that the prosody levels and pause positions of the reference text have not yet been obtained, performs prosody level division on the reference text to obtain each prosody level of the reference text and the pause position of each prosody level.
In a specific implementation manner, in order to improve accuracy of an evaluation pause duration of each prosody level of an obtained audio to be tested, a speech evaluation method provided in an embodiment of the present disclosure may include:
firstly, obtaining each prosody level of the reference text and the pause position of each prosody level;
and then, determining the pause duration of the pause position of each rhythm level according to the audio to be tested to obtain each evaluation pause duration.
The prosody of speech is the part that expresses its emotion and rhythm, and is realized mainly through the pause durations of the different prosody levels. Accurately dividing the prosody levels and marking their pause positions therefore allows the emotion of the audio to be tested to be judged better from the dimension of the reading rhythm.
On the basis of the obtained prosody levels and their pause positions, the pause duration at the pause position of each prosody level is determined according to the audio to be tested, yielding the evaluation pause durations. Since the prosody levels and their pause positions reflect the reader's reading rhythm well, the obtained evaluation pause durations also reflect it well, so that the emotion evaluation of the speech can be realized, from the dimension of the reading rhythm, in a recitation evaluation centered on emotion.
Specifically, each prosody level of the reference text and each pause position of the prosody level may be acquired by:
First, it is determined from the reference text whether its prosody levels and their pause positions have already been obtained. If so, the obtained prosody levels are used directly; otherwise, prosody level division is performed on the reference text to obtain each prosody level of the reference text and the pause position of each prosody level.
It is easy to understand that the reference text corresponding to the reference audio may already have been divided into prosody levels with marked pause positions, in which case they can be obtained directly. If it is determined that the reference text has not previously been divided, prosody level division is performed on it first, after which its prosody levels and their pause positions can be obtained.
Specifically, it is possible to search a reference text library directly for a copy of the reference text whose prosody levels have already been divided and whose pause positions have already been obtained.
Thus, through this judgment, already-obtained prosody levels and pause positions of the reference text can be used directly, which avoids repeatedly performing prosody level division on the same reference text and improves efficiency.
In one embodiment, when it is determined that the prosody levels and the pause positions of the prosody levels of the reference text are not obtained, prosody level division may be performed on the reference text by using a prosody level division model trained in advance to obtain the prosody levels and the pause positions of the prosody levels of the reference text.
Specifically, a reference text corresponding to the audio to be tested is input into a trained prosody level division model, and each prosody level and a pause position of each prosody level can be obtained after the prosody level division model.
For the convenience of understanding the foregoing schemes, the following description will be made with reference to specific cases:
For example, for the sentence "Children's hospital, beep and dad together.", the example sentence is first input into the prosody level division model for prosody level division:
inputting: children's hospital, beep and dad together.
Performing prosody level division by a prosody level division model:
and (3) outputting: child #1 hospital #3, beep #1 and dad #2 together # 4.
And after obtaining each prosody level of the reference text and the pause position of each prosody level, further determining the evaluation pause duration of each prosody level according to the audio to be detected.
Continuing with the case: according to the audio to be tested, the pause duration between "child" and "hospital" is determined to obtain the evaluation pause duration of the first #1 prosody level; the pause duration between "hospital" and "beep" is then obtained as the evaluation pause duration of the #3 prosody level; and so on, until the pause durations of all prosody levels of the audio to be tested are obtained.
Therefore, by the method, on one hand, each prosody level of the audio to be detected can be accurately and conveniently acquired, and on the other hand, the determination of the evaluation pause duration can be facilitated.
Specifically, to obtain the evaluation pause durations conveniently, a forced alignment model can determine which word each speech frame of the audio to be tested belongs to, i.e. the number of speech frames corresponding to each word. Because speech frames have a fixed window length and frame shift (generally a 25 ms window and a 10 ms shift), the time point information of each word can then be obtained, including the silence time points of the pauses in the audio. For example:
Combining the above case, the result of forced alignment by the forced alignment model in "Child #1 hospital #3," is:
Frames 1 to 4 of the speech are the first character of "child", frames 5 to 7 the second character, with the #1 pause duration in between; frames 8 to 9 are the first character of "hospital" and frames 10 to 15 the second. That is, forced alignment determines the frame spans of "child", "hospital", "beep" and "dad" in the audio, as well as the silence durations at the silence time points of the pause positions. From the silence durations at the pause positions of the audio to be tested, the pause durations of each level of the audio to be tested are known, and from them the evaluation pause durations are obtained.
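The frame arithmetic above (25 ms window, 10 ms shift) can be sketched as follows. This is a minimal sketch under one assumption the patent does not state explicitly: the pause duration is approximated as the number of silence frames between two words multiplied by the frame shift. The frame indices in the example call are illustrative.

```python
FRAME_SHIFT_MS = 10   # typical frame shift
WINDOW_MS = 25        # typical window length (adjacent frames overlap)

def frame_start_ms(frame_idx):
    # 1-based frame index -> start time of that frame in milliseconds
    return (frame_idx - 1) * FRAME_SHIFT_MS

def pause_ms(prev_word_end_frame, next_word_start_frame):
    # silence frames strictly between the two aligned words,
    # each contributing one frame shift of silence
    n_silence = next_word_start_frame - prev_word_end_frame - 1
    return max(n_silence, 0) * FRAME_SHIFT_MS

# e.g. previous word ends at frame 7 and the next word starts at frame 12:
print(pause_ms(7, 12))  # 40 ms of silence at this boundary
```

With no silence frames between two words (adjacent frames), the pause duration is 0, i.e. no pause at that boundary.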
In a specific implementation, in the speech evaluation method according to the embodiment of the present disclosure, the step of determining the pause duration at the pause position of each prosody level according to the audio to be tested, to obtain each evaluation pause duration, may include:
and acquiring the pause duration of each audio frequency to be tested at different rhythm levels according to the audio frequency to be tested and the text to be tested which are mutually corresponding by utilizing a pre-trained forced alignment model.
And obtaining the evaluation pause duration corresponding to the prosody hierarchy according to each pause duration of the same prosody hierarchy to obtain the evaluation pause durations of different prosody hierarchies.
That is, the pause duration at every prosody boundary of the audio to be tested is obtained first, where "every" covers all boundaries, including multiple boundaries of the same prosody level as well as boundaries of different levels.
Combining the foregoing case again: Child #1 hospital #3, beep #1 and dad #2 together #4. It can be seen that in this case the #1 prosody level occurs in two places, and the total number of pause durations is 5.
In the previous case, #1 pauses occur in two places, child #1 and beep #1, with pause durations denoted A1 and A2 respectively. Correspondingly, the pause durations at the #2, #3 and #4 pause positions in the text to be tested give the evaluation pause durations of #2, #3 and #4, denoted B, C and D respectively.
Obtaining the evaluation pause duration of each rhythm level:
the evaluation pause duration of the first tier #1 is: A1.
the evaluation pause duration for the second tier #1 is: A2.
the evaluation pause duration for tier #2 is: B.
the evaluation pause duration for tier #3 is: C.
the evaluation dwell duration for tier #4 is: D.
Of course, in another embodiment, the evaluation pause duration may instead be computed as the average of the pause durations belonging to the same prosody level. In the foregoing case, #1 pauses occur in two places, child #1 and beep #1, with pause durations A1 and A2; the evaluation pause duration of the #1 level can then be the average of A1 and A2, denoted A. Correspondingly, the averages of the pause durations at the #2, #3 and #4 pause positions in the text to be tested give the evaluation pause durations of #2, #3 and #4, denoted B, C and D respectively.
Obtaining the evaluation pause duration of each rhythm level:
the evaluation pause duration for tier #1 is: A.
the evaluation pause duration for tier #2 is: B.
the evaluation pause duration for tier #3 is: C.
the evaluation dwell duration for tier #4 is: D.
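The two aggregation variants described in this section — keeping each pause of the same level separately (A1, A2, B, C, D) versus averaging pauses of the same level (A, B, C, D) — can be sketched as follows. The millisecond values are illustrative.

```python
from collections import defaultdict

# Pause durations (ms) at each marked boundary, in text order; values illustrative.
pauses = [("#1", 120), ("#3", 480), ("#1", 150), ("#2", 300), ("#4", 700)]

# Variant 1: keep every boundary's pause separately (A1, A2, B, C, D)
per_boundary = list(pauses)

# Variant 2: average the pauses belonging to the same prosody level (A, B, C, D)
by_level = defaultdict(list)
for level, duration in pauses:
    by_level[level].append(duration)
averaged = {level: sum(ds) / len(ds) for level, ds in by_level.items()}

print(averaged["#1"])  # 135.0 -> the single evaluation pause duration A for #1
```

Variant 2 collapses the two #1 pauses (120 ms and 150 ms) into one evaluation pause duration of 135 ms, while the other levels, each occurring once, are unchanged.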
Thus, by first dividing the prosody levels and marking their pause positions, and then determining the pause duration at the pause position of each prosody level according to the audio to be tested, the evaluation pause durations can be obtained conveniently, which facilitates the subsequent emotion evaluation of the audio to be tested.
The level pause evaluation result obtaining unit 52: obtains the level pause evaluation result of each prosody level according to the evaluation pause duration corresponding to the same prosody level and a predetermined reference pause duration.
After the evaluation pause durations are obtained, they are further combined with the reference pause durations to obtain the level pause evaluation result of each prosody level.
It is easy to understand that the reference pause duration is obtained from reference audio that meets the reading-quality requirement. Since the reference pause duration thus meets the predetermined quality requirement, it reflects the reading rhythm of the reference audio well, and the obtained level pause evaluation result of each prosody level can evaluate the audio to be tested more accurately.
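The patent does not fix a scoring formula at this point, so the comparison of an evaluation pause duration against the reference pause duration of the same level is sketched here with one plausible, purely illustrative choice: a score that decreases with the relative difference between the two durations.

```python
# Hedged sketch, not the patent's formula: score a prosody level by how close
# the evaluation pause duration is to the reference pause duration of that level.
def level_pause_score(eval_ms, ref_ms):
    if ref_ms <= 0:
        return 0.0
    rel_diff = abs(eval_ms - ref_ms) / ref_ms
    return max(0.0, 1.0 - rel_diff)  # 1.0 = pause matches the reference tempo

print(level_pause_score(150, 120))  # 0.75: 25% longer pause than the reference
```

A per-level score of this kind can then be combined across levels into the audio prosody evaluation result, e.g. by averaging or weighting the levels.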
In a specific embodiment, to facilitate obtaining the predetermined reference pause duration, please refer to fig. 6, where fig. 6 is another block diagram of the speech evaluation device provided in the embodiment of the present disclosure.
As shown in the figure, the reference pause duration of the speech evaluation method provided by the embodiment of the present disclosure can be obtained by the following steps:
a reference audio acquiring unit 60 that acquires each reference audio and each reference text corresponding to each reference audio, respectively;
To evaluate the audio to be tested more accurately, the acquisition of the reference pause duration is especially important: because the reference audio meets the predetermined quality requirement, a high-quality speech evaluation of the audio to be tested can be ensured.
It is easy to understand that the reference audio is reference reading audio that meets the quality requirement, and the reference text is the text corresponding to the reference audio, which may be obtained through speech recognition or other methods. To ensure the accuracy and applicability of the acquired reference pause duration, a large number of reference audios and reference texts are needed, so more than one reference audio and reference text are acquired.
The first pause duration acquiring unit 61: and obtaining first pause duration of each different prosody level of each reference audio according to the reference audio and the reference text which correspond to each other.
The first pause duration is obtained, so that the reference pause duration can be conveniently obtained subsequently.
After the reference audios and reference texts are obtained, the reference pause duration is derived from the information they contain. Because the number of reference audios is large, the reference texts corresponding to them contain many prosody levels, and the first pause durations of a large number of prosody levels are obtained in order to arrive at the reference pause duration finally used for the level pause evaluation result.
The first execution unit 62: executing the step of obtaining first pause durations of different prosody levels of the reference audios according to the reference audios and the reference texts which correspond to each other, wherein the step includes:
dividing prosody levels of the reference text to obtain each prosody level of the reference text and a pause position of each prosody level; and determining the pause duration of the pause position of each prosody hierarchy according to the reference audio corresponding to the reference text to obtain each first pause duration.
And carrying out prosody hierarchy division on the reference text to obtain each prosody hierarchy of the reference text and a pause position of each prosody hierarchy.
The prosody levels of the reference audio are divided first. The prosody of speech is the part that expresses its emotional rhythm, and is realized mainly through the pause durations of the different prosody levels. By obtaining the prosody levels of the reference text and their pause positions, they can be compared with the prosody levels and pause positions of the audio to be tested, so that the emotion of the audio to be tested can be evaluated better from the dimension of the reading rhythm.
The second execution unit 63: the step of performing prosody hierarchy division on the reference text to obtain each prosody hierarchy of the reference text and a pause position of each prosody hierarchy includes:
performing prosody level division on the reference text by using a prosody level division model, to obtain each prosody level of the reference text and the pause position of each prosody level, where the prosody levels are the same as those of the audio to be tested and comprise a prosodic word layer, a prosodic phrase layer, an intonation phrase layer and a sentence layer.
For example:
for ease of understanding, the following description will be made with reference to the case again:
the following example sentences are input into a prosody level division model, and the prosody level division model performs prosody level division on the example sentences, and the output result is as follows:
Input:
Example 1: The website of Li was quickly sealed by the police.
Example 2: Children's hospital, beep and dad together.
Example 3: Examination shows that the elderly person has mild cerebral thrombosis.
Output:
Example 1: Li's #1 website #3 was quickly #2 sealed by the police #1 #4.
Example 2: Child #1 hospital #3, beep #1 and dad #2 together #4.
Example 3: Upon examination #3, the elderly person had #1 mild #2 cerebral thrombosis #4.
In this way, by dividing the prosody hierarchy of the sentence through the prosody hierarchy dividing model, the classification of the prosody hierarchy of the sentence and the pause position can be obtained, for example:
In Example 1: after "Li" there is a #1 pause (prosodic word layer); after "website" a #3 pause (intonation phrase layer); after "quickly" a #2 pause (prosodic phrase layer); after "police" a #1 pause (prosodic word layer); and after the sentence-final period a #4 pause (sentence layer).
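The classification of pause positions per prosody level described above can be recovered mechanically from the model's annotated output. A minimal sketch (the regex tolerates spacing variations such as "# 4"; the parsing function is illustrative, not from the patent):

```python
import re

def parse_boundaries(annotated):
    # Split an annotated sentence such as
    # "Child #1 hospital #3, beep #1 and dad #2 together #4."
    # into (preceding text, prosody level) pairs, i.e. pause positions per level.
    return [(m.group(1).strip(" ,."), int(m.group(2)))
            for m in re.finditer(r"([^#]+)#\s*(\d)", annotated)]

out = parse_boundaries("Child #1 hospital #3, beep #1 and dad #2 together #4.")
print(out)
# [('Child', 1), ('hospital', 3), ('beep', 1), ('and dad', 2), ('together', 4)]
```

From such pairs, the pause position of each marker and its prosody level are directly available for the subsequent pause-duration lookup.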
Thus, by dividing the prosody levels of the reference text with the prosody level division model, each prosody level of the reference text and the pause position of each prosody level are obtained, which captures the rhythm variation of the reference text accurately. Moreover, because the prosody level division model meets the predetermined quality requirement during training, high-quality prosody levels and pause positions of the reference text are guaranteed, so that the rhythm variation of the reference text can be grasped with high quality. This provides a basis for realizing the emotion evaluation of speech from the dimension of the reading rhythm.
Prosody level division model training unit 64: adapted to train the prosody level division model, which can be trained through the following steps:
1) and obtaining a sample text training set, wherein the sample text training set comprises a to-be-detected sample text and a reference sample text which correspond to each other, and each reference prosody level is marked on the reference sample text.
To train the prosody level division model, a sample text training set is acquired. It comprises mutually corresponding to-be-tested sample texts and reference sample texts; each reference sample text is labeled with reference prosody levels and meets the predetermined quality requirements, so that the predicted prosody levels of the to-be-tested sample text can be checked against the reference prosody levels of reference sample texts that meet the quality requirements.
2) And acquiring the prediction prosody hierarchy of the sample text to be detected according to the sample text to be detected by utilizing the prosody hierarchy dividing model.
According to the sample text to be detected, the prosody level division model obtains a preliminary prediction prosody level of the sample text to be detected.
3) And determining a first loss of the sample text to be detected according to the prediction prosody level and the reference prosody level.
The first loss of the to-be-tested sample text is calculated from the predicted prosody level and the reference prosody level. At this stage the obtained predicted prosody level may still be imperfect, and its quality is judged from the first loss.
4) Determine whether the first loss meets a predetermined threshold; if yes, execute step 6), and if no, execute step 5).
The quality of the predicted prosody level is judged by the first loss: if the first loss does not meet the predetermined threshold, step 5) is executed; once the first loss meets the predetermined threshold, step 6) is executed.
5) And adjusting parameters of the prosodic hierarchy partitioning model.
If the first loss does not meet the threshold, the parameters of the prosody level division model are adjusted according to the first loss and the reference prosody levels of the reference sample texts that meet the predetermined quality requirement, so that the model produces a new, better predicted prosody level for the to-be-tested sample text under the new parameters.
6) And obtaining the prosody hierarchy classification model after training.
And obtaining a prosody hierarchy classification model after training through the training of the steps.
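Training steps 1)–6) above form a predict / compute-loss / threshold-check / adjust loop. The framework-free skeleton below sketches that control flow; `ToyModel`, its `quality` parameter, and the loss definition are stand-ins for the real model and optimizer, used only to make the loop runnable.

```python
def train_prosody_model(model, samples, references, loss_threshold, max_iters=1000):
    for _ in range(max_iters):
        predicted = [model.predict(s) for s in samples]   # step 2): predict levels
        loss = model.first_loss(predicted, references)    # step 3): first loss
        if loss <= loss_threshold:                        # step 4): threshold check
            return model                                  # step 6): trained model
        model.adjust_parameters(loss)                     # step 5): adjust params
    return model

class ToyModel:
    # Illustrative stand-in: "learns" by raising a quality parameter.
    def __init__(self):
        self.quality = 0.0
    def predict(self, sample):
        return sample if self.quality >= 1.0 else ""
    def first_loss(self, predicted, references):
        wrong = sum(p != r for p, r in zip(predicted, references))
        return wrong / len(references)
    def adjust_parameters(self, loss):
        self.quality += 0.5  # stand-in for a gradient step

trained = train_prosody_model(ToyModel(), ["a#1 b#4"], ["a#1 b#4"], 0.0)
```

The loop terminates either when the first loss meets the threshold (step 6) or after a safety cap on iterations.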
Thus, during training, the prosody level division model can use the reference prosody levels of the reference sample texts together with the to-be-tested sample texts to learn the characteristics of the reference prosody level division, so as to produce predicted prosody levels of the to-be-tested sample texts that carry those characteristics.
And determining the pause duration of the pause position of each prosody hierarchy according to the reference audio corresponding to the reference text to obtain each first pause duration.
After the pause positions of the prosody hierarchies corresponding to the reference text are obtained, pause durations of the pause positions of the prosody hierarchies are further determined by combining the reference audio corresponding to the reference text, and a first pause duration is obtained.
The prosody levels of the reference text are divided accurately first, yielding accurate prosody levels and pause positions; on this basis the first pause durations are obtained from the reference audio. The accurate prosody levels and pause positions reflect the reader's reading rhythm well.
The third execution unit 65: executing the step of determining the pause duration of the pause position of each prosody hierarchy according to the reference audio corresponding to the reference text to obtain each first pause duration, including:
and determining the pause duration of the pause position of each prosody hierarchy in the reference audio corresponding to the reference text by using a forced alignment model to obtain each first pause duration.
The forced alignment model has strict quality requirements in the training process, so that each first pause duration meeting the quality requirements can be obtained through the forced alignment model, and the high quality of emotion assessment of the voice is further ensured from the dimension of reading rhythm.
In a specific embodiment, the prosody hierarchy partitioning is performed on the reference text by using a prosody hierarchy partitioning model, and then each first pause duration is obtained according to the reference audio by using a forced alignment model, for example:
firstly, inputting example sentences into a prosody hierarchy dividing model for prosody hierarchy dividing:
inputting: the examination shows that the old has slight cerebral thrombosis.
Performing prosody level division by a prosody level division model:
and (3) outputting: upon examination #3, the elderly had #1 mild #2 cerebral thrombosis # 4.
Then, through the forced alignment model and according to the reference audio, it can be determined which word each speech frame belongs to, i.e. the number of speech frames corresponding to each word. Because speech frames have a fixed window length and frame shift (generally a 25 ms window and a 10 ms shift), the time point information of each word is then obtained, again including the silence time points of the pauses in the audio. For example:
as a result of forced alignment by the forced alignment model, in "through examination #3, the elderly:
Frames 1 to 4 of the speech are the first character of "examination", frames 5 to 7 the second, and frames 8 to 9 the third, followed by the #3 pause duration; frames 10 to 15 are the "elderly" character and frames 16 to 18 the "person" character. That is, forced alignment determines the frame spans of "examination", "the elderly", "mild" and "cerebral thrombosis" in the audio, as well as the silence durations at the silence time points of the pause positions. From the silence durations at the pause positions of the reference audio, the pause durations of each level of the reference audio are known, and these are the first pause durations of each prosody level.
Forced alignment model training unit 66: adapted to train the forced alignment model.
the training step of the forced alignment model comprises the following steps:
1) and acquiring a training text and a training audio corresponding to the training text, wherein each character of the training text is marked with a reference audio frame corresponding to the character.
In order to train the forced alignment model, a training text and a training audio corresponding to the training text need to be acquired, each character of the training text is labeled with a reference audio frame corresponding to the character, and specifically, the correspondence between the reference audio frame and the character can be acquired in a manual labeling manner.
2) And acquiring a predicted audio frame corresponding to each character according to the training text and the training audio corresponding to the training text through the forced alignment model.
The forced alignment model may obtain the predicted audio frame corresponding to each of the characters according to the training text and the training audio corresponding to the training text, but at this time, the obtained predicted audio frame corresponding to the character may be different from the reference audio frame, and further adjustment is required.
3) And acquiring a second loss according to the predicted audio frames corresponding to the same character and the reference audio frames, and adjusting the parameters of the forced alignment model according to the second loss until the second loss meets a preset threshold value to obtain the trained forced alignment model.
And acquiring a second loss according to the predicted audio frame and the reference audio frame corresponding to the same character, if the second loss does not meet a preset threshold, adjusting parameters of the forced alignment model according to the second loss, and acquiring the predicted audio frame again until the second loss meets the preset threshold, namely the similarity of the predicted audio frame and the reference audio frame corresponding to the same character meets the requirement, so as to obtain the trained forced alignment model.
Therefore, in the training process of the forced alignment model, the model can learn the corresponding relation between the audio frame and the characters by using the reference audio frame marked by each character of the training text, so that when the audio and the text to be aligned are aligned, the audio frame corresponding to the character and meeting the accuracy requirement can be obtained.
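The three training steps above (obtain labeled data, predict, adjust until the second loss meets the threshold) can be sketched as follows. This is a hypothetical illustration only: the model is reduced to a single offset parameter and the "second loss" to a mean absolute frame error, neither of which is the patent's actual implementation.

```python
# Illustrative sketch of the forced-alignment training loop; the model,
# loss, and update rule are placeholders, not the patented implementation.

def second_loss(predicted_frames, reference_frames):
    # Mean absolute difference between predicted and labeled frame indices.
    pairs = zip(predicted_frames, reference_frames)
    return sum(abs(p - r) for p, r in pairs) / len(reference_frames)

def train_forced_alignment(reference_frames, threshold=0.5, lr=0.5, max_steps=100):
    offset = 3.0  # the toy model's only parameter, initially misaligned
    loss = float("inf")
    for _ in range(max_steps):
        # Step 2): predict an audio frame for each character.
        predicted = [r + offset for r in reference_frames]
        # Step 3): compute the second loss against the reference frames.
        loss = second_loss(predicted, reference_frames)
        if loss <= threshold:  # stop once the loss meets the preset threshold
            break
        offset -= lr * offset  # adjust the model parameter based on the loss
    return offset, loss
```

As in the patent's step 3), training stops only when the loss between predicted and reference frames falls below the preset threshold.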
As can be seen, in the process of obtaining each first pause duration, prosody hierarchy division is performed on the reference text by a trained prosody hierarchy division model to obtain each prosody level of the reference text and the pause position of each level, and a trained forced alignment model then determines the pause duration at each of those pause positions in the reference audio corresponding to the reference text, yielding the first pause durations of each prosody level. Because both models are trained under strict quality requirements, the prosody levels and pause positions obtained for the reference text are of high quality, and so are the first pause durations. Therefore, in recitation evaluation centered on emotion, the high quality of the emotion evaluation of the audio to be tested, performed from the dimension of reading rhythm, can be ensured.
The reference pause duration acquiring unit 67: adapted to obtain the reference pause duration corresponding to a prosody level from the first pause durations of that level, so as to obtain the reference pause durations of the different prosody levels, wherein the reference pause duration of a prosody level is the average, the most frequent value, or a weighted average of the first pause durations of that level.
And acquiring the reference pause duration corresponding to the prosody hierarchy according to the first pause durations of the same prosody hierarchy to obtain the reference pause durations of different prosody hierarchies.
After each first pause duration is obtained, the reference pause duration of each prosody level is obtained. It is easy to understand that the number of reference pause durations equals the number of prosody-level types; in one embodiment there are four prosody levels, so there are also four reference pause durations, i.e., one reference pause duration per prosody level.
Since the reference audio meets the preset quality requirement, the first pause durations obtained for the different prosody levels also meet the quality requirement, and therefore the reference pause durations obtained from the first pause durations of the same prosody level meet the quality requirement as well.
In another specific implementation, the reference pause duration in the speech evaluation method provided by the embodiment of the present disclosure may include the following:
an average, a most frequent value, or a weighted average of the first pause durations of the same prosody level.
Continuing with the above case, the reference pause duration is obtained by using the average value of the first pause durations of the same prosody level:
example 1: li certain #1 website #3 was soon #2 verified by police #1 # 4.
Example 2: child #1 hospital #3, beep #1 and dad #2 together # 4.
Example 3: upon examination #3, the elderly had #1 mild #2 cerebral thrombosis # 4.
The first pause duration at each pause level in the example sentences is obtained from the forced alignment model and the reference audio. Across the three examples, #1 appears at five places. If, in one specific embodiment, the reference pause duration of a prosody level is the average of its first pause durations, the reference pause duration of #1 is the average of the five first pause durations of #1, denoted a; correspondingly, averaging the first pause durations at #2, #3 and #4 in the reference texts gives the reference pause durations of #2, #3 and #4, denoted b, c and d respectively.
Obtaining the reference pause duration of each prosody level:
The reference pause duration of level #1 is: a.
The reference pause duration of level #2 is: b.
The reference pause duration of level #3 is: c.
The reference pause duration of level #4 is: d.
Of course, to ensure accuracy, the first pause durations are obtained over a large number of reference texts and reference audios: the average of many first pause durations of #1 gives the reference pause duration of #1; the average of many first pause durations of #2 gives the reference pause duration of #2; and likewise the averages over many first pause durations of #3 and #4 give the reference pause durations of #3 and #4.
In other embodiments, the reference pause duration may instead be the most frequent value or a weighted average of the first pause durations.
Specifically, when the reference pause duration is the most frequent value of the first pause durations, for example: counting the first pause durations of prosody level #1, the specific values include t1, t2 and t3; if t1 occurs most often, t1 is taken as the reference pause duration of prosody level #1.
When the reference pause duration is a weighted average of the first pause durations, for example: counting the first pause durations of prosody level #1, the specific values include t1, t2 and t3, with t1 occurring most often and t3 least often; t1 is therefore given the largest weight and t3 the smallest, the weighted average of the three is calculated, and that weighted average is taken as the reference pause duration of prosody level #1.
Taking the average of the first pause durations of the same prosody level makes the reference pause duration easy to obtain, with a simple calculation. Taking the most frequent value of the first pause durations of the same prosody level as the reference pause duration makes the reference better fit the typical case. Taking a weighted average of the first pause durations of the same prosody level as the reference pause duration lets every pause duration contribute, but with different degrees of influence, which ensures the accuracy of the obtained reference pause duration.
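The three aggregation options above (average, most frequent value, weighted average) might be sketched as follows; the function name and interface are illustrative, not taken from the patent.

```python
from collections import Counter

def reference_pause(durations, method="mean", weights=None):
    """Aggregate the first pause durations of one prosody level into a
    reference pause duration.

    method: "mean"     - average of the durations (simple to compute)
            "mode"     - most frequent duration (fits the typical case)
            "weighted" - weighted average; `weights` maps duration -> weight.
                         By default each value is weighted by its occurrence
                         count, as in the example where t1 occurs most often
                         and therefore receives the largest weight.
    """
    if method == "mean":
        return sum(durations) / len(durations)
    if method == "mode":
        return Counter(durations).most_common(1)[0][0]
    if method == "weighted":
        if weights is None:
            weights = Counter(durations)
        values = set(durations)
        total = sum(weights[d] for d in values)
        return sum(d * weights[d] for d in values) / total
    raise ValueError("unknown method: %s" % method)
```

For instance, with durations [0.2, 0.2, 0.3] seconds, "mean" yields about 0.233 while "mode" yields 0.2.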
Therefore, to evaluate the audio to be tested more accurately, acquiring the reference pause duration is particularly important. Since the reference audio meets the preset quality requirement, the first pause durations obtained for the different prosody levels also meet the quality requirement, and so do the reference pause durations obtained from the first pause durations of the same prosody level. Meanwhile, the first pause durations and the reference pause durations of the different prosody levels reflect the reading rhythm of the reader well, so that in recitation evaluation centered on emotion, the emotion of the speech can be evaluated from the dimension of reading rhythm.
The fourth execution unit 55: the method for obtaining the level pause evaluation result of each prosody level according to the evaluation pause duration corresponding to the same prosody level and the preset reference pause duration comprises the following steps: obtaining the maximum value and the minimum value of the evaluation pause duration and the reference pause duration, obtaining the ratio of the minimum value to the maximum value, and obtaining the level pause evaluation result; or obtaining an absolute value of the difference between the evaluation pause duration and the reference pause duration, obtaining a maximum value between the evaluation pause duration and the reference pause duration, and obtaining a ratio of the absolute value to the maximum value to obtain the level pause evaluation result.
And obtaining the evaluation pause duration corresponding to the same prosody level and a preset reference pause duration, and obtaining the level pause evaluation result of each prosody level based on the evaluation pause duration and the preset reference pause duration.
In one embodiment, in order to obtain the result of evaluating the level pause of each prosody level, the following steps are performed:
obtaining the maximum value and the minimum value of the evaluation pause duration and the reference pause duration, obtaining the ratio of the minimum value to the maximum value, and obtaining the level pause evaluation result;
or obtaining an absolute value of the difference between the evaluation pause duration and the reference pause duration, obtaining a maximum value between the evaluation pause duration and the reference pause duration, and obtaining a ratio of the absolute value to the maximum value to obtain the level pause evaluation result.
For convenience of understanding, the following will be described in conjunction with the foregoing cases, respectively:
Denote the evaluation pause durations of prosody levels #1, #2, #3, #4 of the audio to be tested as A, B, C, D respectively, and the preset reference pause durations of levels #1, #2, #3, #4 as a, b, c, d respectively.
When the level pause evaluation result is obtained as the ratio of the minimum to the maximum of the evaluation pause duration and the reference pause duration:
the formula is as follows:
A1 = min(A, a) / max(A, a)
B2 = min(B, b) / max(B, b)
C3 = min(C, c) / max(C, c)
D4 = min(D, d) / max(D, d)
wherein: a1 is the result of the level pause evaluation of level #1, B2 is the result of the level pause evaluation of level #2, C3 is the result of the level pause evaluation of level #3, and D4 is the result of the level pause evaluation of level # 4.
The values of A1, B2, C3 and D4 lie in [0,1]. The closer the score is to 1, the closer the evaluation pause duration is to the reference pause duration and the more standard the pronunciation; the closer the score is to 0, the greater the difference between the two and the less standard the pronunciation.
In another specific implementation manner, when an absolute value of a difference between the evaluation pause duration and the reference pause duration is obtained, a maximum value between the evaluation pause duration and the reference pause duration is obtained, and a ratio of the absolute value to the maximum value is used as a level pause evaluation result, a formula is as follows:
A1 = |A - a| / max(A, a)
B2 = |B - b| / max(B, b)
C3 = |C - c| / max(C, c)
D4 = |D - d| / max(D, d)
wherein: a1 is the result of the level pause evaluation of level #1, B2 is the result of the level pause evaluation of level #2, C3 is the result of the level pause evaluation of level #3, and D4 is the result of the level pause evaluation of level # 4.
The values of A1, B2, C3 and D4 lie in [0,1]. The closer the score is to 0, the closer the evaluation pause duration is to the reference pause duration and the more standard the pronunciation; the closer the score is to 1, the greater the difference between the two and the less standard the pronunciation.
When the level pause evaluation result is obtained as the ratio of the minimum to the maximum of the evaluation pause duration and the reference pause duration, the formula is simple: only the ratio of the two durations is needed.
In another specific implementation, when the level pause evaluation result is the ratio of the absolute difference between the evaluation pause duration and the reference pause duration to the maximum of the two, the absolute difference is obtained first, which makes the result more direct: the larger the difference between the evaluation pause duration and the reference pause duration, the less standard the pronunciation.
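Both scoring formulas above follow directly from their descriptions (min/max ratio, and absolute difference over the maximum); a minimal sketch:

```python
def score_ratio(eval_dur, ref_dur):
    # min/max ratio: 1.0 means the evaluated pause matches the reference
    # exactly; values near 0 mean a large mismatch.
    return min(eval_dur, ref_dur) / max(eval_dur, ref_dur)

def score_diff(eval_dur, ref_dur):
    # |difference| / max: 0.0 means an exact match; values near 1 mean
    # a large mismatch.
    return abs(eval_dur - ref_dur) / max(eval_dur, ref_dur)
```

Applied per level, score_ratio(A, a) gives A1 in the first formulation and score_diff(A, a) gives A1 in the second; both outputs lie in [0,1].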
It can be seen that the speech evaluation method provided by the present disclosure may adopt different methods for obtaining the level pause evaluation result in specific embodiments, which provides diversity and flexibility in how that result is obtained.
The audio prosody evaluation result obtaining unit 53: and the audio rhythm evaluation result of the audio to be tested is obtained according to the level pause evaluation result of each rhythm level.
And further obtaining the audio rhythm evaluation result of the whole audio to be tested after obtaining the pause evaluation result of each level of the audio to be tested.
To evaluate the speech to be tested, the fifth execution unit 56 is adapted to obtain the audio prosody evaluation result of the audio to be tested from the level pause evaluation results of the prosody levels, comprising: averaging the level pause evaluation results of the same prosody level of the audio to be tested to obtain the average score of each prosody level; obtaining the weights of the different prosody levels; and obtaining the prosody evaluation result of the audio to be tested from the average scores of the different prosody levels, the weights of the different prosody levels, and the number of prosody-level categories.
The method specifically comprises the following steps:
averaging the pause evaluation results of the same prosody level of the audio to be tested to obtain average scores of different prosody levels of the audio to be tested;
obtaining weights of different prosody levels;
and obtaining a prosody evaluation result of the audio to be tested according to the average score of different prosody levels of the audio to be tested, the weight of each different prosody level and the number of categories of different prosody levels.
Denote the evaluation pause durations of prosody levels #1, #2, #3, #4 of the audio to be tested as A, B, C, D respectively, and the preset reference pause durations of levels #1, #2, #3, #4 as a, b, c, d respectively; the obtained level pause evaluation results are: A1 for level #1, B2 for level #2, C3 for level #3, and D4 for level #4.
It is easily understood that a plurality of levels #1, #2, #3, and #4 may be included in the audio to be tested.
For example:
example 1: li certain #1 website #3 was soon #2 verified by police #1 # 4.
Two levels #1 are included in example 1.
Assuming that the audio to be tested includes n1 levels #1, n2 levels #2, n3 levels #3, and n4 levels #4, first, averaging the pause evaluation results of the same prosody level of the audio to be tested to obtain average scores of different prosody levels of the audio to be tested, that is: averaging different A1 corresponding to n1 levels #1 to obtain A11, averaging different B2 corresponding to n2 levels #2 to obtain B22, averaging different C3 corresponding to n3 levels #3 to obtain C33, and averaging different D4 corresponding to n4 levels #4 to obtain D44.
Then, the weights of prosody levels #1, #2, #3 and #4 are obtained, denoted r1, r2, r3 and r4. In a specific embodiment, the weight of each prosody level can be adjusted according to different requirements to obtain a score emphasizing a certain level; for example, the prosodic phrase layer and the intonation phrase layer reflect the rhythm of pronunciation better, so the weights of these two levels can be raised appropriately to obtain a more accurate evaluation result.
And finally, obtaining a prosody evaluation result of the audio to be tested according to the average scores of the different prosody levels of the audio to be tested, the weight of each different prosody level and the number of different prosody levels.
The calculation formula is as follows:
Score = (r1 × A11 + r2 × B22 + r3 × C33 + r4 × D44) / 4

where 4 is the number of prosody-level categories present in the reference text.
if there is no prosody level #3 in a reference text, the corresponding formula is:
Score = (r1 × A11 + r2 × B22 + r4 × D44) / 3
of course, in another specific embodiment, if an average value of multiple pause durations of the same prosody hierarchy is obtained in the process of obtaining the evaluation pause durations of each prosody hierarchy, there is only one hierarchy pause evaluation result corresponding to one prosody hierarchy, and at this time, in order to evaluate the speech to be evaluated, the method specifically includes:
obtaining weights of different prosody levels;
and obtaining a rhythm evaluation result of the audio to be tested according to the level pause evaluation results of different rhythm levels of the audio to be tested, the weight of each different rhythm level and the number of categories of different rhythm levels.
It can be seen that, in obtaining the prosody evaluation result of the audio to be tested, the average of the level pause evaluation results of the same prosody level is obtained first, which ensures that every level pause evaluation result of that level contributes to the final result and none is overlooked as unimportant. Then the weights of the different prosody levels are obtained; in a specific embodiment the weight of each prosody level can be adjusted as required, which ensures the accuracy of the final prosody evaluation result. Finally, the prosody evaluation result of the audio to be tested is obtained from the average scores of the different prosody levels, the weights of the different prosody levels, and the number of prosody-level categories. In the calculation formula, the number of prosody-level categories obtained from the text does not affect the accuracy of the result, so the prosody evaluation results obtained for various different audios to be tested remain accurate.
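The final combination step can be sketched as below, assuming (as reconstructed from the description) that the weighted sum of per-level average scores is normalized by the number of prosody-level categories present in the text; the function and key names are illustrative, not from the patent.

```python
def prosody_score(level_scores, level_weights):
    """Combine per-level average pause scores (e.g. {"#1": A11, ...}) with
    per-level weights (e.g. {"#1": r1, ...}). Levels absent from the text,
    such as a missing #3, simply drop out and the divisor shrinks with them."""
    levels = [lvl for lvl in level_scores if lvl in level_weights]
    n = len(levels)  # number of prosody-level categories in this text
    return sum(level_weights[lvl] * level_scores[lvl] for lvl in levels) / n
```

With three levels present (no #3), the divisor becomes 3, matching the reduced formula for a text lacking prosody level #3.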
In summary, the speech evaluation method provided by the embodiments of the present disclosure obtains the level pause evaluation result of each prosody level of the audio to be tested from the evaluation pause duration of each prosody level combined with the reference pause duration of the same prosody level, and then obtains the audio prosody evaluation result of the whole audio to be tested. By comparing the tempo represented by the pause durations of the different prosody levels with the tempo of the reference audio, the rhythm of the audio to be tested can be evaluated, and hence its emotion; and since the comparison distinguishes the different prosody levels and evaluates each separately, the accuracy of the emotion evaluation of the audio to be tested is improved.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
In the computer-executable instructions stored in the non-transitory computer-readable storage medium provided by the exemplary embodiment of the present disclosure, when performing voice evaluation, first, a to-be-tested audio and a reference text corresponding to the to-be-tested audio are obtained; then obtaining the evaluation pause duration of each rhythm level of the audio to be tested according to the audio to be tested and the reference text; obtaining the level pause evaluation result of each rhythm level according to the evaluation pause duration corresponding to the same rhythm level and a preset reference pause duration; and finally, acquiring an audio rhythm evaluation result of the audio to be tested according to the level pause evaluation result of each rhythm level. It can be seen that the speech evaluation method provided by the embodiment of the present disclosure obtains the level pause evaluation results of each prosody level of the audio to be evaluated by using the evaluation pause durations of each prosody level of the audio to be evaluated and combining the reference pause durations corresponding to the same prosody level, and further obtains the audio prosody evaluation result of the whole audio to be evaluated.
Referring to fig. 7, a block diagram of a structure of an electronic device 800, which may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800, and may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 807 can be any type of device capable of presenting information and can include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 executes the respective methods and processes described above. For example, in some embodiments, methods S10-S13 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured to perform the methods S10-S13 by any other suitable means (e.g., by way of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the disclosed embodiments are disclosed above, the disclosure is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the disclosure, and it is intended that the scope of the disclosure be limited only by the claims appended hereto.

Claims (15)

1. A speech evaluation method, comprising:
acquiring audio to be evaluated and a reference text corresponding to the audio to be evaluated;
acquiring an evaluation pause duration for each prosody level of the audio to be evaluated according to the audio to be evaluated and the reference text;
obtaining a level pause evaluation result for each prosody level according to the evaluation pause duration corresponding to the same prosody level and a predetermined reference pause duration;
obtaining an audio prosody evaluation result of the audio to be evaluated according to the level pause evaluation result of each prosody level;
wherein the step of obtaining the reference pause duration comprises:
acquiring reference audios and the reference texts respectively corresponding to the reference audios;
obtaining first pause durations for the different prosody levels of each reference audio according to the mutually corresponding reference audios and reference texts;
and acquiring, for each prosody level, the reference pause duration corresponding to that prosody level according to the first pause durations of the same prosody level, so as to obtain the reference pause durations of the different prosody levels.
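The overall flow of claim 1 can be sketched as follows. This is a non-authoritative illustration with hypothetical helper names; it uses the min/max ratio score from one variant of claim 10 and a plain mean as the final aggregation, neither of which is mandated by claim 1 itself:

```python
from statistics import mean

def evaluate_prosody(eval_pauses, ref_pauses):
    """eval_pauses / ref_pauses: dict mapping a prosody level name to a
    pause duration in seconds (evaluated audio vs. reference)."""
    level_scores = {}
    for level, eval_d in eval_pauses.items():
        ref_d = ref_pauses[level]
        # Level pause evaluation result: ratio of the smaller to the larger
        # duration, so 1.0 means the learner's pause matches the reference.
        level_scores[level] = min(eval_d, ref_d) / max(eval_d, ref_d)
    # Audio prosody evaluation result: here simply the mean of level scores.
    return level_scores, mean(level_scores.values())
```

A pause twice as long (or half as long) as the reference thus scores 0.5 at that level, and the per-level scores are then combined into a single audio-level result.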
2. The speech evaluation method according to claim 1, wherein the step of obtaining the first pause durations for the different prosody levels of each reference audio according to the mutually corresponding reference audios and reference texts comprises:
performing prosody level division on the reference text to obtain each prosody level of the reference text and a pause position of each prosody level;
and determining the pause duration at the pause position of each prosody level according to the reference audio corresponding to the reference text to obtain each first pause duration.
3. The speech evaluation method according to claim 2, wherein the step of performing prosody level division on the reference text to obtain each prosody level of the reference text and a pause position of each prosody level comprises:
performing prosody level division on the reference text by using a prosody level division model to obtain each prosody level of the reference text and a pause position of each prosody level, wherein the prosody levels comprise a prosodic word layer, a prosodic phrase layer, an intonation phrase layer and a sentence layer.
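The output of such a division model is commonly represented as a break index at each inter-character position; the sketch below assumes a hypothetical 0–4 encoding (0 = no break, 1 = prosodic word, 2 = prosodic phrase, 3 = intonation phrase, 4 = sentence end) that mirrors a widespread Chinese prosody-annotation convention, not an encoding taken from the patent:

```python
def pause_positions_by_level(break_indices, level):
    """break_indices: break index after each character of the text.
    Returns the positions whose break index equals the requested level."""
    return [i for i, b in enumerate(break_indices) if b == level]
```

With this representation, the pause positions of each prosody level are obtained by a single pass over the break indices.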
4. The speech evaluation method according to claim 3, wherein the training process of the prosody level division model comprises:
acquiring a sample text training set, wherein the sample text training set comprises mutually corresponding sample texts to be evaluated and reference sample texts, and each reference sample text is annotated with reference prosody levels;
and obtaining, by the prosody level division model, predicted prosody levels of a sample text to be evaluated according to that sample text, determining a first loss of the sample text according to the predicted prosody levels and the reference prosody levels, and adjusting parameters of the prosody level division model according to the first loss until the first loss meets a preset threshold, so as to obtain a trained prosody level division model.
5. The speech evaluation method according to claim 2, wherein the step of determining the pause duration at the pause position of each prosody level according to the reference audio corresponding to the reference text to obtain each first pause duration comprises:
determining, by using a forced alignment model, the pause duration at the pause position of each prosody level in the reference audio corresponding to the reference text, to obtain each first pause duration.
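Once a forced aligner has produced per-token time boundaries, the pause duration at a prosodic boundary is simply the silence gap between the end of one token and the start of the next. A minimal sketch, assuming the aligner returns `(token, start_s, end_s)` triples in text order (names hypothetical):

```python
def pause_durations(alignment, pause_positions):
    """alignment: list of (token, start_s, end_s) triples in text order.
    pause_positions: indices i meaning a prosodic pause falls between
    token i and token i + 1.
    Returns the silence gap, in seconds, at each pause position."""
    durations = []
    for i in pause_positions:
        # Gap = start of the next token minus end of the previous token;
        # clamp at zero in case the aligner produces touching intervals.
        gap = alignment[i + 1][1] - alignment[i][2]
        durations.append(max(gap, 0.0))
    return durations
```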
6. The speech evaluation method according to claim 5, wherein the training process of the forced alignment model comprises:
acquiring a training text and a training audio corresponding to the training text, wherein each character of the training text is annotated with the reference audio frames corresponding to that character;
acquiring, by the forced alignment model, predicted audio frames corresponding to each character according to the training text and the training audio corresponding to the training text;
and acquiring a second loss according to the predicted audio frames and the reference audio frames corresponding to the same character, and adjusting parameters of the forced alignment model according to the second loss until the second loss meets a preset threshold, so as to obtain a trained forced alignment model.
7. The speech evaluation method according to claim 1, wherein the reference pause duration of a prosody level comprises the average value, the most frequently occurring value, or a weighted average value of the first pause durations of the same prosody level.
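The three statistics named in claim 7 can be sketched as follows; the function name and `method` parameter are illustrative, not from the patent:

```python
from collections import Counter
from statistics import mean

def reference_pause(first_pauses, method="mean", weights=None):
    """first_pauses: first pause durations observed at one prosody level
    across the reference audios. Returns the reference pause duration."""
    if method == "mean":
        return mean(first_pauses)
    if method == "mode":
        # The value with the highest number of occurrences.
        return Counter(first_pauses).most_common(1)[0][0]
    if method == "weighted":
        return sum(w * d for w, d in zip(weights, first_pauses)) / sum(weights)
    raise ValueError(f"unknown method: {method}")
```

In practice the mode may need binning, since raw float durations rarely repeat exactly; that detail is not specified by the claim.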
8. The speech evaluation method according to any one of claims 1 to 7, wherein the step of obtaining the evaluation pause duration for each prosody level of the audio to be evaluated according to the audio to be evaluated and the reference text comprises:
acquiring each prosody level of the reference text and a pause position of each prosody level;
and determining the pause duration at the pause position of each prosody level according to the audio to be evaluated to obtain each evaluation pause duration.
9. The speech evaluation method according to claim 8, wherein the step of obtaining each prosody level of the reference text and the pause position of each prosody level comprises:
when it is determined that the prosody levels of the reference text and the pause position of each prosody level have not yet been obtained, performing prosody level division on the reference text to obtain each prosody level of the reference text and the pause position of each prosody level.
10. The speech evaluation method according to claim 1, wherein the step of obtaining the level pause evaluation result for each prosody level according to the evaluation pause duration corresponding to the same prosody level and the predetermined reference pause duration comprises:
obtaining the maximum value and the minimum value of the evaluation pause duration and the reference pause duration, and taking the ratio of the minimum value to the maximum value as the level pause evaluation result;
or obtaining the absolute value of the difference between the evaluation pause duration and the reference pause duration, obtaining the maximum value of the evaluation pause duration and the reference pause duration, and taking the ratio of the absolute value to the maximum value as the level pause evaluation result.
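The two scoring variants of claim 10 can be written directly (function names are illustrative):

```python
def level_score_ratio(eval_d, ref_d):
    # Variant 1: min / max ratio; 1.0 means the durations match exactly.
    return min(eval_d, ref_d) / max(eval_d, ref_d)

def level_score_diff(eval_d, ref_d):
    # Variant 2: |difference| / max; 0.0 means the durations match exactly.
    return abs(eval_d - ref_d) / max(eval_d, ref_d)
```

Note that the two variants are complementary: for positive durations their results always sum to 1, so one measures similarity and the other measures deviation on the same scale.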
11. The speech evaluation method according to claim 1, wherein the step of obtaining the audio prosody evaluation result of the audio to be evaluated according to the level pause evaluation result of each prosody level comprises:
averaging the level pause evaluation results of the same prosody level of the audio to be evaluated to obtain average scores of the different prosody levels of the audio to be evaluated;
obtaining weights of the different prosody levels;
and obtaining the audio prosody evaluation result of the audio to be evaluated according to the average scores of the different prosody levels of the audio to be evaluated, the weights of the different prosody levels, and the number of categories of the different prosody levels.
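Claim 11 does not pin down the exact combining formula; one plausible reading is a weighted mean of the per-level averages, normalized over the levels actually present. The sketch below follows that reading (names hypothetical):

```python
from statistics import mean

def audio_prosody_score(level_scores, level_weights):
    """level_scores: prosody level -> list of level pause evaluation
    results for the audio. level_weights: prosody level -> weight.
    Returns the weighted mean of the per-level average scores."""
    averages = {lvl: mean(scores) for lvl, scores in level_scores.items()}
    total_w = sum(level_weights[lvl] for lvl in averages)
    return sum(level_weights[lvl] * avg for lvl, avg in averages.items()) / total_w
```

Normalizing by the weights of the levels present keeps the score comparable when, for example, a short utterance contains no intonation-phrase pauses at all.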
12. A speech evaluation apparatus, comprising:
an audio acquisition unit, adapted to acquire audio to be evaluated and a reference text corresponding to the audio to be evaluated;
an evaluation pause duration acquisition unit, adapted to acquire an evaluation pause duration for each prosody level of the audio to be evaluated according to the audio to be evaluated and the reference text;
a level pause evaluation result acquisition unit, adapted to acquire a level pause evaluation result for each prosody level according to the evaluation pause duration corresponding to the same prosody level and a predetermined reference pause duration;
an audio prosody evaluation result acquisition unit, adapted to acquire an audio prosody evaluation result of the audio to be evaluated according to the level pause evaluation result of each prosody level;
a reference audio acquisition unit, adapted to acquire reference audios and the reference texts respectively corresponding to the reference audios;
a first pause duration acquisition unit, adapted to obtain first pause durations for the different prosody levels of each reference audio according to the mutually corresponding reference audios and reference texts;
and a reference pause duration acquisition unit, adapted to acquire, for each prosody level, the reference pause duration corresponding to that prosody level according to the first pause durations of the same prosody level, so as to obtain the reference pause durations of the different prosody levels, wherein the reference pause duration of a prosody level comprises the average value, the most frequently occurring value, or a weighted average value of the first pause durations of the same prosody level.
13. The speech evaluation apparatus according to claim 12, further comprising:
a text processing unit, adapted to acquire each prosody level of the reference text and the pause position of each prosody level; determine the pause duration at the pause position of each prosody level according to the audio to be evaluated to obtain each evaluation pause duration; and, when it is determined that the prosody levels of the reference text and the pause position of each prosody level have not yet been obtained, perform prosody level division on the reference text to obtain each prosody level of the reference text and the pause position of each prosody level.
14. An electronic device, comprising:
a processor, and a memory storing a program, wherein the program comprises instructions that, when executed by the processor, cause the processor to perform the method of any of claims 1-11.
15. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-11.
CN202110882055.XA 2021-08-02 2021-08-02 Voice evaluation method, device, equipment and storage medium Active CN113327615B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110882055.XA CN113327615B (en) 2021-08-02 2021-08-02 Voice evaluation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113327615A (en) 2021-08-31
CN113327615B (en) 2021-11-16

Family

ID=77426764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110882055.XA Active CN113327615B (en) 2021-08-02 2021-08-02 Voice evaluation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113327615B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095672A (en) * 2023-07-12 2023-11-21 支付宝(杭州)信息技术有限公司 Digital human lip shape generation method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102237081B (en) * 2010-04-30 2013-04-24 国际商业机器公司 Method and system for estimating rhythm of voice
TWI573129B (en) * 2013-02-05 2017-03-01 國立交通大學 Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing
CN104361895B (en) * 2014-12-04 2018-12-18 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark
CN111508522A (en) * 2019-01-30 2020-08-07 沪江教育科技(上海)股份有限公司 Statement analysis processing method and system

Also Published As

Publication number Publication date
CN113327615A (en) 2021-08-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant