CN110797049A

CN110797049A - Voice evaluation method and related device

Info

Publication number: CN110797049A
Application number: CN201910987884.7A
Authority: CN
Inventors: 杨康; 吴奎; 朱群; 江勇军; 宋雪洁
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2019-10-17
Filing date: 2019-10-17
Publication date: 2020-02-14
Anticipated expiration: 2039-10-17
Also published as: CN110797049B

Abstract

The embodiment of the application discloses a voice evaluation method and a related device. Wherein the method comprises the following steps: acquiring target voice obtained by reading a target text by a user, and acquiring reference voice of the target text; determining an evaluation score of the target voice according to the reference voice and a preset score tolerance, wherein the score tolerance of a pronunciation confusion event of at least one voice unit in the target voice is not zero, and the pronunciation confusion event refers to an event that one voice unit is confused into other voice units; and outputting the evaluation score. Therefore, by implementing the technical scheme provided by the application, the flexibility and the compatibility of the electronic equipment for voice evaluation are improved.

Description

Voice evaluation method and related device

Technical Field

The application relates to the technical field of electronic equipment, in particular to a voice evaluation method and a related device.

Background

Because the inherent pronunciation habits of people in different countries and regions and the dialects of all regions cause the existence of the specific confusable pronunciation units, people in specific regions or crowds in speech communication are difficult to subjectively feel the differences among the pronunciation units which are difficult to distinguish, and meanwhile, normal speech communication is not influenced, and the confusable pronunciation units can be considered to be correct pronunciation to a certain extent.

However, in practical application, the conventional speech evaluation technique does not consider the influence of confusable pronunciation units on the evaluation result. The traditional voice evaluating system uses the same set of scoring standard for all regions and people and does not have evaluating functions with different scale requirements, so that the problems that the evaluating result of the traditional voice evaluating system is not consistent with the subjective feeling of a user and the traditional voice evaluating system cannot adapt to different regions, different people and different evaluating targets frequently occur.

Disclosure of Invention

The embodiment of the application provides a voice evaluation method and a related device, aiming to improve the flexibility and compatibility of voice evaluation of equipment.

In a first aspect, an embodiment of the present application provides a speech evaluation method, including:

acquiring target voice obtained by reading a target text by a user, and acquiring reference voice of the target text;

determining an evaluation score of the target voice according to the reference voice and a preset score tolerance, wherein the score tolerance of a pronunciation confusion event of at least one voice unit in the target voice is not zero, and the pronunciation confusion event refers to an event that one voice unit is confused into other voice units;

and outputting the evaluation score.

In a second aspect, an embodiment of the present application provides a speech evaluation apparatus, including:

the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring target voice obtained by reading a target text by a user and acquiring reference voice of the target text;

the determining unit is used for determining the evaluation score of the target voice according to the reference voice and a preset score tolerance, wherein the score tolerance of a pronunciation confusion event of at least one voice unit in the target voice is not zero, and the pronunciation confusion event refers to an event that one voice unit is confused into other voice units;

and the output unit is used for outputting the evaluation score.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing steps in any method of the first aspect of the embodiment of the present application.

In a fourth aspect, the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program makes a computer perform part or all of the steps described in any one of the methods of the first aspect of the present application.

In a fifth aspect, the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps as described in any one of the methods of the first aspect of the embodiments of the present application. The computer program product may be a software installation package.

Therefore, in the embodiment of the application, the electronic device can determine the evaluation score of the target voice to be tested according to the reference voice of the reading text and the preset score tolerance of different regions and different people, so that the problems that the evaluation result and the subjective feeling of a user are not consistent due to the fact that the same set of score standard is used and the score standard cannot be freely customized according to different regions and different people can be effectively avoided, and the flexibility and the compatibility of the electronic device for voice evaluation are improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram of an architecture of a speech evaluation system according to an embodiment of the present application;

FIG. 2 is a schematic flow chart diagram of a speech evaluation method according to an embodiment of the present application;

fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

fig. 4 is a block diagram of functional units of a speech evaluation apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, fig. 1 is a schematic diagram of a speech evaluating system 100, where the speech evaluating system 100 includes a speech acquiring device 101 and a speech processing device 102, the speech acquiring device 101 is connected to the speech processing device 102, the speech acquiring device 101 is configured to acquire speech data and send the speech data to the speech processing device 102 for processing, the speech processing device 102 is configured to process the speech data and output a processing result, and the speech evaluating system 100 may include an integrated single device or multiple devices, and for convenience of description, the speech evaluating system 100 is referred to as an electronic device in the present application. It will be apparent that the electronic device may include various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem having wireless communication capabilities, as well as various forms of User Equipment (UE), Mobile Stations (MS), terminal Equipment (terminal device), and the like.

In practical application, the traditional speech evaluation technology does not consider the influence of confusable pronunciation units on evaluation results. The traditional voice evaluating system uses the same set of scoring standard for all regions and people and does not have evaluating functions with different scale requirements, so that the problems that the evaluating result of the traditional voice evaluating system is not consistent with the subjective feeling of a user and the traditional voice evaluating system cannot adapt to different regions, different people and different evaluating targets frequently occur.

Based on this, the embodiment of the present application provides a speech evaluation method to solve the above problem, and the following describes the embodiment of the present application in detail.

Referring to fig. 2, fig. 2 is a schematic flowchart of a speech evaluation method provided in an embodiment of the present application, and is applied to the electronic device shown in fig. 1, where as shown in fig. 2, the speech evaluation method includes:

s201, the electronic equipment acquires target voice obtained by reading the target text aloud by the user and acquires reference voice of the target text.

The electronic device obtains a target voice obtained by reading a target text aloud by a user through the voice obtaining device, for example, the voice obtaining device is a sound collecting device, and when the user reads the target text aloud towards the sound electronic device, the sound collecting device collects the voice of the user when reading aloud, so as to obtain the target voice; or the external equipment guides the collected voice into the electronic equipment through an interface, and the voice acquisition device acquires the target voice through the interface; or may be otherwise, and is not limited solely thereto.

The reference voice of the target text may be pre-stored in the electronic device, or may be acquired by the voice acquiring apparatus, which is not limited herein.

In addition, it should be noted that the reference speech of the target text refers to a standard speech obtained by a professional reading the target text or other text including all speech units in the target text, and the pronunciation of each speech unit in the reference speech is correct.

S202, the electronic equipment determines an evaluation score of the target voice according to the reference voice and a preset score tolerance, wherein the score tolerance of a pronunciation confusion event of at least one voice unit in the target voice is not zero, and the pronunciation confusion event refers to an event that one voice unit is confused into other voice units.

Wherein the phonetic units comprise phonemes, syllables, words, etc.

Wherein, the determining the evaluation score of the target voice according to the reference voice and the preset score tolerance comprises: performing voice unit boundary segmentation on the target voice to obtain a segmentation boundary of each first voice unit in the target voice; constructing a weight coefficient matrix according to the voice unit confusion matrix and the grading tolerance, wherein the weight coefficient matrix comprises weight coefficients ij when the ith second voice unit in the preset voice unit set is confused into the jth second voice unit, and i and j are positive integers; determining pronunciation accuracy of each first voice unit according to the each first voice unit, the reference voice and the weight coefficient matrix; and determining the evaluation score of the target voice according to the pronunciation accuracy of each first voice unit.

The numerical value of the weight coefficient ij and the ith second speech unit are confused that the semantic accuracy ij of the jth second speech unit has a first corresponding relationship, the numerical value of the weight coefficient ij and the pronunciation accuracy ij of the ith second speech unit pronounce as the jth second speech unit have a second corresponding relationship, and the first corresponding relationship and the second corresponding relationship obtain the following corresponding relationship through the numerical value of the weight coefficient ij: the pronunciation accuracy ij is in direct proportion to the semantic accuracy ij.

Wherein, the weight coefficient matrix may be a phoneme weight coefficient matrix, a syllable weight coefficient matrix, a word weight coefficient matrix, etc.

Wherein the preset speech unit set may be a phoneme replacement list, a syllable replacement list, a word replacement list, etc.

Wherein each element ij in the speech unit confusion matrix represents the probability that the ith second speech unit is confused into the jth second speech unit, and the speech unit confusion matrix can be a phoneme confusion matrix, a syllable confusion matrix, a word confusion matrix, or the like.

The scoring tolerance refers to a tolerance set by human subjective feeling when a certain voice unit is mixed up into other voice units. The scoring tolerance can be assumed to be n grades, different voice units correspond to the scoring tolerances of different grades and are respectively 0-n-1 grades, when the scoring tolerance is set to be 0, the traditional assessment scheme is adopted, and the scoring tolerance is increased along with the increase of the numerical value from 1-n-1. In some cases, when a target speech unit is pronounced into a confusable speech unit of the target speech unit, the pronunciation is considered correct, i.e., the distinction between the confusable speech units is artificially blurred. For example, the phoneme in English reading [ I ]]And [ i:]、

and [ e]、[e]And [ e_I]、

And

etc. are confusable phonemes and the pronunciation tolerance of these phonemes can be set to n.

Wherein each element in the phonetic unit confusion matrix corresponds to one score tolerance, and the construction of the weight coefficient matrix according to the phonetic unit confusion matrix and the score tolerance comprises: for each element in the phonetic unit confusion matrix, performing the following processing operations to obtain the weight coefficient matrix: obtaining the score tolerance corresponding to the currently processed element; and determining the weight coefficient of the currently processed element according to the currently processed element and the score tolerance corresponding to the currently processed element.

For example, assuming that the phonetic unit confusion matrix is a phoneme confusion matrix PHCM (phone fusion matrix), a weighting coefficient for confusing each phoneme in the phoneme confusion matrix into other phonemes can be calculated according to the phoneme confusion matrix PHCM and a score tolerance set in advance manually, so as to construct a phoneme weighting coefficient matrix. Wherein, the weighting coefficient for mixing the ith phoneme into the jth phoneme can be defined as

The calculation formula is as follows:

wherein, PHCM_ijRepresenting the probability of the ith phoneme being confused into the jth phoneme in the phoneme confusion matrix; PHCM_ij∈[0,1]Therefore, as the score tolerance n increases, the confusing phoneme has smaller weight coefficients.

Performing voice unit boundary segmentation on the target voice to obtain a segmentation boundary of each first voice unit in the target voice, including: building a decoding network according to the target text, the dictionary and the acoustic model; extracting acoustic features of the target voice; forward calculation is carried out according to the acoustic characteristics to obtain a frame state score of each frame of voice in the target voice; and after the target text and the target voice are aligned in the decoding network by the acoustic model according to the frame state score, carrying out voice unit boundary segmentation to obtain the segmentation boundary of each first voice unit in the target voice.

And S203, the electronic equipment outputs the evaluation score.

The electronic device processes the acquired target voice to obtain an evaluation score of the target voice, and then outputs the evaluation score to the user. For example, the electronic device may output in a form of voice broadcasting, or may output in a form of screen display, which is not limited herein.

In a possible example of the present application, prior to constructing a weight coefficient matrix from a speech unit confusion matrix and the score tolerance, the method further comprises constructing the speech unit confusion matrix.

In a specific implementation, the constructing the confusion matrix of the speech unit may include the following two cases:

first, the construction of a statistics-based phonetic unit confusion matrix.

Wherein the construction of the voice unit confusion matrix based on statistics comprises: acquiring a first preset number of segments of historical voices obtained by reading historical texts of different historical users, and acquiring reference voices of the historical texts; performing voice unit boundary segmentation on the first preset number of segments of historical voice to obtain a segmentation boundary of each third voice unit in the first preset number of segments of historical voice; calculating a third likelihood of each third voice unit in the segmentation boundary and each second voice unit to obtain a plurality of third likelihoods corresponding to each third voice unit; arranging a plurality of third likelihood degrees corresponding to each third voice unit according to a descending order, and selecting a maximum third likelihood degree corresponding to each third voice unit; comparing the second voice unit corresponding to the maximum third likelihood corresponding to each third voice unit with the reference voice unit corresponding to each third voice unit in the reference voice of the historical text; and constructing the voice unit confusion matrix according to the comparison result.

Wherein, the constructing the voice unit confusion matrix according to the comparison result comprises: counting the result of the failed comparison of the second voice unit corresponding to each maximum third likelihood to obtain the times of wrongly dividing each second voice unit into other second voice units; constructing a first matrix according to the times that all the second voice units and each second voice unit are wrongly divided into other second voice units; and summing the elements of each row of the first matrix to obtain the frequency of the second voice unit corresponding to the row appearing in the statistics, and dividing each element of the first matrix by the frequency of the second voice unit corresponding to the row of the first matrix appearing in the statistics to obtain the voice unit confusion matrix.

The method comprises the steps of constructing a phoneme confusion matrix, for example, obtaining the reading data of the reading history text on the basis of each region in large batch as a training set, obtaining the segmentation boundary of each phoneme in the reading data of the history text after forced segmentation, calculating the likelihood of each phoneme in the segmentation boundary and each phoneme in a preset phoneme set, obtaining the maximum likelihood of each phoneme in the boundary according to descending order of the likelihood, comparing the phoneme corresponding to the maximum likelihood with the corresponding reference phoneme in the history text, obtaining the number of other phonemes which are mistakenly divided by each phoneme in the training set, and converting the number of the phonemes which are confused into other phonemes. Assuming that the number of pronunciation basic units is N, an N × N phoneme confusion matrix PHCM (phonemeeconfusion matrix) is obtained according to statistics_ijDefined as the ith row and jth column of the matrix, indicates the probability of confusing the ith phoneme to the jth phoneme.

The following specifically exemplifies the above statistical-based construction process by taking an example of constructing a phoneme confusion matrix by english evaluation, where there are 48 phonemes in english, and a 48 × 48 matrix COUNT is defined according to the 48 phonemes.

The first step is as follows: a batch of manually labeled reading speech training sets of specific words are obtained according to different regions and crowds, obtaining a segmentation boundary corresponding to each read-aloud phoneme in the read-aloud speech based on the forced segmentation of the acoustic model on the training set, calculating the likelihoods of the current speakable phoneme and all the alternative list phonemes in the alternative list within the segmentation boundary, arranging the likelihoods of the current speakable phoneme and the alternative list phonemes in a descending order, obtaining the alternative list phoneme with the maximum likelihood (TOP1) in the calculation result, namely considering that the alternative list phoneme with the maximum likelihood is most similar to the speakable phoneme, comparing the alternative list phoneme with the maximum likelihood corresponding to the current speakable phoneme and the text phoneme corresponding to the current speakable phoneme, if the comparison result is different, counting and adding 1 at the position corresponding to the phoneme of the replacement list with the maximum likelihood in the matrix COUNT;

the second step is that: statistics of all phonemes in the training set is completed, and a matrix of 48 × 48, i.e., COUNT, is obtained. Wherein, COUNT_ijRepresenting the number of times the ith phoneme in the matrix is misinterpreted as the jth phoneme. Counting the COUNT according to lines to obtain the total number of occurrences of each phoneme in the training set, for example, counting the total number of occurrences of the ith phoneme as COUNT_iThen, then

The probability of confusing the ith phoneme to other phonemes is: PHCM_ij＝COUNT_i/_jCount_iTo obtain a 48 × 48 phoneme confusion matrix PHCM in english.

And secondly, constructing a voice unit confusion matrix based on the output of the acoustic feature hidden layer.

The construction of the speech unit confusion matrix based on the acoustic feature hidden layer output comprises the following steps: acquiring a second preset number of segments of historical voices obtained by reading historical texts of different historical users, and acquiring reference voices of the historical texts; performing voice unit boundary segmentation on the second preset number of segments of historical voice to obtain a segmentation boundary of each fourth voice unit in the second preset number of segments of historical voice; performing forward calculation on each fourth voice unit according to a pre-established acoustic model, and outputting acoustic hidden layer characteristics of each fourth voice unit in a segmentation boundary of the fourth voice unit; and constructing the voice unit confusion matrix according to the acoustical hidden layer characteristics of each fourth voice unit in the segmentation boundary thereof.

Wherein the constructing the phonetic unit confusion matrix according to the acoustical hidden layer characteristics of each fourth phonetic unit in the segmentation boundary thereof comprises: performing weighted average or Attention mechanism operation on all the acoustic hidden layer characteristics of each fourth voice unit to obtain a hidden layer output vector corresponding to each fourth voice unit; constructing a hidden layer output matrix according to each fourth voice unit and the corresponding hidden layer output vector thereof; calculating the Euclidean distance between any two fourth voice units in the hidden layer output matrix to obtain a voice unit distance matrix; and carrying out normalization operation on each row in the voice unit distance matrix to obtain the voice unit confusion matrix.

Wherein the normalization operation may comprise a SoftMax operation.

The following also takes english evaluation to construct a phoneme confusion matrix as an example, and specifically exemplifies the construction process of the acoustic feature hidden layer output, where there are 48 phonemes in english.

The first step is as follows: acquiring a speakable speech training set of a batch of manually labeled specific words according to different regions and crowds, obtaining a segmentation boundary corresponding to each speakable phoneme in the speakable speech on the basis of forced segmentation of an acoustic model on the training set, and outputting acoustic hidden layer characteristics of all frames of each speakable phoneme in the corresponding boundary through forward calculation of the acoustic model;

the second step is that: for all the hidden layer output results output by each reading phoneme, assuming that the number of hidden layer output nodes is M, and outputting all the hidden layer output of each reading phoneme through a weighted average or an Attention mechanism, so that each reading phoneme correspondingly obtains a one-dimensional hidden layer output vector (namely a vector with the size of 1 multiplied by M), 48 phonemes obtain 48 one-dimensional hidden layer output vectors, and a 48 multiplied by M hidden layer output matrix is constructed according to the 48 one-dimensional hidden layer output vectors;

the third step: by one-dimensional concealment of each speakable phonemeCalculating the Euclidean distance between any two reading phonemes by using the layer output vector, preliminarily obtaining a phoneme distance matrix Dist of 48 multiplied by 48, and converting the Euclidean distance between the phonemes to [0,1 ] after performing SoftMax operation on each row in the phoneme distance matrix Dist]In the spatial range of (2), a phoneme confusion matrix PHCM (phone fused matrix) is obtained_ijRepresenting the probability of the ith phoneme being confused as the jth phoneme in the matrix.

It can be seen that, in this example, the electronic device obtains the reading voices of the specific words from different regions and people as a training set to construct the voice unit confusion matrix, so that free configuration and free customization of the evaluation standards according to different regions and people can be supported, and the flexibility and compatibility of voice evaluation performed by the electronic device can be improved.

In a possible example of the present application, Pronunciation accuracy of each phonetic unit in the speakable text may be evaluated by a GOP measurement method (Goodness of pronunciations, a method for measuring accuracy of speakable phonetic units), and a GOP of each speakable phonetic unit is calculated by using a preset phonetic unit set within a segmentation boundary of the speakable phonetic units, so as to obtain an evaluation feature of how well each phonetic unit speaks in the speakable text. Among them, the conventional GOP calculation method is shown in formulas (2) and (3):

where o denotes the acoustic MFCC feature or FTBK feature of the speakable speech unit, ph_iRepresenting a voice unit, and obtaining a formula (3) if the voice unit prior is ignored;

wherein p (o | ph)_i) Indicating the current reading speech unit ph_iLikelihood of a textual phonetic Unit, p (o | ph)_k) Indicating the current reading speech unit ph_iLikelihood of each phonetic unit in the preset phonetic unit set.

As can be seen from the above equations (2) and (3), the GOP metric method is a conditional probability that measures the voice unit ph corresponding to the user voice in the case that the user voice is detected_iThe probability of (c). Wherein, the higher the probability, the more accurate the pronunciation is; the lower this probability, the worse the pronunciation.

The method improves the traditional GOP measurement method, and carries out GOP score calculation of the reading voice unit based on the weight coefficient matrix.

Wherein the determining pronunciation accuracy of each first speech unit according to the each first speech unit, the reference speech and the weight coefficient matrix comprises: and calculating the GOP score of each first voice unit within the segmentation boundary thereof according to the each first voice unit, the reference voice and the weight coefficient matrix.

Wherein the determining an evaluation score of the target speech according to the pronunciation accuracy of each first speech unit comprises: and inputting the calculated GOP score of each first phoneme in the segmentation boundary into a pre-established score mapping model to obtain the evaluation score of the target voice.

Wherein said calculating a GOP score of each first speech unit within its slicing boundary according to said each first speech unit, said reference speech and said weight coefficient matrix comprises: calculating a first likelihood of each first voice unit in a segmentation boundary thereof and a corresponding reference voice unit in the target text, and calculating a second likelihood of each first voice unit in the segmentation boundary thereof and each second voice unit to obtain a plurality of second likelihoods corresponding to each first voice unit; and calculating the GOP score of each first voice unit in the segmentation boundary thereof according to the first likelihood, the plurality of second likelihoods and the weight coefficient matrix.

In a specific implementation, the calculating the GOP score of each first speech unit within the segmentation boundary thereof according to the first likelihood, the plurality of second likelihoods, and the weight coefficient matrix may include the following two cases:

first, GOP score calculation based on denominator adjustment.

Compared with the traditional GOP score calculation, the GOP score calculation based on the denominator adjustment method increases the proportion of the target reading voice unit in the denominator calculation in the GOP calculation formula, and reduces the weight of other voice units in the preset voice unit, so that the GOP score calculation formula based on the denominator adjustment method is as follows:

wherein p (o | ph)_i) A first likelihood, p (o | ph), representing a first phonetic unit and a corresponding reference phonetic unit in the target text within the segmentation boundary_k) Representing a second likelihood for a certain said first phonetic unit and each said second phonetic unit,

and the weighting coefficients represent that a second voice unit corresponding to a certain first voice unit is mixed up to other second voice units, and i and k are positive integers.

Second, GOP score calculation based on molecular tuning.

Compared with the traditional GOP score calculation, the GOP score calculation based on the molecular adjustment method blurs the similarity degree of the target reading speech unit and the confusable speech unit according to the score tolerance, in the molecules of the GOP calculation, the likelihood of the target reading speech unit and the confusable speech unit is multiplied by corresponding weight coefficients, and each weight coefficient is related to the magnitude ordering of the likelihood, namely, the target reading speech unit is considered to be read into the confusable speech unit correctly on the basis of a given certain penalty, and the GOP score calculation formula based on the molecular adjustment method is as follows:

wherein p (o | ph)_i) Representing a certain said first languageFirst likelihood, p (o | ph), of a phonetic unit within its segmentation boundary and a corresponding reference phonetic unit in the target text_k) A second likelihood, p (o | ph), representing a certain one of said first phonetic units and each of said second phonetic units_Top1) A plurality of second likelihoods representing a correspondence of a certain first speech unit are arranged in a first second likelihood, p (o | ph), from large to small according to likelihood values_topN) A second likelihood representing that a plurality of second likelihoods corresponding to a certain first speech unit are arranged in the Nth order from the largest to the smallest in the likelihood values,

a weighting factor indicating that a second speech unit corresponding to a certain first speech unit is confused with a second speech unit corresponding to the maximum second likelihood corresponding to the first speech unit,

and the weighting coefficient represents that a second speech unit corresponding to a certain first speech unit is mixed into a second speech unit corresponding to the Nth maximum second likelihood corresponding to the first speech unit, and i, k and N are positive integers.

It can be seen that, in this example, when the electronic device calculates the pronunciation accuracy of the voice unit by using GOP scoring, the scoring tolerance of mutual confusion between the voice units is manually set, and the similarity between the target reading voice unit and the confusable voice unit is blurred according to the scoring tolerance, so that the problem of loose scoring can be solved, and the flexibility and compatibility of the electronic device for voice evaluation can be improved.

Referring to fig. 3 in accordance with the embodiment shown in fig. 2, fig. 3 is a schematic structural diagram of an electronic device 300 according to an embodiment of the present application, and as shown in fig. 3, the electronic device 300 includes an application processor 310, a memory 320, a communication interface 330, and one or more programs 321, where the one or more programs 321 are stored in the memory 320 and configured to be executed by the application processor 310, and the one or more programs 321 include instructions for:

acquiring target voice obtained by reading a target text by a user, and acquiring reference voice of the target text; determining an evaluation score of the target voice according to the reference voice and a preset score tolerance, wherein the score tolerance of a pronunciation confusion event of at least one voice unit in the target voice is not zero, and the pronunciation confusion event refers to an event that one voice unit is confused into other voice units; and outputting the evaluation score.

In one possible example, in the aspect of determining the evaluation score of the target speech according to the reference speech and the preset score tolerance, the instructions in the program are specifically configured to perform the following operations: performing voice unit boundary segmentation on the target voice to obtain a segmentation boundary of each first voice unit in the target voice; constructing a weight coefficient matrix according to the voice unit confusion matrix and the grading tolerance, wherein the weight coefficient matrix comprises weight coefficients ij when the ith second voice unit in the preset voice unit set is confused into the jth second voice unit, and i and j are positive integers; determining pronunciation accuracy of each first voice unit according to the each first voice unit, the reference voice and the weight coefficient matrix; and determining the evaluation score of the target voice according to the pronunciation accuracy of each first voice unit.

In one possible example, the determining the pronunciation accuracy aspect of each first speech unit from the each first speech unit, the reference speech and the weight coefficient matrix, the instructions in the program are specifically configured to: and calculating the GOP score of each first voice unit within the segmentation boundary thereof according to the each first voice unit, the reference voice and the weight coefficient matrix.

In one possible example, the instructions in the program are specifically configured to perform the following operations in calculating a GOP score of each first speech unit within its segmentation boundary based on the each first speech unit, the reference speech and the weight coefficient matrix: calculating a first likelihood of each first voice unit in a segmentation boundary thereof and a corresponding reference voice unit in the target text, and calculating a second likelihood of each first voice unit in the segmentation boundary thereof and each second voice unit to obtain a plurality of second likelihoods corresponding to each first voice unit;

and calculating the GOP score of each first voice unit in the segmentation boundary thereof according to the first likelihood, the plurality of second likelihoods and the weight coefficient matrix.

In one possible example, the instructions in the program are specifically configured to perform the following operations in constructing a weight coefficient matrix from a phonetic unit confusion matrix and the score tolerance: acquiring a first preset number of segments of historical voices obtained by reading historical texts of different historical users, and acquiring reference voices of the historical texts; performing voice unit boundary segmentation on the first preset number of segments of historical voice to obtain a segmentation boundary of each third voice unit in the first preset number of segments of historical voice; calculating a third likelihood of each third voice unit in the segmentation boundary and each second voice unit to obtain a plurality of third likelihoods corresponding to each third voice unit; arranging a plurality of third likelihood degrees corresponding to each third voice unit according to a descending order, and selecting a maximum third likelihood degree corresponding to each third voice unit; comparing the second voice unit corresponding to the maximum third likelihood corresponding to each third voice unit with the reference voice unit corresponding to each third voice unit in the reference voice of the historical text; and constructing the voice unit confusion matrix according to the comparison result.

In a possible example, in the aspect of constructing the phonetic unit confusion matrix according to the comparison result, the instructions in the program are specifically configured to perform the following operations: counting the result of the failed comparison of the second voice unit corresponding to each maximum third likelihood to obtain the times of wrongly dividing each second voice unit into other second voice units; constructing a first matrix according to the times that all the second voice units and each second voice unit are wrongly divided into other second voice units; and summing the elements of each row of the first matrix to obtain the frequency of the second voice unit corresponding to the row appearing in the statistics, and dividing each element of the first matrix by the frequency of the second voice unit corresponding to the row of the first matrix appearing in the statistics to obtain the voice unit confusion matrix.

In one possible example, the instructions in the program are specifically configured to perform the following operations in constructing a weight coefficient matrix from a phonetic unit confusion matrix and the score tolerance: acquiring a second preset number of segments of historical voices obtained by reading historical texts of different historical users, and acquiring reference voices of the historical texts; performing voice unit boundary segmentation on the second preset number of segments of historical voice to obtain a segmentation boundary of each fourth voice unit in the second preset number of segments of historical voice; performing forward calculation on each fourth voice unit according to a pre-established acoustic model, and outputting acoustic hidden layer characteristics of each fourth voice unit in a segmentation boundary of the fourth voice unit; and constructing the voice unit confusion matrix according to the acoustical hidden layer characteristics of each fourth voice unit in the segmentation boundary thereof.

In one possible example, each element in the phonetic unit confusion matrix corresponds to one of the score tolerances, and in constructing the weight coefficient matrix from the phonetic unit confusion matrix and the score tolerances, the instructions in the program are specifically configured to: for each element in the phonetic unit confusion matrix, performing the following processing operations to obtain the weight coefficient matrix: obtaining the score tolerance corresponding to the currently processed element; and determining the weight coefficient of the currently processed element according to the currently processed element and the score tolerance corresponding to the currently processed element.

It should be noted that the electronic device described in this embodiment may perform all the steps of the method described in the method embodiment.

The above description has introduced the solution of the embodiment of the present application mainly from the perspective of the method-side implementation process. It is understood that the electronic device comprises corresponding hardware structures and/or software modules for performing the respective functions in order to realize the above-mentioned functions. Those of skill in the art will readily appreciate that the present application is capable of hardware or a combination of hardware and computer software implementing the various illustrative elements and algorithm steps described in connection with the embodiments provided herein. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.

Referring to fig. 4, fig. 4 is a block diagram of functional units of a speech evaluation device 400 according to an embodiment of the present application. The voice evaluating apparatus 400 is applied to an electronic device, and the voice evaluating apparatus 400 includes:

an obtaining unit 401, configured to obtain a target voice obtained by reading a target text aloud by a user, and obtain a reference voice of the target text;

a determining unit 402, configured to determine an evaluation score of the target speech according to the reference speech and a preset score tolerance, where the score tolerance of a pronunciation confusion event of at least one speech unit in the target speech is not zero, and the pronunciation confusion event refers to an event that a speech unit is confused into other speech units;

an output unit 403, configured to output the evaluation score.

The voice evaluating apparatus 400 may further include a storage unit and a communication unit; the storage unit is used for storing program codes and data of the electronic equipment; the storage unit may be a memory and the communication unit may be an internal communication interface.

In one possible example, the determining unit 402 determines an evaluation score aspect of the target speech according to the reference speech and a preset score tolerance, and includes:

the segmentation module is used for carrying out voice unit boundary segmentation on the target voice to obtain a segmentation boundary of each first voice unit in the target voice;

a building module, configured to build a weight coefficient matrix according to the voice unit confusion matrix and the score tolerance, where the weight coefficient matrix includes a weight coefficient ij when an ith second voice unit in a preset voice unit set is confused as a jth second voice unit, and i and j are positive integers;

a first determining module, configured to determine pronunciation accuracy of each first speech unit according to the each first speech unit, the reference speech and the weight coefficient matrix;

and the second determining module is used for determining the evaluation score of the target voice according to the pronunciation accuracy of each first voice unit.

In one possible example, the determining the pronunciation accuracy aspect of each first speech unit according to the each first speech unit, the reference speech and the weight coefficient matrix, the first determining module is specifically configured to: and calculating the GOP score of each first voice unit within the segmentation boundary thereof according to the each first voice unit, the reference voice and the weight coefficient matrix.

In one possible example, said calculating a GOP score of said each first speech unit within its slicing boundary based on said each first speech unit, said reference speech and said weight coefficient matrix, said first determining module comprises:

the first calculation submodule is used for calculating a first likelihood of each first voice unit in the segmentation boundary of the first voice unit and a corresponding reference voice unit in the target text, and calculating a second likelihood of each first voice unit in the segmentation boundary of the first voice unit and each second voice unit to obtain a plurality of second likelihoods corresponding to each first voice unit;

and the second calculating submodule is used for calculating the GOP score of each first voice unit in the segmentation boundary of the first voice unit according to the first likelihood, the plurality of second likelihoods and the weight coefficient matrix.

In one possible example, in constructing the weight coefficient matrix according to the phonetic unit confusion matrix and the score tolerance, the determining unit 402 further comprises:

the acquisition module is used for acquiring a first preset number of segments of historical voices obtained by reading historical texts by different historical users and acquiring reference voices of the historical texts;

the segmentation module is further configured to perform speech unit boundary segmentation on the first preset number of segments of historical speech to obtain a segmentation boundary of each third speech unit in the first preset number of segments of historical speech;

the calculation module is used for calculating a third likelihood of each third voice unit in the segmentation boundary of the third voice unit and each second voice unit to obtain a plurality of third likelihoods corresponding to each third voice unit;

the selecting module is used for arranging the plurality of third likelihood degrees corresponding to each third voice unit according to a descending order and selecting the maximum third likelihood degree corresponding to each third voice unit;

a comparison module, configured to compare the second speech unit corresponding to the maximum third likelihood corresponding to each third speech unit with the reference speech unit corresponding to each third speech unit in the reference speech of the historical text;

the construction module is further configured to construct the phonetic unit confusion matrix according to the comparison result.

In one possible example, in the aspect of constructing the phonetic unit confusion matrix according to the comparison result, the construction module includes:

the statistic submodule is used for counting the result of the failed comparison of the second voice unit corresponding to each maximum third likelihood to obtain the times of wrongly dividing each second voice unit into other second voice units;

the construction submodule is used for constructing a first matrix according to the times that all the second voice units and each second voice unit are wrongly divided into other second voice units;

the construction submodule is further configured to sum up elements of each row of the first matrix to obtain the number of times that the second speech unit corresponding to the row appears in the statistics, and divide each element of the first matrix by the number of times that the second speech unit corresponding to the row of the first matrix appears in the statistics to obtain the speech unit confusion matrix.

In one possible example, in constructing the weight coefficient matrix according to the phonetic unit confusion matrix and the score tolerance, the determining unit 402 further comprises: :

the acquisition module is further used for acquiring a second preset number of segments of historical voices obtained by reading historical texts by different historical users and acquiring reference voices of the historical texts;

the segmentation module is further configured to perform speech unit boundary segmentation on the second preset number of segments of historical speech to obtain a segmentation boundary of each fourth speech unit in the second preset number of segments of historical speech;

the output module is used for carrying out forward calculation on each fourth voice unit according to a pre-established acoustic model and outputting the acoustic hidden layer characteristics of each fourth voice unit in the segmentation boundary;

the building module is further configured to build the phonetic unit confusion matrix according to the acoustical hidden layer characteristics of each fourth phonetic unit within the segmentation boundary thereof.

In one possible example, each element in the phonetic unit confusion matrix corresponds to one score tolerance, and the constructing module is specifically configured to, in accordance with the phonetic unit confusion matrix and the score tolerance, construct a weight coefficient matrix:

for each element in the phonetic unit confusion matrix, performing the following processing operations to obtain the weight coefficient matrix:

obtaining the score tolerance corresponding to the currently processed element;

and determining the weight coefficient of the currently processed element according to the currently processed element and the score tolerance corresponding to the currently processed element.

It should be noted that the logic unit described in this embodiment may execute the method described in the method embodiment. In addition, it can be understood that, since the method embodiment and the apparatus embodiment are different presentation forms of the same technical concept, the content of the method embodiment portion in the present application should be synchronously adapted to the apparatus embodiment portion, and is not described herein again.

Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.

Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit may be stored in a computer readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-mentioned method of the embodiments of the present application. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method for speech assessment, the method comprising:

and outputting the evaluation score.

2. The method according to claim 1, wherein the determining an evaluation score of the target speech based on the reference speech and a preset score tolerance comprises:

performing voice unit boundary segmentation on the target voice to obtain a segmentation boundary of each first voice unit in the target voice;

constructing a weight coefficient matrix according to the voice unit confusion matrix and the grading tolerance, wherein the weight coefficient matrix comprises weight coefficients ij when the ith second voice unit in the preset voice unit set is confused into the jth second voice unit, and i and j are positive integers;

determining pronunciation accuracy of each first voice unit according to the each first voice unit, the reference voice and the weight coefficient matrix;

and determining the evaluation score of the target voice according to the pronunciation accuracy of each first voice unit.

3. The method according to claim 2, wherein said determining pronunciation accuracy of said each first speech unit from said each first speech unit, said reference speech and said weight coefficient matrix comprises:

and calculating the GOP score of each first voice unit within the segmentation boundary thereof according to the each first voice unit, the reference voice and the weight coefficient matrix.

4. The method of claim 3, wherein said calculating a GOP score for said each first speech unit within its slicing boundaries based on said each first speech unit, said reference speech, and said weighting coefficient matrix comprises:

calculating a first likelihood of each first voice unit in a segmentation boundary thereof and a corresponding reference voice unit in the target text, and calculating a second likelihood of each first voice unit in the segmentation boundary thereof and each second voice unit to obtain a plurality of second likelihoods corresponding to each first voice unit;

5. The method of claim 2, wherein prior to constructing a matrix of weight coefficients from a phonetic unit confusion matrix and the score tolerance, the method further comprises:

acquiring a first preset number of segments of historical voices obtained by reading historical texts of different historical users, and acquiring reference voices of the historical texts;

performing voice unit boundary segmentation on the first preset number of segments of historical voice to obtain a segmentation boundary of each third voice unit in the first preset number of segments of historical voice;

calculating a third likelihood of each third voice unit in the segmentation boundary and each second voice unit to obtain a plurality of third likelihoods corresponding to each third voice unit;

arranging a plurality of third likelihood degrees corresponding to each third voice unit according to a descending order, and selecting a maximum third likelihood degree corresponding to each third voice unit;

comparing the second voice unit corresponding to the maximum third likelihood corresponding to each third voice unit with the reference voice unit corresponding to each third voice unit in the reference voice of the historical text;

and constructing the voice unit confusion matrix according to the comparison result.

6. The method of claim 5, wherein the constructing the phonetic unit confusion matrix according to the comparison result comprises:

counting the result of the failed comparison of the second voice unit corresponding to each maximum third likelihood to obtain the times of wrongly dividing each second voice unit into other second voice units;

constructing a first matrix according to the times that all the second voice units and each second voice unit are wrongly divided into other second voice units;

and summing the elements of each row of the first matrix to obtain the frequency of the second voice unit corresponding to the row appearing in the statistics, and dividing each element of the first matrix by the frequency of the second voice unit corresponding to the row of the first matrix appearing in the statistics to obtain the voice unit confusion matrix.

7. The method of claim 2, wherein prior to constructing a matrix of weight coefficients from a phonetic unit confusion matrix and the score tolerance, the method further comprises:

acquiring a second preset number of segments of historical voices obtained by reading historical texts of different historical users, and acquiring reference voices of the historical texts;

performing voice unit boundary segmentation on the second preset number of segments of historical voice to obtain a segmentation boundary of each fourth voice unit in the second preset number of segments of historical voice;

performing forward calculation on each fourth voice unit according to a pre-established acoustic model, and outputting acoustic hidden layer characteristics of each fourth voice unit in a segmentation boundary of the fourth voice unit;

and constructing the voice unit confusion matrix according to the acoustical hidden layer characteristics of each fourth voice unit in the segmentation boundary thereof.

8. The method of claim 2, wherein each element in the phonetic unit confusion matrix corresponds to one of the score tolerances, and wherein constructing a weight coefficient matrix according to the phonetic unit confusion matrix and the score tolerances comprises:

obtaining the score tolerance corresponding to the currently processed element;

9. A speech evaluation apparatus, the apparatus comprising:

and the output unit is used for outputting the evaluation score.

10. An electronic device comprising a processor, a memory, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-8.

11. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any one of claims 1-8.