CN115148225A - Intonation scoring method, intonation scoring system, computing device and storage medium - Google Patents


Info

Publication number
CN115148225A
CN115148225A
Authority
CN
China
Prior art keywords
fundamental frequency
text
audio
target audio
word
Prior art date
Legal status
Pending
Application number
CN202110338134.4A
Other languages
Chinese (zh)
Inventor
马楠 (Ma Nan)
夏龙 (Xia Long)
高强 (Gao Qiang)
吴凡 (Wu Fan)
Current Assignee
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd filed Critical Beijing Ape Power Future Technology Co Ltd
Priority to CN202110338134.4A
Publication of CN115148225A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application provides an intonation scoring method, an intonation scoring system, a computing device, and a storage medium. The method comprises the following steps: acquiring target audio, input by a user, corresponding to text content; obtaining example audio corresponding to the text content; calculating fundamental frequency envelopes of the target audio and the example audio respectively; obtaining time alignment information of the same text segments of the text content in the target audio and the example audio respectively; calculating at least two classes of fundamental frequency similarity characteristic values between the target audio and the example audio according to the fundamental frequency envelopes and the time alignment information; and inputting the at least two classes of fundamental frequency similarity characteristic values into a preset intonation scoring model to obtain an intonation score of the target audio. The method and the device improve the adaptability and effectiveness of intonation scoring, and improve the reasonableness and accuracy of scoring the target audio.

Description

Intonation scoring method, intonation scoring system, computing device and storage medium
Technical Field
The present application relates to the field of intonation scoring technologies, and in particular, to an intonation scoring method, an intonation scoring system, a computing device, and a storage medium.
Background
In the related art, intonation is usually evaluated by measuring the fundamental frequency similarity between the target speech and the example speech with a single metric. Such an approach is overly simple, insufficiently adaptable, and of limited effectiveness. In children's English teaching scenarios in particular, students' spoken pronunciation is more variable and less predictable than that of adult learners, so a more robust intonation evaluation method is needed.
Disclosure of Invention
To overcome the problems in the related art, the present application provides an intonation scoring method that can score spoken intonation more accurately, stably, and reasonably.
An intonation scoring method comprises the following steps: acquiring target audio, input by a user, corresponding to text content; obtaining example audio corresponding to the text content; calculating fundamental frequency envelopes of the target audio and the example audio respectively; obtaining time alignment information of the same text segments of the text content in the target audio and the example audio respectively; calculating at least two classes of fundamental frequency similarity characteristic values between the target audio and the example audio according to the fundamental frequency envelopes and the time alignment information; and inputting the at least two classes of fundamental frequency similarity characteristic values into a preset intonation scoring model to obtain the intonation score of the target audio.
In the above method, the text segments include a first type of text segment, and the first type of text segment includes at least one word; the obtaining time alignment information of the same text segment in the text content corresponding to the target audio and the example audio respectively, and calculating at least two types of fundamental frequency similarity characteristic values between the target audio and the example audio according to the fundamental frequency envelope and the time alignment information includes: respectively obtaining first time alignment information of the same first type of text segments in the text content corresponding to the target audio and the example audio; and calculating at least two types of fundamental frequency similarity characteristic values based on the first type of text segments between the target audio and the example audio according to the fundamental frequency envelope and the first time alignment information.
The method further comprises the following steps: dividing the text content to obtain at least two first-class text segments, wherein the number of words included in each first-class text segment is the same; or, the text content is divided to obtain at least two first-class text segments, and the number of words contained in at least two first-class text segments is different.
In the above method, the fundamental frequency similarity feature value obtained based on the first type text segment includes a basic statistic feature value, a distance feature value, a fundamental frequency range feature value, and a polynomial fitting coefficient feature value.
In the method, the basic statistic feature values obtained based on the first type of text segment include a minimum fundamental frequency, a maximum fundamental frequency, a median value of the fundamental frequency, and a value range of the fundamental frequency.
The distance characteristic values obtained based on the first type of text segments comprise the Euclidean distance, Manhattan distance, DTW distance, longest common subsequence score, and BLEU score.
Wherein computing the fundamental frequency range class characteristic values comprises: calculating, for each text segment, the fundamental frequency value range of the target audio and of the example audio respectively; calculating, for each text segment, the ratio of the fundamental frequency value range of the target audio to that of the example audio; and taking the maximum value, the minimum value, or the average value of all the ratios.
Wherein computing the polynomial fitting coefficient characteristic values comprises: fitting the fundamental frequency sequence of the target audio with a polynomial and using the resulting sequence of fitting coefficients as characteristic values.
The method further comprises the following steps: dividing the text content into a plurality of second-type text segments, wherein each second-type text segment comprises at least one syllable; obtaining second time alignment information of the same second-type text segments in the text content corresponding to the target audio and the example audio; calculating at least one class of syllable-level fundamental frequency similarity characteristic values based on the second-type text segments between the target audio and the example audio according to the fundamental frequency envelope and the second time alignment information; and inputting the at least two classes of fundamental frequency similarity characteristic values based on the first-type text segments and the at least one class of fundamental frequency similarity characteristic values based on the second-type text segments into a preset intonation scoring model to obtain the intonation scoring result of the target audio.
Wherein the dividing the text content into a plurality of second-type text segments comprises: dividing the text content to obtain at least two second-class text segments, wherein the number of syllables of each second-class text segment is the same; or, the text content is divided to obtain at least two second-type text segments, and the number of syllables of the at least two second-type text segments is different.
The fundamental frequency similarity characteristic value based on the second type text segment comprises a basic statistic characteristic value, a fundamental frequency change trend characteristic value and a distance correlation characteristic value.
In the above method, the obtaining of the example audio corresponding to the text content includes: and obtaining the language features of the words in the text content, inputting the language features of the words into a preset acoustic feature prediction model, and obtaining the fundamental frequency envelope of the example audio corresponding to the text content.
Wherein inputting the language features of a word into a preset acoustic feature prediction model to obtain the fundamental frequency envelope of the example audio corresponding to the text content comprises: inputting the language features of the word into a speech duration prediction module to obtain the pronunciation duration of the word; expanding the language features of the word by a multiple determined by the ratio of the pronunciation duration of the word to the fundamental frequency envelope sampling period, to obtain an expanded language feature sequence; and inputting the expanded language feature sequence into a fundamental frequency prediction module to obtain the fundamental frequency envelope of the audio of the word.
Wherein the language features are vectors comprising one or more of the following information: the current word content, the previous word content and the next word content; and the current word syllable count, the previous word syllable count, and the next word syllable count.
An intonation scoring system, comprising:
an audio input module for inputting a target audio corresponding to the text content and an example audio corresponding to the text content;
the audio preprocessing module is electrically connected with the audio input module and used for calculating fundamental frequency envelopes of the target audio and the example audio and calculating time alignment information of the same text segment in text content corresponding to the target audio and the example audio;
a fundamental frequency similarity characteristic value calculating module, electrically connected to the audio preprocessing module, for calculating at least two classes of fundamental frequency similarity characteristic values between the target audio and the example audio according to the respective fundamental frequency envelopes of the target audio and the example audio and the time alignment information;
and the prediction score module is electrically connected with the fundamental frequency similarity characteristic value calculation module and is used for scoring the intonation of the target audio according to the at least two classes of fundamental frequency similarity characteristic values.
In the above system, the text segments include a first type of text segment, and the first type of text segment includes at least one word;
and the audio preprocessing module, wherein calculating the time alignment information of the same text segments in the text content corresponding to the target audio and the example audio comprises: calculating first time alignment information of the target audio and the example audio corresponding to the same first-type text segments in the text content;
and a fundamental frequency similarity feature value calculation module, configured to calculate at least two types of fundamental frequency similarity feature values between the target audio and the example audio according to the fundamental frequency envelope and the time alignment information, including: and calculating at least two types of fundamental frequency similarity characteristic values between the target audio and the example audio based on the first type of text segments according to the fundamental frequency envelopes and the first time alignment information.
In the above system, the audio preprocessing module further includes: dividing the text content to obtain at least two first-class text segments, wherein the number of words included in each first-class text segment is the same; or dividing the text content to obtain at least two first-class text segments, wherein the at least two first-class text segments have different word numbers.
Wherein, the fundamental frequency similarity characteristic value obtained based on the first type text segment comprises: basic statistic eigenvalue, distance eigenvalue, fundamental frequency range class eigenvalue and polynomial fitting coefficient eigenvalue.
In the above system, the audio preprocessing module further includes: dividing the text content into a plurality of second type text segments, wherein the second type text segments comprise at least one syllable;
obtaining second time alignment information of a same second type of text segment in the text content corresponding to the target audio and the example audio;
the fundamental frequency similarity characteristic value calculation module is used for calculating at least one class of syllable-level fundamental frequency similarity characteristic values between the target audio and the example audio based on the second-type text segments according to the fundamental frequency envelope and the second time alignment information;
and the prediction score module inputs the at least two types of fundamental frequency similarity characteristic values based on the first type of text segments and the at least one type of fundamental frequency similarity characteristic values based on the second type of text segments into a preset intonation scoring model to obtain an intonation scoring result of the target audio.
A computing device, comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method as described above.
A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of a computing device, causes the processor to perform the method as described above.
According to the intonation scoring method, the target audio is scored using at least two classes of word-level fundamental frequency similarity characteristic values as calculation parameters. Spoken intonation at different learning stages can thus be scored reasonably, the limited effect of scoring spoken intonation with a single variable is avoided, the adaptability and effectiveness of intonation scoring are improved, and the reasonableness and accuracy of scoring the target audio are improved.
Furthermore, an embodiment of the present invention also divides the example audio and the target audio at the syllable level, and then uses at least two classes of word-level fundamental frequency similarity characteristic values together with at least one class of syllable-level fundamental frequency similarity characteristic values as model inputs to obtain the intonation scoring result, further improving the accuracy of scoring the target audio.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the application.
Fig. 1 is a schematic flow chart illustrating an intonation scoring method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating an intonation scoring method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating an intonation scoring method according to an embodiment of the present application;
FIG. 4 is a block diagram illustrating a schematic structure of an intonation scoring system according to an embodiment of the present disclosure;
fig. 5 is a block diagram schematically illustrating a structure of an electronic device according to an embodiment of the present application.
Description of reference numerals:
100. an intonation scoring system; 101. an audio input module; 102. an audio preprocessing module; 103. a fundamental frequency similarity characteristic value calculating module; 104. a prediction score module; 200. an electronic device; 201. a memory; 202. a processor.
Detailed Description
Preferred embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present application have been illustrated in the accompanying drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In the related art, intonation is usually evaluated by measuring the fundamental frequency similarity between the target speech and the example speech with a single metric. Such an approach is overly simple, insufficiently adaptable, and of limited effectiveness. In children's English teaching scenarios in particular, students' spoken pronunciation is more variable and less predictable than that of adult learners, so a more robust intonation evaluation method is needed.
In view of the above problems, embodiments of the present application provide an intonation scoring method that can score a spoken utterance more accurately, more stably, and more reasonably. The embodiments take intonation scoring of English sentences as an example, but the invention is not limited thereto and may also be applied to intonation scoring of other languages, such as Chinese, German, and Japanese.
The intonation scoring method provided by the embodiment of the present invention comprises the following steps:
acquiring target audio, input by a user, corresponding to text content; and obtaining example audio corresponding to the text content. The example audio is a standard pronunciation of the text content, and the system stores it in advance. For a given text content, the complete example audio corresponding to the text content may be stored; alternatively, standard pronunciations of the words making up the text content may be stored and combined into the example audio corresponding to the text content.
The target audio is an audio recording of the user reading the text content shown by the system, and the embodiment of the present invention scores how standard the intonation of the target audio is. The form of the score is not limited by the present invention; for example, the score may use a percentage scale, or a set of levels such as the three levels A, B, and C.
Fundamental frequency envelopes of the target audio and the example audio are calculated respectively; the text content is divided into a plurality of text segments, and the time information of the target audio corresponding to each text segment, together with the time alignment information of the example audio corresponding to the same text segments, is obtained. The example audio and the target audio are thus aligned in time for the same text segments. According to the time alignment information, for any given text segment, the portions of the example audio and the target audio corresponding to that segment can be extracted.
And calculating at least two types of fundamental frequency similarity characteristic values between the target audio and the example audio according to the fundamental frequency envelopes and the time alignment information.
Since for any text segment in the text content, the example audio and the target audio corresponding to the text segment can be extracted, the fundamental frequency similarity characteristic value between the target audio and the example audio can be calculated. The method needs to calculate at least two types of fundamental frequency similarity characteristic values.
And inputting the at least two types of fundamental frequency similarity characteristic values into a preset intonation scoring model to obtain an intonation scoring result of the target audio.
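To make the flow above concrete, the following Python outline sketches one possible arrangement of the steps. It is illustrative only; all helper names (extract_f0_envelope, align_segments, basic_statistic_features, distance_features) are hypothetical placeholders and are not part of the patent.

```python
# Illustrative pipeline sketch; every helper called here is a hypothetical placeholder.
def score_intonation(target_wav, example_wav, text, model):
    # 1. Fundamental frequency envelopes of both recordings
    f0_target = extract_f0_envelope(target_wav)      # hypothetical helper
    f0_example = extract_f0_envelope(example_wav)    # hypothetical helper

    # 2. Time alignment of the same text segments in both recordings
    segs_target = align_segments(target_wav, text)    # e.g. (segment, start, end) triples
    segs_example = align_segments(example_wav, text)

    # 3. At least two classes of fundamental frequency similarity characteristic values
    features = []
    features += basic_statistic_features(f0_target, f0_example, segs_target, segs_example)
    features += distance_features(f0_target, f0_example, segs_target, segs_example)

    # 4. Preset intonation scoring model maps the feature vector to a score
    return model.predict([features])[0]
```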
In another embodiment of the present invention, the text segment contains at least one word. Namely, the text content is divided by taking the word as the minimum unit to obtain a plurality of text segments.
The preferred approach is to divide each word of the text content into a separate text segment. For example, the text content "What is your name?" is divided into four text segments, "What", "is", "your", and "name", so that the audio segments of the example audio and the target audio corresponding to these four text segments are obtained.
Beyond this, a text segment obtained in units of words may contain two or more words, which sometimes constitute a phrase. Taking "What is your name?" as an example, it may be divided into two text segments, "What is" and "your name", and the audio segments of the example audio and the target audio corresponding to these two text segments are obtained.
As another implementation, the text segments obtained need not all contain the same number of words. Certain rules or algorithms may be used to obtain text segments whose word counts differ, or are only partially the same. Taking "What is your name?" as an example, three text segments may be "What is", "your", and "name": the first text segment contains two words, while the second and third each contain one word. The audio segments of the example audio and the target audio corresponding to these three text segments are then obtained.
Obviously, for complex sentences, the text content can be divided in a more complex manner. This is not described in detail herein.
The following describes in detail the technical solution of the embodiment of the present application, with reference to the drawings, assuming a scenario in which text segments each include one word.
Fig. 1 is a schematic flowchart illustrating a intonation scoring method according to an embodiment of the present application.
Referring to fig. 1, a first aspect of the present application provides an intonation scoring method, including:
step S100: acquiring target audio corresponding to text content input by a user; and obtaining example audio corresponding to the textual content.
Step S102: respectively calculating fundamental frequency envelopes of the target audio and the example audio; and respectively obtaining first time alignment information of the same word in the text content of the target audio and the example audio.
Step S104: and calculating at least two types of word-level fundamental frequency similarity characteristic values between the target audio and the example audio in units of words according to the fundamental frequency envelope and the first time alignment information.
The at least two classes of word-level fundamental frequency similarity characteristic values are word-level characteristic values of different dimensions, including word-level basic statistic characteristic values, word-level distance characteristic values, word-level fundamental frequency range class characteristic values, and word-level polynomial fitting coefficient characteristic values. In this embodiment, a more accurate intonation score is obtained by computing fundamental frequency similarity characteristic values of multiple dimensions.
Step S106: and inputting the similarity characteristic values of the fundamental frequencies of at least two word levels into a preset intonation scoring model to obtain an intonation scoring result of the target audio.
In the intonation scoring method of the above embodiment, the target audio input by the user and corresponding to the text content is obtained as the audio to be scored, and the example audio corresponding to the text content is obtained as the standard audio against which the target audio is compared for scoring.
First, the fundamental frequency envelopes of the target audio and the example audio are calculated, and first time alignment information of the same word in the text content is obtained for each audio, i.e., the start and end times of the pronunciation of the same word in the target audio and in the example audio. Word-level fundamental frequency similarity characteristic values of multiple dimensions between the target audio and the example audio are then calculated in units of words.
For example, according to the start and end times of the same word in the target audio and the example audio, 25 fundamental frequency points are sampled at equal intervals within the word for each audio, yielding a 25-point fundamental frequency sample sequence for each; by comparing and analyzing fundamental frequency characteristics of multiple dimensions, word-level fundamental frequency similarity characteristic values of at least two dimensions between the target audio and the example audio can be calculated.
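A minimal sketch of this equal-interval sampling is given below. The 25-point count follows the example in the text; linear interpolation between frames and a 20 ms frame period are assumptions of this sketch, not requirements of the patent.

```python
import numpy as np

def resample_word_f0(f0, start, end, frame_period=0.02, n_points=25):
    """Sample n_points fundamental frequency values at equal intervals within a
    word's [start, end) time span (times in seconds, f0 holds one value per frame)."""
    lo = int(start / frame_period)
    hi = max(int(end / frame_period), lo + 1)
    word_f0 = np.asarray(f0[lo:hi], dtype=float)
    # equally spaced positions inside the word; linear interpolation is an assumption
    idx = np.linspace(0, len(word_f0) - 1, n_points)
    return np.interp(idx, np.arange(len(word_f0)), word_f0)
```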
And secondly, using the word-level fundamental frequency similarity characteristic values of at least two dimensions as the input of a preset intonation scoring model. The preset intonation scoring model is, for example, a machine learning model, and the machine learning model can be obtained by using, for example, logistic regression, a support vector machine, or deep neural network training of various structures. Thus, when the word-level fundamental frequency similarity characteristic values of at least two dimensions between the target audio and the example audio are input, the preset intonation scoring model can calculate the intonation score of the target audio through a machine learning model.
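As one concrete possibility for the machine-learning scoring model mentioned above, a logistic-regression classifier over the feature vector could be trained as sketched below. The use of scikit-learn and the A/B/C label scheme are assumptions for illustration, not part of the patent.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_intonation_model(X, y):
    """X: one row of fundamental frequency similarity characteristic values per utterance.
    y: human intonation labels, e.g. "A", "B", "C" (assumed label scheme)."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    return model

# Scoring a new target audio, given its feature vector:
# grade = train_intonation_model(X, y).predict([feature_vector])[0]
```

A support vector machine or a deep neural network could be substituted for the logistic regression without changing the surrounding pipeline.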
Based on the above, when word-level fundamental frequency similarity characteristic values of at least two dimensions are used as calculation parameters to evaluate the spoken intonation of the target audio, spoken intonation at different learning stages can be evaluated reasonably. At the same time, evaluating spoken intonation with a single variable is avoided, so adaptability and scoring effectiveness are better, and the reasonableness and accuracy of scoring the target audio are improved.
The fundamental frequency envelope of speech may be calculated with various algorithms, such as the autocorrelation method of the open-source phonetics software Praat, the YIN algorithm, or the improved algorithm provided by the open-source speech recognition toolkit Kaldi. Forced alignment of the speech with the text, which yields the first time alignment information of the same word in the text content for the target audio and the example audio, may be implemented with the Viterbi algorithm, for example using the open-source speech toolkits HTK or Kaldi.
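As one open-source option in the YIN family of algorithms mentioned above, librosa's pYIN implementation can produce a frame-level fundamental frequency envelope. librosa is not named in the patent; this is only an illustrative sketch, and the 16 kHz sample rate and 20 ms hop are assumptions.

```python
import numpy as np
import librosa

def f0_envelope(wav_path, fmin=65.0, fmax=600.0):
    """Frame-level fundamental frequency envelope via pYIN (one value per 20 ms frame)."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=fmin, fmax=fmax, sr=sr,
        frame_length=2048, hop_length=320)   # 320 samples = 20 ms at 16 kHz
    return np.nan_to_num(f0)                 # unvoiced frames are reported as NaN
```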
In an embodiment of the application, the word-level fundamental frequency similarity feature values include a word-level basic statistic feature value, a word-level distance feature value, a word-level fundamental frequency range class feature value, and a word-level polynomial fitting coefficient feature value.
The calculated word-level basic statistic characteristic values, word-level distance characteristic values, and word-level fundamental frequency range class characteristic values are jointly input into the preset intonation scoring model, and the machine learning module in the intonation scoring model performs weighted scoring according to these input characteristic values and the intonation scores associated with them, thereby obtaining the intonation scoring result of the target audio.
In some embodiments, the word-level basic statistic feature values include minimum fundamental frequency, maximum fundamental frequency, median fundamental frequency, and range of fundamental frequency values of words in the sentence.
In some embodiments, the word-level distance features include a Euclidean distance, a Manhattan distance, a DTW distance, a longest common subsequence score, and a BLEU score between the same words in the textual content.
In some embodiments, the word-level fundamental frequency range class feature values include a maximum value of the target audio word fundamental frequency range in proportion to the example audio fundamental frequency range, a minimum value of the target audio word fundamental frequency range in proportion to the example audio fundamental frequency range, an average value of the target audio word fundamental frequency range in proportion to the example audio fundamental frequency range, and a proportion of the fundamental frequency range of the target audio whole sentence in proportion to the example audio whole sentence fundamental frequency range.
Specifically, the word-level basic statistic characteristic values include the minimum fundamental frequency, the maximum fundamental frequency, the median fundamental frequency, and the fundamental frequency value range (i.e., the maximum fundamental frequency minus the minimum fundamental frequency) of each word.
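A minimal sketch of these word-level basic statistics, assuming the word's fundamental frequency samples are already available as an array with unvoiced frames removed:

```python
import numpy as np

def word_basic_statistics(word_f0):
    """word_f0: fundamental frequency samples of one word (unvoiced/zero frames removed)."""
    word_f0 = np.asarray(word_f0, dtype=float)
    return {
        "f0_min": float(word_f0.min()),
        "f0_max": float(word_f0.max()),
        "f0_median": float(np.median(word_f0)),
        "f0_range": float(word_f0.max() - word_f0.min()),  # maximum minus minimum
    }
```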
The word-level distance characteristic values mainly measure various distances between the fundamental frequency sequences of corresponding words in the target audio and the example audio, including the Euclidean distance, Manhattan distance, DTW (dynamic time warping) distance, longest common subsequence score, BLEU (bilingual evaluation understudy) score, and the like.
For example, assuming the text content consists of 3 words, the Euclidean distance feature between the target audio and the example audio is computed by calculating the Euclidean distance between the target audio and example audio fundamental frequency sequences of each word, and then averaging the Euclidean distances of the three word pairs.
The Manhattan distance and the DTW distance are computed in a similar way to the Euclidean distance: the Manhattan or DTW distance between the target audio and the example audio is obtained for each corresponding word, and the Manhattan or DTW distances over all three word pairs are then averaged.
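A sketch of this word-pair distance averaging is given below. The DTW is a plain dynamic-programming implementation written for illustration; the patent does not prescribe a particular library, and the assumption that the per-word sequences have equal length (e.g. the 25-point resampling above) applies to the Euclidean and Manhattan distances.

```python
import numpy as np

def dtw_distance(a, b):
    """Simple DTW between two 1-D fundamental frequency sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def sentence_distance_features(target_words_f0, example_words_f0):
    """Each argument: list of per-word fundamental frequency arrays, in the same word order."""
    eucl, manh, dtw = [], [], []
    for t, e in zip(target_words_f0, example_words_f0):
        t, e = np.asarray(t, float), np.asarray(e, float)  # assumed equal length per word
        eucl.append(np.linalg.norm(t - e))
        manh.append(np.abs(t - e).sum())
        dtw.append(dtw_distance(t, e))
    # each distance is averaged over all word pairs of the sentence
    return {"euclidean": float(np.mean(eucl)),
            "manhattan": float(np.mean(manh)),
            "dtw": float(np.mean(dtw))}
```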
The longest common subsequence score and the BLEU score are both obtained by converting the sequence of fundamental frequency sampling points within each word into a variation trend sequence and then comparing the two trend sequences. The longest common subsequence score is the ratio of the length of the longest common subsequence of the fundamental frequency trend sequences of the target audio and the example audio to the length of the whole trend sequence.
The BLEU score is an evaluation index used in machine translation to judge how similar two sentences are; here the BLEU index is computed on the fundamental frequency variation trend sequences of the target speech and the example speech and used as one of the features characterizing the similarity of the two trend sequences. Both methods first convert the fundamental frequency sampling point sequence within each word into a variation trend sequence and then compare the trend sequences. Specifically, each fundamental frequency sampling point is compared with the next point in turn: if the next point is larger than the current point, the current trend value is 1; if it is smaller, the value is -1; if the two are equal, the value is 0. The whole fundamental frequency sampling point sequence is thus converted into a variation trend sequence describing how it rises and falls. As an example of the longest common subsequence score, suppose the word "got" has a fundamental frequency sampling point sequence of (65, 68, 74, 90, 90, 87, 76, ...); the corresponding variation trend sequence is (1, 1, 1, 0, -1, -1, ...). Let the target audio fundamental frequency points of a word, (F_o1, F_o2, F_o3, ...), correspond to the trend sequence (Trend_o1, Trend_o2, Trend_o3, ...), and the example audio points (F_s1, F_s2, F_s3, ...) correspond to (Trend_s1, Trend_s2, Trend_s3, ...). The BLEU score is then calculated, for example, as follows: the BLEU index between (Trend_o1, Trend_o2, Trend_o3, ...) and (Trend_s1, Trend_s2, Trend_s3, ...) is computed for each word; for a three-word sentence this yields three BLEU scores, and the average of the three scores is the BLEU score feature of the sentence.
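A sketch of the trend-sequence conversion and the longest-common-subsequence score follows; the BLEU computation would apply the standard n-gram formulation to the same trend sequences and is omitted for brevity. The normalisation by the trend-sequence length is an assumption where the text is ambiguous.

```python
import numpy as np

def trend_sequence(f0):
    """Convert a fundamental frequency sample sequence into a rise/fall/flat sequence:
    1 if the next point is higher, -1 if lower, 0 if equal."""
    f0 = np.asarray(f0, dtype=float)
    return np.sign(np.diff(f0)).astype(int).tolist()

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_score(target_word_f0, example_word_f0):
    t, e = trend_sequence(target_word_f0), trend_sequence(example_word_f0)
    # normalised by the trend-sequence length (the exact normalisation is an assumption)
    return lcs_length(t, e) / max(len(t), 1)

# trend_sequence([65, 68, 74, 90, 90, 87, 76]) -> [1, 1, 1, 0, -1, -1]
```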
The word-level fundamental frequency range class characteristic values mainly measure the proportion of the target audio fundamental frequency value range to the example audio fundamental frequency value range, and include:
a. the maximum of the ratios of the target audio word fundamental frequency value range to the example audio word fundamental frequency value range;
b. the minimum of those ratios;
c. the average of those ratios;
d. the ratio of the fundamental frequency value range of the whole target audio sentence to that of the whole example audio sentence.
Specifically, for each word of the target audio and the example audio, the fundamental frequency value range (maximum minus minimum) over the word's time span is obtained, and the ratio of the target audio word's range to the example audio word's range is computed. The maximum of these ratios constitutes feature a; the minimum constitutes feature b; the average of all ratios constitutes feature c; and over the whole sentence, the ratio of the target audio fundamental frequency range to the example audio fundamental frequency range constitutes feature d.
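A sketch of features a to d, assuming per-word fundamental frequency arrays and whole-sentence arrays are available:

```python
import numpy as np

def f0_range(f0):
    f0 = np.asarray(f0, dtype=float)
    return float(f0.max() - f0.min())

def range_ratio_features(target_words_f0, example_words_f0,
                         target_sent_f0, example_sent_f0):
    ratios = [f0_range(t) / f0_range(e)
              for t, e in zip(target_words_f0, example_words_f0)]
    return {
        "range_ratio_max": max(ratios),              # feature a
        "range_ratio_min": min(ratios),              # feature b
        "range_ratio_mean": float(np.mean(ratios)),  # feature c
        "sentence_range_ratio":                      # feature d
            f0_range(target_sent_f0) / f0_range(example_sent_f0),
    }
```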
The word-level polynomial fitting coefficient characteristic values are obtained by fitting the fundamental frequency sequence of the whole target audio sentence with polynomials from low order to high order; the fitting coefficients of each polynomial order form one class of features, and the fitting error of each polynomial is also used as a feature. Fitting a polynomial of order n provides n + 1 features. Polynomials of orders 1 to 7 are used in practice, so the number of features in this class is 2 + 3 + 4 + 5 + 6 + 7 + 8 = 35.
The specific calculation is as follows: the fundamental frequency sequence of the whole target audio sentence is a series of numbers (an array) to which a polynomial can be fitted. For example, suppose the numbers are (a, b, c, d, e, f, ...). If a first-order polynomial is used and the fitted polynomial is mx + n with fitting error p, then the fitting coefficient m and the fitting error p are the two features provided by the first-order fit. If a second-order polynomial is used and the fitted polynomial is mx² + nx + o with fitting error p, then the fitting coefficients m and n and the fitting error p are the 3 features provided by the second-order fit; and so on for higher-order polynomials.
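A sketch of this feature class using numpy.polyfit. Following the example above, each order keeps its non-constant coefficients plus one error term, so orders 1 through 7 yield 2 + 3 + ... + 8 = 35 values; dropping the constant term is taken from the worked example and is an interpretation of this sketch.

```python
import numpy as np

def polynomial_fit_features(sentence_f0, max_order=7):
    """Fit the whole-sentence fundamental frequency sequence with polynomials of
    order 1..max_order; for each order keep the non-constant coefficients plus the
    fitting error (sum of squared residuals)."""
    y = np.asarray(sentence_f0, dtype=float)
    x = np.arange(len(y), dtype=float)
    features = []
    for order in range(1, max_order + 1):
        coeffs, residuals, *_ = np.polyfit(x, y, order, full=True)
        features.extend(coeffs[:-1].tolist())                       # drop the constant term
        features.append(float(residuals[0]) if len(residuals) else 0.0)
    return features
```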
After obtaining the word-level fundamental frequency similarity characteristic value, another embodiment of the present invention further includes: obtaining second time alignment information of the same syllable of the same word in the text content of the target audio and the example audio; calculating at least one syllable-level fundamental frequency similarity characteristic value between the target audio and the example audio by taking syllable as a unit according to the fundamental frequency envelope and the second time alignment information; and taking the at least two word-level fundamental frequency similarity characteristic values and the at least one syllable-level fundamental frequency similarity characteristic value as the input of a preset intonation scoring model to obtain an intonation scoring result of the target audio.
A syllable is the smallest phonological unit, formed by combining vowel phonemes and consonant phonemes; a single vowel phoneme can also form a syllable on its own. For Chinese, a syllable in the embodiment of the present invention is formed by combining vowel and consonant phonemes. Obviously, dividing the text content by syllables yields smaller text segments than dividing it in units of words, and correspondingly yields more and smaller example audio segments and target audio segments for one sentence.
Taking English as an example, the pronunciation phonetic symbol of each word in the text content is obtained by searching the database, so as to obtain the syllable corresponding to the whole text content.
The complete example audio corresponding to the text content may be stored in the system; standard pronunciations of the words constituting the text content may also be stored and combined according to the text to obtain the example audio; or syllable audio may be retrieved according to the pronunciation of each word in the text content to obtain the example audio. Other methods may likewise be used without affecting the objects of the present invention.
In dividing the text content, it is preferable to divide each syllable of each word in the text content into separate text segments.
Alternatively, a text segment may contain two or more syllables. Syllables that are divided into the same text segment may belong to the same word or to different words that are concatenated one after the other. The invention is not limited.
Dividing the text content in units of syllables can yield a plurality of second-type text segments containing the same number of syllables, or second-type text segments containing different numbers of syllables. That is, similar to dividing the text content in units of words, when dividing in units of syllables the number of syllables in each text segment need not be the same; certain rules or algorithms may be used to obtain text segments whose syllable counts differ, or are only partially the same.
Next, at least one syllable-level fundamental frequency similarity characteristic value can be calculated based on the fundamental frequency envelopes of the target audio and the example audio and the second time alignment information of the same syllable of the same word in the text content.
According to the start and end times of the same syllable of the same word in the target audio and the example audio, 25 fundamental frequency points of the syllable are sampled at equal intervals in each audio, yielding a 25-point fundamental frequency sample sequence for each. By comparing and analyzing fundamental frequency characteristics of multiple dimensions, at least one syllable-level fundamental frequency similarity characteristic value between the target audio and the example audio can be calculated. The at least two word-level fundamental frequency similarity characteristic values and the at least one syllable-level fundamental frequency similarity characteristic value are then input into the preset intonation scoring model. Compared with using only the at least two word-level characteristic values as model input, this increases the types and number of fundamental frequency similarity characteristic values used to compare and score the target audio against the example audio, so the accuracy of scoring the target audio can be further improved.
In the above embodiment, the syllable-level fundamental frequency similarity feature values include a syllable-level basic statistic feature value, a syllable-level fundamental frequency variation trend feature value, and a syllable-level distance correlation feature value.
In particular, the syllable-level basic statistic characteristic value refers to the ratio of the fundamental frequency variance within a target audio syllable to the fundamental frequency variance within the corresponding example audio syllable. Similar to the word-level features, each word comprises one or more syllables; the variance of all fundamental frequency sampling points within each syllable is calculated, the ratio of the target speech variance to the example speech variance is computed for each corresponding syllable, and the average of these ratios over all syllables of the whole sentence gives the feature.
For example, assuming that a sentence contains 5 syllables, the variance of the fundamental frequency of the target speech on the 5 syllables is d1, d2, d3, d4 and d5, and the variance of the fundamental frequency of the example speech on the 5 syllables is e1, e2, e3, e4 and e5, the calculation method of the feature is: (d 1/e1+ d2/e2+ d3/e3+ d4/e4+ d5/e 5)/5.
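A minimal sketch of this syllable-level variance-ratio feature, matching the five-syllable example above:

```python
import numpy as np

def syllable_variance_ratio(target_syllables_f0, example_syllables_f0):
    """Average, over all syllables of the sentence, of the ratio between the
    fundamental frequency variance of the target audio and that of the example
    audio on the corresponding syllable, i.e. (d1/e1 + ... + dN/eN) / N."""
    ratios = [np.var(np.asarray(t, float)) / np.var(np.asarray(e, float))
              for t, e in zip(target_syllables_f0, example_syllables_f0)]
    return float(np.mean(ratios))
```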
The syllable-level fundamental frequency variation trend feature includes the syllable longest common subsequence score, i.e., the ratio, computed on each syllable, of the length of the longest common subsequence of the fundamental frequency variation trend sequences of the target audio and the example audio to the length of the whole trend sequence.
For example: assuming that a sentence contains 5 syllables, the longest common subsequence length of the fundamental frequency trend sequence of the target speech and the example speech on the 5 syllables is L1, L2, L3, L4 and L5, and the length of the fundamental frequency trend sequence on the 5 syllables is T1, T2, T3, T4 and T5, the calculation method of the characteristic is as follows: (L1/T1 + L2/T2+ L3/T3+ L4/T4+ L5/T5)/5.
Syllable-level distance-related features are the individual distance features (e.g., Euclidean distance, Manhattan distance, DTW distance, etc.) computed on each syllable; the distance scores of the individual syllables are then averaged over the whole sentence.
These characteristic values are input into a pre-trained preset model, so that the intonation score of the target audio can be predicted.
Fig. 2 is a schematic flow chart of an intonation scoring method according to another embodiment of the present application.
Referring to fig. 2, according to another embodiment of the present application, an intonation scoring method includes:
step S200: acquiring target audio corresponding to text content input by a user; and obtaining example audio corresponding to the textual content.
Step S202: respectively calculating fundamental frequency envelopes of the target audio and the example audio; and obtaining second time alignment information of the same syllable of the same word in the text content of the target audio and the example audio, respectively.
Step S204: at least two dimensional pitch-level fundamental frequency similarity feature values between the target audio and the example audio are calculated in units of syllables according to the fundamental frequency envelope and the second time alignment information.
Step S206: and taking the similarity characteristic values of the at least two pitch-level fundamental frequencies as the input of a preset intonation scoring model to obtain an intonation scoring result of the target audio.
In this embodiment, at least two syllable-level fundamental frequency similarity characteristic values between the target audio and the example audio are calculated in units of syllables from the fundamental frequency envelopes of the target audio and the example audio and the second time alignment information of the same syllable of the same word in the text content, in the same way as in any of the above embodiments. The at least two syllable-level fundamental frequency similarity characteristic values are then used as the input of the preset score prediction model, so that the intonation score of the target audio can be predicted. Compared with scoring spoken intonation with a single variable, the adaptability and scoring effect are better, and the reasonableness and accuracy of scoring the target audio are improved.
In some embodiments, the at least two classes of syllable-level fundamental frequency similarity characteristic values include a syllable-level basic statistic characteristic value, a syllable-level fundamental frequency variation trend characteristic value, and a syllable-level distance-related characteristic value.
In the above embodiment, the method further includes: obtaining first time alignment information of the same word in the text content of the target audio and the example audio; calculating at least one word-level fundamental frequency similarity characteristic value between the target audio and the example audio in units of words according to the fundamental frequency envelope and the first time alignment information; and taking the at least two syllable-level fundamental frequency similarity characteristic values and the at least one word-level fundamental frequency similarity characteristic value as the input of a preset intonation scoring model to obtain an intonation scoring result of the target audio.
In some embodiments, the word-level fundamental frequency similarity feature values include word-level basic statistic feature values, word-level distance feature values, word-level fundamental frequency range class feature values, and word-level polynomial fit coefficient feature values.
Fig. 3 is a flowchart illustrating an intonation scoring method according to another embodiment of the present application. In this embodiment, the text content is divided by single words to obtain first-type text segments, i.e., each first-type text segment contains only one word; in addition, the text content is divided by single syllables to obtain second-type text segments, i.e., each second-type text segment contains one syllable.
Referring to fig. 3, according to another embodiment of the present application, an intonation scoring method includes:
step S300: acquiring target audio corresponding to text content input by a user; and obtaining example audio corresponding to the textual content.
Step S302: respectively calculating fundamental frequency envelopes of the target audio and the example audio; first time alignment information of the same word in the text content of the target audio and the example audio, and second time alignment information of the same syllable of the same word in the text content of the target audio and the example audio are obtained, respectively.
Step S304: and calculating the word-level fundamental frequency similarity characteristic value of the target audio and the example audio in units of words according to the fundamental frequency envelope and the first time alignment information.
Step S306: and calculating syllable-level fundamental frequency similarity characteristic values of the target audio and the example audio by taking syllable as a unit according to the fundamental frequency envelope and the second time alignment information.
Step S308: and taking the word-level fundamental frequency similarity characteristic value of at least one dimension and the syllable-level fundamental frequency similarity characteristic value of at least one dimension as the input of a preset model to obtain the intonation scoring result of the target audio.
In this embodiment, from the fundamental frequency envelopes of the target audio and the example audio together with the first time alignment information of the same word in the corresponding text, and from the fundamental frequency envelopes together with the second time alignment information of the same syllable of the same word in the corresponding text, both word-level and syllable-level fundamental frequency similarity characteristic values can be calculated. By inputting at least one word-level and at least one syllable-level fundamental frequency similarity characteristic value between the target audio and the example audio into the preset model, rather than only word-level or only syllable-level characteristic values, the types and number of fundamental frequency similarity characteristic values used for comparison and scoring are increased, so the accuracy of scoring the target audio can be further improved.
The word-level fundamental frequency similarity feature value and the syllable-level fundamental frequency similarity feature value are described in detail in the above embodiments, and therefore are not described herein again.
In the above method embodiment, the example audio may be preset in the system, and if the system does not save the example audio, the example audio may be obtained by the following method.
And obtaining the language features of words in the text content, and taking the language features of the words as the input of a preset acoustic feature prediction model to obtain the fundamental frequency envelope of the example audio corresponding to the text content.
Specifically, the speech duration of a word is predicted from its language features, giving the word's pronunciation duration; the language features of the word are then expanded by a multiple determined by the ratio of the pronunciation duration to the fundamental frequency envelope sampling period, giving an expanded language feature sequence; and the expanded language feature sequence is input into a fundamental frequency prediction module to obtain the fundamental frequency envelope of the word's audio.
In some embodiments, the language features are vectors that include one or more of the following information: the current word content, the previous word content and the next word content; and the current word syllable number, the previous word syllable number, and the next word syllable number.
The language features of each word are input into a speech duration prediction neural network, which predicts each word's pronunciation duration from its language features. The multiple of each word's predicted pronunciation duration relative to the fundamental frequency envelope sampling period is obtained, and the word's language features are expanded to that multiple, giving the word's expanded language feature sequence. The expanded language features are input into a fundamental frequency prediction neural network, which predicts each word's fundamental frequency envelope from the expanded language feature sequence, and the fundamental frequency envelope of the whole sentence is thereby obtained.
The language features may include: a vector of at least one of current word content, previous word content, next word content, the number of syllables contained in the current word, the number of syllables contained in the previous word, and the number of syllables contained in the next word.
In an English-teaching scenario, the pronunciation duration prediction neural network and the fundamental frequency prediction neural network are trained on a large amount of standard-pronunciation English speech and text related to the English teaching content. After training, the pronunciation durations and fundamental frequency envelopes predicted by the two neural networks can be regarded as a relatively standard pronunciation result and used as the example audio against which the target audio is evaluated for intonation scoring; the predicted pronunciation duration of each word and the fundamental frequency value of each frame of the whole sentence serve as the acoustic prediction information of the example audio.
For example, in the sentence "I eat apple": the language feature of the word I is X, X = (I, '', eat, 1, 0, 1); the language feature of the word eat is Y, Y = (eat, I, apple, 1, 1, 2); and the language feature of the word apple is Z, Z = (apple, eat, '', 2, 1, 0). Assuming the predicted pronunciation duration of I is 300 milliseconds, that of eat is 400 milliseconds, and that of apple is 500 milliseconds, and that each frame lasts 20 milliseconds, feature X of I is repeated 15 times (300/20), feature Y of eat 20 times (400/20), and feature Z of apple 25 times (500/20), after which the fundamental frequency value of each frame of the sentence "I eat apple" can be obtained.
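As a minimal sketch of the expansion step just described, the following code repeats each word's language feature once per 20 ms frame of its predicted duration. The duration values are hard-coded stand-ins for the duration prediction network, and all function names are assumptions made for this example.

```python
# Minimal sketch of the duration-to-frame expansion described above,
# assuming a 20 ms fundamental frequency sampling period. The duration
# values are hard-coded stand-ins for the duration prediction network,
# and all function names are assumptions made for this example.
FRAME_MS = 20  # fundamental frequency envelope sampling period

def language_feature(word, prev_word, next_word, n_syll, prev_syll, next_syll):
    return (word, prev_word, next_word, n_syll, prev_syll, next_syll)

def predict_duration_ms(feature):
    # Stand-in for the speech duration prediction neural network.
    return {"I": 300, "eat": 400, "apple": 500}[feature[0]]

def expand_features(sentence_features):
    """Repeat each word's language feature once per frame of its predicted duration."""
    expanded = []
    for feature in sentence_features:
        n_frames = predict_duration_ms(feature) // FRAME_MS  # e.g. 300 / 20 = 15
        expanded.extend([feature] * n_frames)
    return expanded

sentence = [
    language_feature("I", "", "eat", 1, 0, 1),
    language_feature("eat", "I", "apple", 1, 1, 2),
    language_feature("apple", "eat", "", 2, 1, 0),
]
frames = expand_features(sentence)  # 15 + 20 + 25 = 60 frame-level feature vectors
# f0_envelope = f0_prediction_network(frames)  # would yield one F0 value per frame
print(len(frames))  # 60
```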
Fig. 4 is a block diagram illustrating the schematic structure of an intonation scoring system according to an embodiment of the present application. Taking an English application scenario as an example, the English intonation scoring system shown in Fig. 4 includes:
an audio input module 101, configured to input a target audio corresponding to the text content and an example audio corresponding to the text content;
the audio preprocessing module 102 is electrically connected to the audio input module, and is configured to calculate fundamental frequency envelopes of the target audio and the example audio, and to calculate time alignment information of the same text segment in the text content corresponding to the target audio and the example audio;
a fundamental frequency similarity characteristic value calculating module 103, electrically connected to the audio preprocessing module, configured to calculate at least two classes of fundamental frequency similarity characteristic values between the target audio and the example audio according to the respective fundamental frequency envelopes of the target audio and the example audio and the time alignment information;
and the prediction score module 104 is electrically connected with the fundamental frequency similarity characteristic value calculation module and is used for scoring the intonation of the target audio according to the at least two classes of fundamental frequency similarity characteristic values.
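For illustration only, the following sketch wires the four modules above into one scoring pipeline; the class and method names are assumptions made for this example rather than identifiers defined by the present application.

```python
# Illustrative wiring of the four modules described above. The class and
# method names are assumptions made for this sketch, not identifiers
# defined by the present application.
class IntonationScoringSystem:
    def __init__(self, audio_input, preprocessor, feature_calculator, scorer):
        self.audio_input = audio_input                 # audio input module 101
        self.preprocessor = preprocessor               # audio preprocessing module 102
        self.feature_calculator = feature_calculator   # similarity feature module 103
        self.scorer = scorer                           # prediction score module 104

    def score(self, text_content):
        # 1. Obtain the target recording and the example audio for the text.
        target, example = self.audio_input.load(text_content)
        # 2. Compute fundamental frequency envelopes and time alignment information.
        f0_target, f0_example = self.preprocessor.f0_envelopes(target, example)
        alignment = self.preprocessor.align(text_content, target, example)
        # 3. Compute at least two classes of F0 similarity characteristic values.
        features = self.feature_calculator.compute(f0_target, f0_example, alignment)
        # 4. Map the characteristic values to an intonation score.
        return self.scorer.predict(features)
```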
In the above system, the text segments include a first type of text segment, and the first type of text segment includes at least one word;
the audio preprocessing module is used for calculating first time alignment information of the same first-type text segments in the text content corresponding to the target audio and the example audio;
and the fundamental frequency similarity characteristic value calculation module is used for calculating at least two types of fundamental frequency similarity characteristic values between the target audio and the example audio based on the first type of text segments according to the fundamental frequency envelope and the first time alignment information.
In the above system, the audio preprocessing module is further configured to: divide the text content into at least two first-type text segments, where each first-type text segment contains the same number of words; or divide the text content into at least two first-type text segments, where at least two of the first-type text segments contain different numbers of words.
The fundamental frequency similarity characteristic values obtained based on the first-type text segments include: basic statistic characteristic values, distance characteristic values, fundamental frequency range characteristic values, and polynomial fitting coefficient characteristic values.
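As a non-limiting sketch, the functions below compute representative members of each of the four classes of word-level characteristic values from time-aligned fundamental frequency segments. The naive DTW implementation and the degree-3 polynomial fit are assumptions made for this example; the exact feature definitions used by the application may differ.

```python
# Non-limiting sketch of the four classes of word-level characteristic
# values, computed from time-aligned fundamental frequency (F0) segments.
# The naive DTW and the degree-3 polynomial fit are assumptions made for
# this example; the exact feature definitions may differ.
import numpy as np

def basic_statistics(f0):
    f0 = np.asarray(f0, dtype=float)
    return {"min": f0.min(), "max": f0.max(),
            "median": float(np.median(f0)), "range": f0.max() - f0.min()}

def euclidean_distance(f0_a, f0_b):
    n = min(len(f0_a), len(f0_b))  # truncate to the common length
    return float(np.linalg.norm(np.asarray(f0_a[:n], float) - np.asarray(f0_b[:n], float)))

def dtw_distance(a, b):
    # Naive dynamic time warping distance between two F0 sequences.
    d = np.full((len(a) + 1, len(b) + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return float(d[len(a), len(b)])

def f0_range_ratio(f0_target, f0_example):
    # Ratio of the target segment's F0 range to the example segment's F0 range.
    r_t = basic_statistics(f0_target)["range"]
    r_e = basic_statistics(f0_example)["range"]
    return r_t / r_e if r_e else 0.0

def polyfit_coefficients(f0, degree=3):
    # Requires more than `degree` F0 points for a well-determined fit.
    x = np.linspace(0.0, 1.0, num=len(f0))
    return np.polyfit(x, np.asarray(f0, dtype=float), degree)
```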
In the above system, the audio preprocessing module is further configured to: divide the text content into a plurality of second-type text segments, where each second-type text segment comprises at least one syllable;
and obtain second time alignment information of the same second-type text segment in the text content corresponding to the target audio and the example audio.
The fundamental frequency similarity characteristic value calculation module is used for calculating at least one class of syllable-level fundamental frequency similarity characteristic values between the target audio and the example audio based on the second-type text segments according to the fundamental frequency envelopes and the second time alignment information;
and the prediction score module inputs the at least two fundamental frequency similarity characteristic values based on the first type of text segments and the at least one fundamental frequency similarity characteristic value based on the second type of text segments into a preset intonation scoring model to obtain an intonation scoring result of the target audio.
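The following sketch, given for illustration only, computes simple syllable-level characteristic values from aligned fundamental frequency segments; the rise/fall trend comparison is an assumed simplification of a fundamental frequency change-trend feature and not necessarily the definition used by this application.

```python
# Illustration only: simple syllable-level characteristic values computed from
# time-aligned fundamental frequency segments. The rise/fall "trend" comparison
# is an assumed simplification of an F0 change-trend feature, not necessarily
# the definition used by this application.
import numpy as np

def syllable_trend(f0):
    diffs = np.diff(np.asarray(f0, dtype=float))
    return float(np.sign(diffs.sum())) if diffs.size else 0.0  # +1 rising, -1 falling

def syllable_features(f0_target, f0_example):
    n = min(len(f0_target), len(f0_example))  # compare over the common length
    t = np.asarray(f0_target[:n], dtype=float)
    e = np.asarray(f0_example[:n], dtype=float)
    return {
        "target_mean": float(t.mean()),
        "example_mean": float(e.mean()),
        "trend_match": 1.0 if syllable_trend(f0_target) == syllable_trend(f0_example) else 0.0,
        "mean_abs_diff": float(np.abs(t - e).mean()),  # per-frame F0 distance
    }
```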
The word-level fundamental frequency similarity feature value and the syllable-level fundamental frequency similarity feature value are described in detail in the above embodiments, and therefore, are not described herein again.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the intonation scoring method, and will not be elaborated herein.
The solution of the present application has been described in detail hereinabove with reference to the drawings. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments. Those skilled in the art should also appreciate that the acts and modules referred to in the specification are not necessarily required in the present application. In addition, it can be understood that the steps in the tone scoring method of the embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and the modules in the device of the embodiment of the present application may be combined, divided, and deleted according to actual needs.
Referring to fig. 5, computing device 200 includes memory 201 and processor 202.
The processor 202 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 201 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions needed by the processor 202 or other modules of the computer. The persistent storage may be a readable and writable storage device, and may be a non-volatile device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at run time. In addition, the memory 201 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 201 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, a micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 201 has stored thereon executable code that, when processed by the processor 202, may cause the processor 202 to perform some or all of the methods described above.
Furthermore, the intonation scoring method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described intonation scoring method of the present application.
Alternatively, the present application may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) that, when executed by a processor of a computing device (or an electronic device, a server, or the like), causes the processor to perform part or all of the steps of the above-described intonation scoring method according to the present application.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the applications disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and intonation scoring methods according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (21)

1. An intonation scoring method, comprising:
acquiring target audio corresponding to text content input by a user;
obtaining example audio corresponding to the text content;
respectively obtaining fundamental frequency envelopes of the target audio and the example audio;
respectively obtaining time alignment information of the same text segments in the text contents corresponding to the target audio and the example audio;
calculating at least two types of fundamental frequency similarity characteristic values between the target audio and the example audio according to the fundamental frequency envelopes and the time alignment information;
and inputting the at least two types of fundamental frequency similarity characteristic values into a preset tone scoring model to obtain tone scoring of the target audio.
2. The method of claim 1, wherein:
the text segments comprise a first type of text segment, and the first type of text segment comprises at least one word;
the obtaining time alignment information of the same text segment in the text content corresponding to the target audio and the example audio respectively, and calculating at least two types of fundamental frequency similarity characteristic values between the target audio and the example audio according to the fundamental frequency envelope and the time alignment information includes:
respectively obtaining first time alignment information of the target audio and the example audio corresponding to the same first type of text segments in the text content;
and calculating at least two types of fundamental frequency similarity characteristic values between the target audio and the example audio based on the first type of text segments according to the fundamental frequency envelopes and the first time alignment information.
3. The method of claim 2, further comprising:
dividing the text content to obtain at least two first-class text segments, wherein the number of words included in each first-class text segment is the same;
or dividing the text content to obtain at least two first-class text segments, wherein the at least two first-class text segments have different word numbers.
4. The method of claim 3, wherein:
the fundamental frequency similarity characteristic value obtained based on the first type text segment comprises a basic statistic characteristic value, a distance characteristic value, a fundamental frequency range characteristic value and a polynomial fitting coefficient characteristic value.
5. The method of claim 4, wherein:
the basic statistic characteristic values obtained based on the first type of text segments comprise minimum fundamental frequency, maximum fundamental frequency, fundamental frequency median and fundamental frequency value range.
6. The method of claim 4, wherein:
the distance characteristic values obtained based on the first type text segments comprise Euclidean distance, manhattan distance, DTW distance, longest common subsequence score and BLEU score.
7. The method of claim 5, wherein the fundamental frequency range characteristic values comprise:
respectively calculating fundamental frequency value ranges of target audio and example audio of each text segment;
calculating the ratio of the fundamental frequency value range of each text segment of the target audio to the fundamental frequency value range of each text segment of the example audio;
and obtaining the maximum value, the minimum value, or the average value of all the ratios.
8. The method of claim 5, wherein the polynomial fitting coefficient characteristic values comprise:
fitting the fundamental frequency sequence of the target audio with a polynomial to obtain a fitting coefficient sequence as a characteristic value.
9. The method of any of claims 2 to 8, further comprising:
dividing the text content into a plurality of second type text segments, wherein the second type text segments comprise at least one syllable;
obtaining second time alignment information of the same second type of text segments in the text contents corresponding to the target audio and the example audio;
calculating at least one class of syllable-level fundamental frequency similarity characteristic values between the target audio and the example audio based on the second type of text segments according to the fundamental frequency envelope and the second time alignment information;
inputting the at least two types of fundamental frequency similarity characteristic values based on the first type of text segments and the at least one type of fundamental frequency similarity characteristic values based on the second type of text segments into a preset intonation scoring model to obtain a result of the intonation scoring of the target audio.
10. The method of claim 9, wherein the dividing the text content into a plurality of second type text segments comprises:
dividing the text content to obtain at least two second-class text segments, wherein the number of syllables of each second-class text segment is the same;
or, the text content is divided to obtain at least two second-type text segments, and the number of syllables of the at least two second-type text segments is different.
11. The method of claim 10, wherein:
the fundamental frequency similarity characteristic value based on the second type text segment comprises a basic statistic characteristic value, a fundamental frequency change trend characteristic value and a distance correlation characteristic value.
12. The method of claim 9, wherein obtaining example audio corresponding to the text content, obtaining a fundamental frequency envelope of the example audio comprises:
and obtaining the language features of the words in the text content, inputting the language features of the words into a preset acoustic feature prediction model, and obtaining the fundamental frequency envelope of the example audio corresponding to the text content.
13. The method of claim 12, wherein inputting the linguistic features of the words into a preset acoustic feature prediction model to obtain a fundamental frequency envelope of the example audio corresponding to the text content comprises:
inputting the language features of the words into a voice duration prediction module to obtain the pronunciation duration of the words;
obtaining a language feature sequence which expands the language features of the word to the multiple based on the pronunciation duration of the word and the multiple of the fundamental frequency envelope sampling period;
and inputting the expanded language feature sequence into a fundamental frequency prediction module to obtain the fundamental frequency envelope of the audio frequency of the word.
14. The method of claim 13, wherein the linguistic feature is a vector comprising one or more of the following information:
current word content, previous word content and next word content; and the current word syllable count, the previous word syllable count, and the next word syllable count.
15. An intonation scoring system, comprising:
an audio input module for inputting a target audio corresponding to the text content and an example audio corresponding to the text content;
the audio preprocessing module is electrically connected with the audio input module and used for calculating fundamental frequency envelopes of the target audio and the example audio and calculating time alignment information of the same text segment in the text content corresponding to the target audio and the example audio;
a fundamental frequency similarity characteristic value calculating module, electrically connected to the audio preprocessing module, for calculating at least two classes of fundamental frequency similarity characteristic values between the target audio and the example audio according to the respective fundamental frequency envelopes of the target audio and the example audio and the time alignment information;
and the prediction score module is electrically connected with the fundamental frequency similarity characteristic value calculation module and is used for scoring the intonation of the target audio according to the at least two classes of fundamental frequency similarity characteristic values.
16. The system of claim 15, wherein:
the text segments comprise a first type of text segment, and the first type of text segment comprises at least one word;
the audio preprocessing module is used for calculating first time alignment information of the same first type of text segments in the text content corresponding to the target audio and the example audio;
and the fundamental frequency similarity characteristic value calculation module is used for calculating at least two types of fundamental frequency similarity characteristic values between the target audio and the example audio based on the first type of text segments according to the fundamental frequency envelope and the first time alignment information.
17. The system of claim 16, wherein the audio pre-processing module is further configured to:
dividing the text content to obtain at least two first-class text segments, wherein the number of words included in each first-class text segment is the same;
or, the text content is divided to obtain at least two first-class text segments, and the number of words contained in at least two first-class text segments is different.
18. The system of claim 17, wherein:
the fundamental frequency similarity characteristic values obtained based on the first type text segments comprise: basic statistic characteristic values, distance characteristic values, fundamental frequency range characteristic values and polynomial fitting coefficient characteristic values.
19. The system of claim 17, wherein the audio pre-processing module further comprises:
dividing the text content into a plurality of second type text segments, wherein the second type text segments comprise at least one syllable;
obtaining second time alignment information of the same second type of text segment in the text content corresponding to the target audio and the example audio;
the fundamental frequency similarity characteristic value calculation module is used for calculating at least one class of syllable-level fundamental frequency similarity characteristic values between the target audio and the example audio based on the second type of text segments according to the fundamental frequency envelope and the second time alignment information;
and the prediction score module inputs the at least two fundamental frequency similarity characteristic values based on the first type of text segments and the at least one fundamental frequency similarity characteristic value based on the second type of text segments into a preset intonation scoring model to obtain an intonation scoring result of the target audio.
20. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-14.
21. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of a computing device, causes the processor to perform the method of any of claims 1-14.
CN202110338134.4A 2021-03-30 2021-03-30 Intonation scoring method, intonation scoring system, computing device and storage medium Pending CN115148225A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110338134.4A CN115148225A (en) 2021-03-30 2021-03-30 Intonation scoring method, intonation scoring system, computing device and storage medium

Publications (1)

Publication Number Publication Date
CN115148225A true CN115148225A (en) 2022-10-04

Family

ID=83404516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110338134.4A Pending CN115148225A (en) 2021-03-30 2021-03-30 Intonation scoring method, intonation scoring system, computing device and storage medium

Country Status (1)

Country Link
CN (1) CN115148225A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060178874A1 (en) * 2003-03-27 2006-08-10 Taoufik En-Najjary Method for analyzing fundamental frequency information and voice conversion method and system implementing said analysis method
US20060074655A1 (en) * 2004-09-20 2006-04-06 Isaac Bejar Method and system for the automatic generation of speech features for scoring high entropy speech
CN101727902A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Method for estimating tone
US20100145698A1 (en) * 2008-12-01 2010-06-10 Educational Testing Service Systems and Methods for Assessment of Non-Native Spontaneous Speech
CN101751919A (en) * 2008-12-03 2010-06-23 中国科学院自动化研究所 Spoken Chinese stress automatic detection method
CN102163428A (en) * 2011-01-19 2011-08-24 无敌科技(西安)有限公司 Method for judging Chinese pronunciation
US20160253999A1 (en) * 2015-02-26 2016-09-01 Arizona Board Of Regents Systems and Methods for Automated Evaluation of Human Speech
CN106856095A (en) * 2015-12-09 2017-06-16 中国科学院声学研究所 The voice quality evaluating system that a kind of phonetic is combined into syllables
KR20180048136A (en) * 2016-11-02 2018-05-10 한국전자통신연구원 Method for pronunciation assessment and system using the method
CN109545189A (en) * 2018-12-14 2019-03-29 东华大学 A kind of spoken language pronunciation error detection and correcting system based on machine learning
CN111951825A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, medium, device and computing equipment
CN112331180A (en) * 2020-11-03 2021-02-05 北京猿力未来科技有限公司 Spoken language evaluation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
EMILIO MOLINA: "Fundamental frequency alignment vs. note-based melodic similarity for singing voice assessment", 《2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING》, 21 October 2013 (2013-10-21) *
肖雨佳: "基于音段和韵律分析的发音质量评测研究", 《中国优秀硕士学位论文全文数据库》, 15 December 2018 (2018-12-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115148224A (en) * 2021-03-30 2022-10-04 北京猿力未来科技有限公司 Intonation scoring method, intonation scoring system, computing device and storage medium

Similar Documents

Publication Publication Date Title
US10586533B2 (en) Method and device for recognizing speech based on Chinese-English mixed dictionary
US11308938B2 (en) Synthesizing speech recognition training data
Halberstadt Heterogeneous acoustic measurements and multiple classifiers for speech recognition
EP3832644B1 (en) Neural speech-to-meaning translation
US20080059190A1 (en) Speech unit selection using HMM acoustic models
US20140350934A1 (en) Systems and Methods for Voice Identification
CN111402862B (en) Speech recognition method, device, storage medium and equipment
EP2306345A2 (en) Speech retrieval apparatus and speech retrieval method
JP5007401B2 (en) Pronunciation rating device and program
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
Yuan et al. Using forced alignment for phonetics research
CN110223674B (en) Speech corpus training method, device, computer equipment and storage medium
Mary et al. Searching speech databases: features, techniques and evaluation measures
CN114360514A (en) Speech recognition method, apparatus, device, medium, and product
CN115148225A (en) Intonation scoring method, intonation scoring system, computing device and storage medium
Jarifi et al. A fusion approach for automatic speech segmentation of large corpora with application to speech synthesis
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
Prahallad Automatic building of synthetic voices from audio books
Moungsri et al. Unsupervised Stress Information Labeling Using Gaussian Process Latent Variable Model for Statistical Speech Synthesis.
CN115148224A (en) Intonation scoring method, intonation scoring system, computing device and storage medium
Nanmalar et al. Literary and Colloquial Tamil Dialect Identification
US20230037541A1 (en) Method and system for synthesizing speeches by scoring speeches
Magnotta Analysis of Two Acoustic Models on Forced Alignment of African American English
Moungsri et al. Tone modeling using Gaussian process latent variable model for statistical speech synthesis
CN114566147A (en) Speech evaluation method, computer device, storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination