CN110782918A - Voice rhythm evaluation method and device based on artificial intelligence - Google Patents


Info

Publication number: CN110782918A (granted as CN110782918B)
Application number: CN201910969890.XA
Authority: CN (China)
Prior art keywords: pronunciation, standard, rhythm, prosody, evaluated
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 林炳怀, 王丽园
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN201910969890.XA

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing


Abstract

The invention provides an artificial intelligence based speech prosody evaluation method, apparatus, electronic device, and storage medium. The method comprises the following steps: receiving voice data to be evaluated and text data corresponding to the voice data to be evaluated; determining a prosodic standard of pronunciation corresponding to the text data; performing prosody detection processing on the voice data to be evaluated to obtain pronunciation characteristics and rhythm characteristics of the voice data to be evaluated; comparing the pronunciation characteristics with the standard pronunciation characteristics in the prosodic standard to obtain a pronunciation characteristic evaluation result, and comparing the rhythm characteristics with the standard rhythm characteristics in the prosodic standard to obtain a rhythm characteristic evaluation result; and performing, through a decision tree model, evaluation processing based on the pronunciation characteristic evaluation result and the rhythm characteristic evaluation result to obtain a prosody score for the voice data to be evaluated. With the method and apparatus, an accurate prosody score for the voice data can be obtained.

Description

Voice rhythm evaluation method and device based on artificial intelligence
Technical Field
The present invention relates to artificial intelligence voice processing technologies, and in particular, to an artificial intelligence based voice prosody assessment method, apparatus, electronic device, and storage medium.
Background
Artificial Intelligence (AI) is a comprehensive branch of computer science that studies the design principles and implementation methods of intelligent machines, giving machines the capabilities of perception, reasoning, and decision making. AI is a comprehensive discipline that spans a wide range of fields, for example natural language processing and machine learning/deep learning; as these technologies develop, AI is being applied in ever more fields and playing an increasingly important role.
Speech prosody evaluation is an important application field of artificial intelligence, and aims to evaluate the prosody of the pronunciation in input voice data. Speech prosody evaluation techniques in the related art generally either compare a speaker's pronunciation with a standard pronunciation and use the resulting similarity as the final prosody score, or extract salient prosodic features and derive a prosody score from rules or by fitting expert prosody scores.
However, the speech prosody evaluation techniques of the related art rely on a large amount of standard audio and are unsuitable for scenarios such as free speaking. In addition, the prosody scores they output correlate poorly with expert scores.
Disclosure of Invention
The embodiment of the invention provides a voice prosody evaluation method and device based on artificial intelligence, electronic equipment and a storage medium, which can obtain accurate prosody scoring of voice data.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a voice prosody evaluation method based on artificial intelligence, which comprises the following steps:
receiving voice data to be evaluated and text data corresponding to the voice data to be evaluated;
determining a prosodic standard of pronunciation corresponding to the text data;
performing prosody detection processing on the voice data to be evaluated to obtain pronunciation characteristics and rhythm characteristics of the voice data to be evaluated;
comparing the pronunciation characteristics with the standard pronunciation characteristics in the prosodic standard to obtain a pronunciation characteristic evaluation result, and
comparing the rhythm characteristics with the standard rhythm characteristics in the prosodic standard to obtain a rhythm characteristic evaluation result;
and performing evaluation processing based on the pronunciation characteristic evaluation result and the rhythm characteristic evaluation result through a decision tree model to obtain the rhythm score of the voice data to be evaluated.
The embodiment of the invention provides a voice rhythm evaluation device based on artificial intelligence, which comprises:
the receiving module is used for receiving voice data to be evaluated and text data corresponding to the voice data to be evaluated;
the prosody generation module is used for generating a prosody standard of pronunciation corresponding to the text data;
the rhythm detection module is used for carrying out rhythm detection processing on the voice data to be evaluated to obtain pronunciation characteristics and rhythm characteristics of the voice data to be evaluated;
a prosody evaluation module for comparing the pronunciation characteristics with the standard pronunciation characteristics in the prosodic standard to obtain a pronunciation characteristic evaluation result, and
for comparing the rhythm characteristics with the standard rhythm characteristics in the prosodic standard to obtain a rhythm characteristic evaluation result;
the prosody evaluation module is further configured to perform evaluation processing based on the pronunciation feature evaluation result and the rhythm feature evaluation result to obtain a prosody score of the voice data to be evaluated.
In the above scheme, the prosody generation module is further configured to generate the standard pronunciation features and standard rhythm features corresponding to the text data;
wherein the standard pronunciation features include a standard stress position, a standard pause position, and a standard boundary tone type.
In the above scheme, the prosody evaluation module is further configured to compare the stress positions included in the pronunciation features with the standard stress positions included in the standard pronunciation features to obtain a stress error rate of the pronunciation features;
compare the pause positions included in the pronunciation features with the standard pause positions included in the standard pronunciation features to obtain a pause error rate of the pronunciation features;
and compare the boundary tone types included in the pronunciation features with the standard boundary tone types included in the standard pronunciation features to obtain a boundary tone type error rate of the pronunciation features.
In the above scheme, the prosody evaluation module is further configured to determine the duration difference coefficient between the stressed syllables of the voice data to be evaluated, and determine the duration difference coefficient between the stressed syllables in the prosodic standard of pronunciation corresponding to the text data;
and to normalize the duration difference coefficient between the stressed syllables of the voice data to be evaluated based on the duration difference coefficient between the stressed syllables in the prosodic standard, determining the normalized duration difference coefficient as the rhythm feature evaluation result.
In the above scheme, the prosody evaluation module is further configured to determine the time differences between adjacent stressed syllables in the voice data to be evaluated;
determine the standard deviation and the average of the time differences;
and determine the quotient of the standard deviation and the average as the duration difference coefficient between the stressed syllables of the voice data to be evaluated.
In the above scheme, the prosody evaluation module is further configured to determine the standard deviation and the average of the duration difference coefficients between the stressed syllables in the prosodic standard;
and normalize the duration difference coefficient between the stressed syllables of the voice data to be evaluated based on that standard deviation and average.
In the above scheme, the prosody evaluation module is further configured to score the pronunciation feature evaluation result and the rhythm feature evaluation result through the nodes of a decision tree model, and to weight the resulting scores according to the nodes' weights to obtain the prosody score of the voice data to be evaluated.
In the above scheme, the prosody generation module is further configured to label the corresponding position of the text data according to the pronunciation feature evaluation result, and return the labeled text data to the user terminal.
In the above scheme, the prosody evaluation module is further configured to obtain a voice data sample and a corresponding prosody score, and perform prosody detection processing on the voice data sample to obtain a corresponding pronunciation feature and a corresponding rhythm feature;
selecting features with classification capability from pronunciation features and rhythm features of the voice data samples as nodes to construct an initial decision tree model;
and pruning the constructed initial decision tree model to obtain the decision tree model for prosody scoring.
The embodiment of the invention provides a voice rhythm evaluation device based on artificial intelligence, which comprises:
a memory for storing executable instructions;
and the processor is used for realizing the artificial intelligence-based voice prosody evaluation method provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute the executable instructions so as to realize the artificial intelligence-based voice prosody evaluation method provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a voice prosody evaluation method based on artificial intelligence, which has no limitation on voice data to be evaluated, and a user can freely input any text data and read the text data, so that the application range of the voice prosody evaluation is expanded. In addition, the prosody of the voice data to be evaluated is evaluated by comprehensively considering the pronunciation characteristic evaluation result and the rhythm characteristic evaluation result, so that accurate prosody score of the voice data to be evaluated can be obtained.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of an artificial intelligence based speech prosody evaluation system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative structure of an artificial intelligence-based apparatus for evaluating speech prosody provided by an embodiment of the present invention;
FIG. 3 is a schematic flow chart of an alternative method for artificial intelligence based prosody assessment of speech according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of an alternative method for artificial intelligence based prosody assessment of speech according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a decision tree model for prosody evaluation of speech according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an exemplary application scenario of a speech prosody assessment method based on artificial intelligence according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating an exemplary application scenario of a speech prosody assessment method based on artificial intelligence according to an embodiment of the present invention;
FIG. 8 is a schematic flow chart of an alternative method for artificial intelligence based prosody assessment of speech according to an embodiment of the present invention;
fig. 9 shows a histogram of the distribution of duration differences between stressed syllables for a second-language (L2) speaker and the corresponding histogram for a native speaker, according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.
1) Stress: the words pronounced with emphasis in a sentence, as opposed to the words pronounced lightly.
2) Pause: a break between intonation phrases within a sentence.
3) Boundary tone: the pitch contour at the end of a sentence, i.e., the pitch movement from the last stressed syllable to the end of the sentence, classified as rising, falling, and so on.
In implementing the embodiments of the present invention, the inventors found that, when evaluating speech prosody, the related art generally compares a speaker's pronunciation with a standard pronunciation and uses the resulting similarity as the final prosody score. For example, based on reference audio from many native speakers and the pronunciation of a second-language (L2) speaker, the similarity of the fundamental frequency F0 and of the durations can be calculated; the similarity distribution is modeled with a Gaussian Mixture Model (GMM) and fed into a Deep Neural Network (DNN) that performs binary classification between native and L2 speech, and the closer the classification result is to a native speaker, the better the pronunciation rhythm. Alternatively, a training network is built with two parts, one predicting sentence stress and the other detecting sentence stress; the predicted result serves as the label and is compared with the detected sentence stress, and the resulting similarity is used as the final score. Another approach clusters the F0 curves of native speakers, computes the distance between the F0 curve of the L2 speaker's actual pronunciation and the cluster centers of the native F0 curves, and takes the minimum distance as the similarity; the same similarity calculation is performed on the energy curve, and finally these features, together with normalized duration features, are combined to fit expert scores. However, the above techniques rely on a large amount of standard audio, require that words be uniformly distributed, place high demands on training data, and are not suitable for scenarios such as free speaking, which greatly limits the application range of speech prosody evaluation.
In addition, the related art also derives prosody scores by extracting salient prosodic features and applying rules or fitting expert prosody scores. For example, audio features may be extracted, the audio energy calculated, and the stress distribution within sentences obtained by thresholding. Based on the stress distribution, the duration differences between stressed syllables are calculated by an improved method, and the L2 speaker's sense of pronunciation rhythm is judged from those differences. However, this method does not consider enough factors, and the final prosody score correlates poorly with expert scores.
In this regard, by considering more features relevant to prosody evaluation, such as pronunciation features and rhythm features, the embodiments of the present invention fit the experts' scoring of L2 speakers' pronunciation rhythm, thereby obtaining a machine prosody score that correlates more closely with the experts'. Specifically, voice data to be evaluated and the corresponding text data can be received; a prosodic standard of pronunciation corresponding to the text data is determined; prosody detection processing is performed on the voice data to be evaluated to obtain its pronunciation features and rhythm features; the pronunciation features are compared with the standard pronunciation features in the prosodic standard to obtain a pronunciation feature evaluation result, and the rhythm features are compared with the standard rhythm features in the prosodic standard to obtain a rhythm feature evaluation result; and evaluation processing based on the two evaluation results is performed through a decision tree model to obtain the prosody score of the voice data to be evaluated.
In view of this, embodiments of the present invention provide a speech prosody assessment method and apparatus based on artificial intelligence, an electronic device, and a storage medium, which are capable of obtaining a machine prosody score with high correlation with experts.
The following describes an exemplary application of the artificial intelligence based voice prosody assessment device provided by the embodiment of the present invention, and the artificial intelligence based voice prosody assessment device provided by the embodiment of the present invention may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), may also be implemented as a server or a server cluster, and may also be implemented in a manner that the user terminal and the server cooperate with each other. In the following, an exemplary application will be explained when the electronic device is implemented as a server.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of a speech prosody evaluation system 100 based on artificial intelligence according to an embodiment of the present invention, in which a user terminal 400 (illustratively, a user terminal 400-1 and a user terminal 400-2) is connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
As shown in fig. 1, a user first opens the speech prosody evaluation client 410 on the user terminal 400, inputs a sentence or a passage to be read aloud, and reads it according to the text. The speech prosody evaluation client 410 then transmits the text data input by the user and the collected voice data to the server 200 through the network 300. After receiving the text data and the voice data to be evaluated sent by the speech prosody evaluation client 410, the server 200 generates a prosodic standard of pronunciation corresponding to the received text data, and performs prosody detection processing on the received voice data to be evaluated to obtain its pronunciation features and rhythm features. The pronunciation features are then compared with the standard pronunciation features in the prosodic standard to obtain a pronunciation feature evaluation result, and the rhythm features are compared with the standard rhythm features in the prosodic standard to obtain a rhythm feature evaluation result. Evaluation processing based on the two results is then performed through a decision tree model to obtain a prosody score for the voice data to be evaluated, and the score is returned to the speech prosody evaluation client 410, realizing the evaluation of the user's pronunciation prosody.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present invention, taking the case in which the artificial intelligence based speech prosody evaluation apparatus is implemented as the server 200. The server 200 shown in fig. 2 includes: at least one processor 210, memory 250, at least one network interface 220, and a user interface 230. The various components in the server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 240 in fig. 2.
The processor 210 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 250 described in embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
a presentation module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
an input processing module 254 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.
In some embodiments, the artificial intelligence based speech prosody assessment device provided by the embodiment of the present invention can be implemented in software, and fig. 2 shows an artificial intelligence based speech prosody assessment device 255 stored in a memory 250, which can be software in the form of programs and plug-ins, and the like, and includes the following software modules: the receiving module 2551, the prosody generating module 2552, the prosody detecting module 2553 and the prosody evaluating module 2554 are logical and thus may be arbitrarily combined or further divided according to the implemented functions. The functions of the respective modules will be explained below.
In other embodiments, the artificial intelligence based speech prosody evaluation device provided by the embodiments of the present invention may be implemented in hardware; for example, it may be a processor in the form of a hardware decoding processor programmed to perform the artificial intelligence based speech prosody evaluation method provided by the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic elements.
The following describes an artificial intelligence based voice prosody assessment method provided by the embodiment of the present invention in conjunction with an exemplary application of the artificial intelligence based voice prosody assessment apparatus provided by the embodiment of the present invention when implemented as a server.
Referring to fig. 3, fig. 3 is an alternative flow chart of the artificial intelligence based speech prosody evaluation method according to the embodiment of the present invention, which will be described with reference to the steps shown in fig. 3.
In step S301, the client acquires voice data to be evaluated, and receives text data corresponding to the voice data to be evaluated.
In step S302, the server receives the voice data to be evaluated and the text data corresponding to the voice data to be evaluated.
In some embodiments, the voice data to be evaluated may be voice data of a user speaking freely or voice data of a user reading with respect to standard reference audio.
For example, a user may enter any sentence or piece of text to be read aloud in the speech prosody evaluation application, such as "I know the fact, do you know?", then click the start-reading button and read according to the input text, and click the end-reading button when the reading is finished. The speech prosody evaluation application then sends the text data "I know the fact, do you know?" and the collected voice data to the server.
For example, the voice prosody evaluation application may also provide some reference text data, and the user reads according to the reference text data provided by the voice prosody evaluation application, and then sends the reference text data and the voice data recorded by reading the user to the server.
In step S303, the server determines a prosodic standard of pronunciation corresponding to the text data.
Here, the prosodic standard includes standard pronunciation features and standard rhythm features, wherein the standard pronunciation features include a standard stress position, a standard pause position, and a standard boundary tone type.
In some embodiments, after receiving the text data sent by the speech prosody evaluation application, the server may analyze the structure and characteristics of the received text data through rules and natural language processing techniques, and generate prosody labels for stress, pauses, boundary tones, and the like.
For example, for sentence stress position prediction, a sentence stress sequence labeling model may be built from a large number of text data samples and stress labels for those samples derived from the pronunciations of many native speakers. Because whether a word in a sentence is stressed is correlated to some extent with the words in its context, the context must be considered when building the model; likewise, whether a given word is stressed also influences the stress prediction for adjacent words. Therefore, a bidirectional Long Short-Term Memory network (LSTM) can be used to model the stress context, and a Conditional Random Field (CRF) layer is added to model the local dependencies between the stress labels in the sequence. The words in the text data samples are represented as character vectors and word vectors, which serve as the network input for training; the network extracts effective text features, which are fed into the sequence labeling model for training. The trained sentence stress sequence labeling model can then be used to label the standard stress positions for the received text data.
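To make this sequence-labeling setup concrete, the following is a minimal PyTorch sketch of such a stress tagger. It is an illustrative assumption, not the patent's implementation: the hyperparameters are invented, the character-vector inputs are omitted for brevity, and the CRF layer comes from the third-party pytorch-crf package (`torchcrf`).

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package, assumed installed

class StressTagger(nn.Module):
    """BiLSTM-CRF sketch for sentence-stress sequence labeling.
    Input: word-index sequences; output: per-word tags (0 = unstressed, 1 = stressed)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_tags=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The bidirectional LSTM builds left and right context for every word.
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(hidden_dim, num_tags)
        # The CRF models local dependencies between adjacent stress labels.
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens, tags, mask):
        emissions = self.emit(self.lstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def predict(self, tokens, mask):
        emissions = self.emit(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # best tag sequence per sentence
```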
For example, for sentence pause position prediction, text pause judgment may be combined with constituency parsing: the received text data is parsed into a constituency parse tree by a constituency parsing algorithm. Based on the parse tree, the node distance between two adjacent leaf nodes is calculated, yielding effective features for judging intra-sentence pauses. Using these features together with part-of-speech features of the words (noun, verb, and other attributes), a classifier for judging whether there is a pause at each word position in the sentence is trained on a large amount of native-speaker pronunciation and pause annotation data. The trained classifier then labels the standard pause positions for the received text data.
For example, for boundary tone type prediction, the boundary tone type of a sentence is mainly determined by the sentence pattern: a wh-question generally ends with a falling boundary tone, a yes-no question generally ends with a rising boundary tone, an ordinary declarative sentence ends with a falling boundary tone, and so on. Therefore, the sentence pattern of the input text can be identified through keyword matching, and the standard boundary tone type for the received text data is output according to the rules.
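A rule set of this kind reduces to a few lines of code. The sketch below is a hedged illustration: the keyword list and the treatment of sentence patterns are assumptions, not the patent's actual rules.

```python
WH_WORDS = {"what", "who", "whom", "whose", "which", "when", "where", "why", "how"}

def predict_boundary_tone(sentence: str) -> str:
    """Predict the sentence-final boundary tone from the sentence pattern."""
    text = sentence.strip()
    words = text.rstrip("?!.").lower().split()
    if text.endswith("?"):
        # Wh-questions typically fall; yes-no questions typically rise.
        return "falling" if words and words[0] in WH_WORDS else "rising"
    return "falling"  # ordinary declaratives typically fall

print(predict_boundary_tone("Do you know?"))      # rising
print(predict_boundary_tone("Where did he go?"))  # falling
print(predict_boundary_tone("I know the fact."))  # falling
```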
For example, suppose the text data received by the server is "I know the fact, do you know?". A corresponding standard pronunciation feature is then: the words know, fact and know should be stressed; a pause is needed after fact; the boundary tone type at fact is falling, and the boundary tone type at the sentence-final know is rising.
It should be noted that a given utterance may have more than one valid set of standard pronunciation features; therefore, the embodiments of the present invention are not limited to evaluating the voice data to be evaluated against the pronunciation above, and other pronunciation patterns may also be considered in the evaluation.
In some embodiments, the standard rhythm feature may be characterized by the duration difference coefficient between stressed syllables in the prosodic standard of pronunciation corresponding to the text data.
For example, the time differences between adjacent stressed syllables in the prosodic standard are determined first; then the standard deviation and the average of those time differences are determined, and the quotient of the standard deviation and the average is taken as the duration difference coefficient between the stressed syllables in the prosodic standard.
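This coefficient is simply the coefficient of variation of the intervals between adjacent stressed syllables, which can be sketched as follows (the function name and the use of onset times in seconds are illustrative assumptions):

```python
import numpy as np

def duration_difference_coefficient(stress_onsets):
    """std/mean of the intervals between adjacent stressed syllables.
    stress_onsets: onset times of the stressed syllables, in seconds, in order."""
    intervals = np.diff(np.asarray(stress_onsets, dtype=float))
    return intervals.std() / intervals.mean()

# Stressed syllables at 0.2 s, 0.9 s, 1.5 s and 2.3 s:
print(duration_difference_coefficient([0.2, 0.9, 1.5, 2.3]))  # ~0.12
```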
In step S304, the server performs prosody detection processing on the voice data to be evaluated to obtain pronunciation characteristics and rhythm characteristics of the voice data to be evaluated.
Here, the pronunciation features include the actual stress positions, actual pause positions, and actual boundary tone types in the voice data to be evaluated.
In some embodiments, the server may align the received text data with the received voice data to be evaluated, and determine the actual stress positions, actual pause positions, and actual boundary tone types in the voice data to be evaluated from the alignment result.
For example, for stress position detection, features may be extracted at the syllable level. Whether a syllable is stressed is mainly related to its pitch, intensity, pitch change, intensity change, duration, and so on. Thus, the relevant features of each syllable can be extracted: maximum pitch, minimum pitch, maximum intensity, minimum intensity, average intensity, the rise and fall amplitude of intensity, the rise and fall amplitude of pitch, syllable duration, etc. Since pitch and intensity ranges differ across users, these features are normalized. In addition, because whether a syllable is stressed also depends on the other syllables of the word it belongs to, the features of those other syllables are compared with the syllable's own features and used as stress features. Based on these factors, and combining the features of the previous and the next word of the current word, the actual stress positions in the voice data to be evaluated are detected.
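The kind of syllable-level feature extraction described here can be sketched with librosa. This is a hedged illustration, assuming syllable boundaries are already available (e.g., from a forced aligner) and using illustrative frame settings:

```python
import numpy as np
import librosa

def syllable_features(wav_path, syllables):
    """Per-syllable pitch/intensity/duration statistics.
    syllables: list of (start_s, end_s) boundaries from a forced aligner (assumed)."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = 160  # 10 ms frames at 16 kHz
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=1000.0, sr=sr, hop_length=hop)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    feats = []
    for start, end in syllables:
        a, b = int(start * sr / hop), int(end * sr / hop)
        pitch = f0[a:b]
        pitch = pitch[~np.isnan(pitch)]  # keep voiced frames only
        energy = rms[a:b]
        feats.append({
            "max_pitch": float(pitch.max()) if pitch.size else 0.0,
            "min_pitch": float(pitch.min()) if pitch.size else 0.0,
            "max_intensity": float(energy.max()) if energy.size else 0.0,
            "mean_intensity": float(energy.mean()) if energy.size else 0.0,
            "duration": end - start,
        })
    return feats  # these features would then be z-normalized per speaker
```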
For example, for pause position detection, intra-sentence pauses are mainly related to the silence duration after a word; since different users speak at different speeds, the silence duration is first normalized for speech rate. In addition, a word pronounced with high energy is often followed by a pause, so the pitch and intensity features of the word, and statistics of these features, are included as features for detecting sentence pauses. Furthermore, an abrupt change in pitch or intensity from one word to the next also often marks a pause, so the change in pitch and intensity between adjacent words can be calculated as a further feature. Based on these factors, and combining the features of the previous and the next word of the current word, the actual pause positions in the voice data to be evaluated are detected.
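As a minimal illustration of the silence-duration part of this detection, the sketch below finds silent stretches in a frame-level energy curve; the threshold and minimum pause duration are invented values, and the speech-rate normalization and pitch/intensity features described above are omitted:

```python
def detect_pauses(rms, hop_s=0.01, min_pause_s=0.15, threshold=0.02):
    """Return (start_s, end_s) of stretches where the RMS energy stays below
    threshold for at least min_pause_s; candidates for intra-sentence pauses."""
    pauses, start = [], None
    for i, e in enumerate(rms):
        if e < threshold and start is None:
            start = i
        elif e >= threshold and start is not None:
            if (i - start) * hop_s >= min_pause_s:
                pauses.append((start * hop_s, i * hop_s))
            start = None
    if start is not None and (len(rms) - start) * hop_s >= min_pause_s:
        pauses.append((start * hop_s, len(rms) * hop_s))
    return pauses  # compared against word end times to locate pause positions

print(detect_pauses([0.5] + [0.01] * 15 + [0.4], hop_s=0.01))  # [(0.01, 0.16)]
```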
For example, for boundary tone type detection, different users realize boundary tones inconsistently: some on the last word, some on the last syllable of the last word, and some on the stressed syllable of the last word. Boundary tone detection can therefore combine features at the phoneme, syllable, and word levels; for instance, the features of the stressed syllables, the pronunciation features of the unstressed syllables, and pronunciation features at the phoneme and word levels can be extracted. Based on these factors, and combining the features of the previous and the next word of the current word, the actual boundary tone type in the voice data to be evaluated is detected.
It should be noted that steps S303 and S304 have no fixed order; they may be executed one after another in either order, or simultaneously.
In step S305, the server compares the pronunciation feature with a standard pronunciation feature corresponding to the prosodic standard to obtain a pronunciation feature evaluation result, and compares the rhythm feature with a standard rhythm feature corresponding to the prosodic standard to obtain a rhythm feature evaluation result.
Referring to fig. 4, fig. 4 is an optional flowchart of the artificial intelligence based speech prosody evaluation method according to the embodiment of the present invention, and in some embodiments, step S305 shown in fig. 3 may be implemented by steps S3051 to S3054 shown in fig. 4, which will be described in conjunction with the steps.
In step S3051, the server compares the stress positions included in the pronunciation features with the standard stress positions included in the standard pronunciation features to obtain the stress error rate of the pronunciation features.
In some embodiments, the server compares the standard stress positions for the text data obtained in step S303 with the actual stress positions obtained in step S304, thereby obtaining the stress error rate of the pronunciation features of the voice data to be evaluated.
For example, suppose the text data input by the user is "I know the fact, do you know?". The standard stress positions labeled in step S303 are know, fact and know, while only know and fact are detected as stressed in the user's voice data in step S304; the user's stress error rate for the pronunciation features of this text data is therefore 33.3%.
In step S3052, the server compares the pause positions included in the pronunciation features with the standard pause positions included in the standard pronunciation features to obtain the pause error rate of the pronunciation features.
In some embodiments, the server compares the standard pause positions for the text data obtained in step S303 with the actual pause positions obtained in step S304, thereby obtaining the pause error rate of the pronunciation features of the voice data to be evaluated.
For example, suppose the text data input by the user is "I know the fact, do you know?". The standard pause position labeled in step S303 is after the word fact, and the pause position detected in the user's voice data in step S304 is also after fact; the user's pause error rate for the pronunciation features of this text data is therefore 0%.
In step S3053, the server compares the boundary tone types included in the pronunciation features with the standard boundary tone types included in the standard pronunciation features to obtain the boundary tone type error rate of the pronunciation features.
In some embodiments, the server compares the standard boundary tone types for the text data obtained in step S303 with the actual boundary tone types obtained in step S304, thereby obtaining the boundary tone type error rate of the pronunciation features of the voice data to be evaluated.
For example, suppose the text data input by the user is "I know the fact, do you know?". The standard boundary tone types labeled in step S303 are falling at the word fact and rising at the sentence-final know, whereas the detection in step S304 finds a falling boundary tone at fact but also a falling boundary tone at the sentence-final know; the user's boundary tone type error rate for the pronunciation features of this text data is therefore 50%.
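Each of steps S3051-S3053 reduces to comparing a detected set of positions against a standard set. One plausible error-rate definition that reproduces the 33.3% of the stress example above is sketched below; the exact formula is an assumption, since the patent does not spell it out:

```python
def error_rate(standard, detected):
    """Mismatched positions (missed or spurious) over the number of standard positions."""
    std, det = set(standard), set(detected)
    return len(std ^ det) / len(std) if std else 0.0

# "I know the fact, do you know?" -- stressed word indices 1, 3 and 6.
print(round(error_rate({1, 3, 6}, {1, 3}), 3))  # 0.333: one stress of three missed
```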
In step S3054, the server determines the duration difference coefficient between the stressed syllables of the voice data to be evaluated, and determines the duration difference coefficient between the stressed syllables in the prosodic standard of pronunciation corresponding to the text data; it then normalizes the former based on the latter and determines the normalized duration difference coefficient as the rhythm feature evaluation result.
In some embodiments, the server first determines the time differences between adjacent stressed syllables in the prosodic standard, calculates the standard deviation and the average of those time differences, and takes their quotient as the duration difference coefficient between the stressed syllables in the prosodic standard. Likewise, the server determines the time differences between adjacent stressed syllables in the voice data to be evaluated, calculates the standard deviation and the average of those differences, and takes their quotient as the duration difference coefficient between the stressed syllables of the voice data to be evaluated. Having determined the duration difference coefficients in the prosodic standard, the server calculates their standard deviation and average, uses them to normalize the duration difference coefficient of the voice data to be evaluated, and takes the normalized coefficient as the rhythm feature evaluation result of the voice data to be evaluated.
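The normalization in this step can be read as a z-score of the learner's coefficient against the distribution of coefficients in the prosodic standard; a minimal sketch under that reading (which is an assumption):

```python
import numpy as np

def rhythm_feature_evaluation(learner_coeff, standard_coeffs):
    """Z-normalize the learner's duration difference coefficient against the
    mean and standard deviation of the coefficients in the prosodic standard."""
    ref = np.asarray(standard_coeffs, dtype=float)
    return (learner_coeff - ref.mean()) / ref.std()

# Learner coefficient 0.45 vs. native coefficients clustered around 0.30:
print(rhythm_feature_evaluation(0.45, [0.25, 0.30, 0.35, 0.28, 0.32]))
```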
In step S306, the server performs evaluation processing based on the pronunciation feature evaluation result and the rhythm feature evaluation result through a decision tree model, so as to obtain a prosody score of the to-be-evaluated voice data.
Here, the evaluation processing based on the pronunciation feature evaluation result and the rhythm feature evaluation result through a decision tree model may use a single decision tree directly, or an ensemble learning model composed of decision trees, for example a random forest model.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a decision tree model for speech prosody evaluation according to an embodiment of the present invention. Taking a single decision tree as an example, as shown in fig. 5, the node types involved include: the root node, which represents the whole set of voice data samples and may be divided into two or more subsets; decision nodes, i.e., nodes that are further split into several child nodes; and leaf (terminal) nodes, i.e., nodes that cannot be split further.
Constructing (training) a decision tree involves three stages: feature selection, decision tree generation, and decision tree pruning, which are described separately below.
In the feature selection stage, one feature is selected from the many features of the voice data sample set as the splitting criterion of the current node. Different quantitative criteria for choosing the feature yield different decision trees: for example, the ID3 (Iterative Dichotomiser 3) algorithm selects features by information gain, the C4.5 algorithm by information gain ratio, and the Classification and Regression Tree (CART) algorithm by the Gini index. After the data set is partitioned on a feature, each data subset is purer (i.e., less uncertain) than the data set D before the partition.
In the decision tree generation stage, child nodes are generated recursively from top to bottom according to the selected feature evaluation criterion, and the tree stops growing when the data set can no longer be split. This process keeps partitioning the data set, using the features that satisfy the splitting criterion, into purer, less uncertain subsets: each partition of the current data set should make the resulting subsets purer and less uncertain.
In the decision tree pruning stage, to counter the tendency of decision trees to overfit, redundant nodes are pruned from the tree structure to reduce the size of the tree and alleviate overfitting.
In some embodiments, prosody evaluation is performed using an ensemble learning model that includes a plurality of decision trees, such as a random forest model. Each decision tree in the random forest independently produces a prosody score, and the final score is determined by combining the scores of the trees in a voting-based manner; for example, with equal-weight voting, each decision tree has the same weight and the weighted sum of the trees' scores is taken as the final score.
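The ensemble variant can be sketched with scikit-learn's RandomForestRegressor, whose prediction averages the trees exactly as the equal-weight voting described above. The feature layout, training data, and hyperparameters below are invented placeholders for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Columns: stress error rate, pause error rate, boundary tone error rate,
# normalized rhythm coefficient (all made-up training values).
X_train = np.array([
    [0.00, 0.0, 0.0, 0.2],
    [0.33, 0.0, 0.5, 0.9],
    [0.66, 0.5, 1.0, 2.1],
])
y_train = np.array([95.0, 75.0, 40.0])  # expert prosody scores on a 0-100 scale

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Machine prosody score: the average of the individual trees' scores.
print(forest.predict([[0.33, 0.0, 0.5, 0.8]]))
```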
In step S307, the server returns the prosody score to the client for display.
Here, the prosody score may be given on a 100-point scale, a 5-point scale, or any other granularity, or expressed as a descriptive label such as good, fair, or poor.
Continuing with the exemplary structure in which the artificial intelligence based speech prosody evaluation device 255 provided by the embodiment of the present invention is implemented as a software module, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based speech prosody evaluation device 255 of the memory 250 may include: a receiving module 2551, a prosody generating module 2552, a prosody detecting module 2553, and a prosody evaluating module 2554.
The receiving module 2551 is configured to receive voice data to be evaluated and text data corresponding to the voice data to be evaluated;
the prosody generating module 2552 is configured to generate a prosody standard of a pronunciation corresponding to the text data;
the prosody detection module 2553 is configured to perform prosody detection processing on the voice data to be evaluated to obtain a pronunciation feature and a rhythm feature of the voice data to be evaluated;
the prosody evaluation module 2554 is configured to compare the pronunciation feature with a standard pronunciation feature corresponding to the prosody standard to obtain a pronunciation feature evaluation result, and
the rhythm characteristic is used for comparing with a standard rhythm characteristic corresponding to the rhythm standard to obtain a rhythm characteristic evaluation result;
the prosody evaluating module 2554 is further configured to perform evaluation processing based on the pronunciation feature evaluation result and the rhythm feature evaluation result, so as to obtain a prosody score of the to-be-evaluated voice data.
In some embodiments, the prosody generation module 2552 is further configured to generate the standard pronunciation features and standard rhythm features corresponding to the text data;
wherein the standard pronunciation features include a standard stress position, a standard pause position, and a standard boundary tone type.
In some embodiments, the prosody evaluation module 2554 is further configured to compare the stress positions included in the pronunciation features with the standard stress positions included in the standard pronunciation features to obtain the stress error rate of the pronunciation features;
compare the pause positions included in the pronunciation features with the standard pause positions included in the standard pronunciation features to obtain the pause error rate of the pronunciation features;
and compare the boundary tone types included in the pronunciation features with the standard boundary tone types included in the standard pronunciation features to obtain the boundary tone type error rate of the pronunciation features.
In some embodiments, the prosody evaluation module 2554 is further configured to determine the duration difference coefficient between the stressed syllables of the voice data to be evaluated, and determine the duration difference coefficient between the stressed syllables in the prosodic standard of pronunciation corresponding to the text data;
and to normalize the duration difference coefficient between the stressed syllables of the voice data to be evaluated based on the duration difference coefficient between the stressed syllables in the prosodic standard, determining the normalized coefficient as the rhythm feature evaluation result.
In some embodiments, the prosody evaluation module 2554 is further configured to determine the time differences between adjacent stressed syllables in the voice data to be evaluated;
determine the standard deviation and the average of the time differences;
and determine the quotient of the standard deviation and the average as the duration difference coefficient between the stressed syllables of the voice data to be evaluated.
In some embodiments, the prosody evaluation module 2554 is further configured to determine the standard deviation and the average of the duration difference coefficients between the stressed syllables in the prosodic standard;
and normalize the duration difference coefficient between the stressed syllables of the voice data to be evaluated based on that standard deviation and average.
In some embodiments, the prosody evaluation module 2554 is further configured to score the pronunciation feature evaluation result and the rhythm feature evaluation result through a node in a decision tree model, and perform weighting processing on the obtained score according to a weight of the node to obtain a prosody score of the voice data to be evaluated.
In some embodiments, the prosody generation module 2552 is further configured to label the corresponding position of the text data according to the pronunciation feature evaluation result, and return the labeled text data to the user terminal.
In some embodiments, the prosody evaluating module 2554 is further configured to obtain a voice data sample and a corresponding prosody score, and perform prosody detection on the voice data sample to obtain a corresponding pronunciation feature and a corresponding rhythm feature;
selecting features with classification capability from pronunciation features and rhythm features of the voice data samples as nodes to construct an initial decision tree model;
and pruning the constructed initial decision tree model to obtain the decision tree model for prosody scoring.
It should be noted that the description of the apparatus in the embodiments of the present application is similar to the description of the method embodiments and has similar beneficial effects, and is therefore not repeated. Technical details of the artificial intelligence based speech prosody evaluation device not exhausted here can be understood from the description of any of figs. 3-5 and 8.
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
The related art generally performs prosody evaluation by comparing the similarity between a reference audio and an actual utterance when performing prosody evaluation of speech. However, this approach relies on a large amount of standard audio, which is not suitable for some scenarios such as free speech, limiting the applicability of the speech prosody assessment technique. In addition, the related art also provides a technical scheme for evaluating the prosody of the speaker by integrating prosody-related features, but the consideration is not comprehensive enough, and the relevance of the machine prosody score and an expert is low.
In the embodiment of the invention, by studying the factors that influence prosody evaluation and the differences in pronunciation rhythm between L2 (second-language) speakers and native speakers, several effective features for prosody evaluation are extracted, such as the re-reading (stress) error rate, pause error rate, and boundary tone error rate, together with features that capture the speaker's sense of rhythm. These factors are considered together to fit expert scores of L2 speakers' pronunciation rhythm, yielding a machine prosody score that correlates highly with expert scores.
Referring to figs. 6 and 7, they are schematic diagrams of a specific application scenario of the artificial intelligence based speech prosody evaluation method according to the embodiment of the invention. As shown in fig. 6, the user inputs a text to be read aloud, clicks the start-reading button to begin reading the sentence aloud, and clicks the finish-reading button to finish reading.
As shown in fig. 7, the prosody score and a star rating are displayed on the screen, and prosodic error details are fed back from the perspectives of re-reading, pause, and boundary tone. The score ranges from 0 to 100; a higher score indicates better prosody, as does a higher star rating. Different colors reflect different error-correction details: green marks correct items (correct re-reading, correct pause, correct boundary tone), red marks errors (wrong re-reading, wrong pause, wrong boundary tone), and orange marks corrections (re-reading correction, pause correction, boundary tone correction). In fig. 7, the upper part shows the prosodic error details and the lower part shows the prosody score and star rating. The sentence receives a prosody score of 75; the correctly re-read words are know, fact and know, the incorrectly re-read word is the, the correctly paused word is fact, the incorrectly paused word is the, the correct boundary tone falls on fact, and the incorrect boundary tone falls on know.
Referring to fig. 8, fig. 8 is a schematic flow chart of a method for evaluating speech prosody based on artificial intelligence according to an embodiment of the present invention. As shown in fig. 8, the method comprises the steps of:
1) the user first opens the application (APP) and inputs a sentence or a passage of English to be read aloud;
2) the user taps the APP's record button to start recording;
3) the user taps the APP again to finish recording;
4) the APP sends the text and the audio to a server;
5) the server sends the text to the prosody generation module;
6) the server sends the audio to the prosody detection module;
7) the prosody generation module sends the generated prosodic standard to the prosody scoring module, and the prosody detection module sends the prosody detection result to the prosody scoring module;
8) the prosody scoring module combines the prosodic standard generated by the prosody generation module with the prosody detection result of the prosody detection module, outputs a prosody score, and returns it to the server;
9) the server receives the final prosody score, returns it to the APP, and displays it to the user.
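A hypothetical server-side sketch of steps 4) to 9), assuming an HTTP interface built with Flask; the endpoint path and module functions are placeholders, since no concrete API is disclosed:

```python
# Hypothetical sketch only; the patent does not define a server API.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_prosody_standard(text):    # prosody generation module, step 5)
    return {"reread": [], "pause": [], "boundary_tone": []}

def detect_prosody(audio_bytes):        # prosody detection module, step 6)
    return {"reread": [], "pause": [], "boundary_tone": []}

def score_prosody(standard, detected):  # prosody scoring module, steps 7)-8)
    return 75.0

@app.post("/prosody/score")             # step 4): the APP posts text and audio
def prosody_score():
    text = request.form["text"]
    audio = request.files["audio"].read()
    standard = generate_prosody_standard(text)
    detected = detect_prosody(audio)
    # step 9): the final prosody score is returned to the APP
    return jsonify({"prosody_score": score_prosody(standard, detected)})
```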
In some embodiments, prosody scoring may be considered from two perspectives: 1. whether re-reading, pauses, and boundary tones are correct; 2. whether the speech has a sense of rhythm. Based on these two perspectives, a tree model (namely, the prosody scoring model) fits the prosodic features of L2 speakers to expert scores, and finally outputs the machine prosody score.
1) Determination of re-reading, pause, and boundary tone
The prosody generation module outputs the prosodic standard for a given text. From the re-reading (stress) perspective, in units of syllables, each syllable can be classified as: 1. must be re-read (stressed); 2. must be read lightly (unstressed); 3. may be read lightly or re-read. From the pause perspective, in units of words: 1. must pause; 2. must not pause; 3. may pause or not pause. From the boundary tone perspective, in units of sentences: 1. must rise; 2. must not rise; 3. may rise or not rise. The audio is input into the prosody detection module, which outputs the re-reading positions, pause positions, and boundary tone type of the speaker's actual pronunciation. The actual pronunciation prosody is then compared with the standard pronunciation prosody, and the re-reading error rate, pause error rate, and boundary tone error rate are calculated respectively to obtain the evaluation features for re-reading, pause, and boundary tone.
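A minimal sketch of the three error-rate computations, assuming the prosodic standard labels each unit as 1 (must), 0 (must not), or None (optional):

```python
# Sketch only: the 1/0/None labeling scheme is an assumed encoding of the
# "must / must not / optional" categories described above.
def error_rate(standard, actual):
    """Fraction of non-optional units whose realization contradicts the standard."""
    scored = [(s, a) for s, a in zip(standard, actual) if s is not None]
    return sum(s != a for s, a in scored) / len(scored) if scored else 0.0

# Hypothetical labels: per-syllable re-reading, per-word pause,
# per-sentence boundary tone (1 = rise).
reread_std,   reread_act   = [1, 0, None, 1], [1, 1, 1, 1]
pause_std,    pause_act    = [0, 1, None],    [0, 0, 0]
boundary_std, boundary_act = [1],             [0]

print(error_rate(reread_std, reread_act))      # 0.333... re-reading error rate
print(error_rate(pause_std, pause_act))        # 0.5      pause error rate
print(error_rate(boundary_std, boundary_act))  # 1.0      boundary tone error rate
```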
2) Determination of the sense of rhythm
Since English is a stress-timed language while Chinese is a syllable-timed language, the pronunciation habits of the two languages differ greatly, and an L2 speaker's pronunciation is easily influenced by his or her native language. In implementing the embodiment of the invention, the inventors found that L2 speakers' sentences often lack a strong sense of rhythm, with no sharp contrast in duration or intensity between stressed and weak syllables. Based on this phenomenon, the following rhythm feature evaluation method is implemented: the time length difference coefficient between stressed syllables is calculated for both native-speaker and L2 pronunciation, as shown in formula (1).
var_coef = std(diff) / mean(diff)    (1)
diff = [t1 - t0, t2 - t1, …, tn - t(n-1)]    (2)
mean(diff) = (1/n) Σ_{i=1..n} diff_i    (3)
std(diff) = sqrt((1/n) Σ_{i=1..n} (diff_i - mean(diff))^2)    (4)
Formula (1) defines the time length difference coefficient of the stressed syllables: the quotient of the standard deviation of the time differences and their average. Here diff is the array of time differences between adjacent stressed syllables in a sentence, and t0 to tn are the times of the 0th to nth stressed syllables in the sentence; mean(diff) is the average of the time differences and std(diff) is their standard deviation; n is the number of time differences (one fewer than the number of stressed syllables in the sentence).
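Formulas (1) to (4) can be written directly in code; in this sketch, t holds hypothetical stressed-syllable times for one sentence:

```python
# Formulas (1)-(4) in code; the times in t are made up for illustration.
import numpy as np

def time_length_difference_coefficient(t):
    """std/mean of the gaps between adjacent stressed syllables, formula (1)."""
    diff = np.diff(t)                    # formula (2): [t1-t0, ..., tn-t(n-1)]
    return np.std(diff) / np.mean(diff)  # formulas (3) and (4) via numpy

t = [0.10, 0.62, 1.05, 1.71, 2.10]       # times of 5 stressed syllables (seconds)
print(time_length_difference_coefficient(t))  # larger value => less even timing
```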
Fig. 9 shows histograms of the distribution of stressed-syllable duration differences for L2 speakers and for native speakers according to an embodiment of the invention. As shown in fig. 9, the distribution of the time length difference coefficient for native speakers differs noticeably from that for L2 speakers: black is the histogram of the L2 speakers' stress duration differences, gray is the histogram of the native speakers', the horizontal axis is the normalized duration difference, and the vertical axis is the frequency. The position of an L2 speaker's pronunciation within the native-speaker distribution is then calculated as shown in formula (5), finally yielding the L2 speaker's rhythm evaluation score.
score = (var_coef(non_native) - mean(var_coef(native))) / std(var_coef(native))    (5)
Here var_coef(non_native) is the time length difference coefficient between the stressed syllables of the L2 speaker's pronunciation, mean(var_coef(native)) is the average of the time length difference coefficients between the stressed syllables of native speakers' pronunciation, and std(var_coef(native)) is the standard deviation of those coefficients. Formula (5) computes the rhythm evaluation: the average and standard deviation of the native speakers' coefficients are calculated, and the L2 speaker's coefficient is normalized based on them to obtain the rhythm evaluation score.
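Formula (5) amounts to a z-score against the native-speaker distribution; a minimal sketch, with the native coefficients made up for illustration:

```python
# Formula (5) as a z-score; the native-speaker coefficients are hypothetical.
import numpy as np

def rhythm_score(var_coef_non_native, native_coefs):
    mu, sigma = np.mean(native_coefs), np.std(native_coefs)
    return (var_coef_non_native - mu) / sigma  # position in the native distribution

native_coefs = [0.41, 0.38, 0.45, 0.40, 0.43]  # hypothetical native var_coefs
print(rhythm_score(0.62, native_coefs))         # far from 0 => weaker rhythm sense
```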
3) Comprehensive prosody scoring
A tree model is established to fit the expert scores based on the re-reading error rate, boundary tone error rate, and pause error rate, combined with factors such as the sense of rhythm. After training, each node in the tree is effectively a rule with its own weight, and the overall prosody score is output based on these rules.
The voice data samples provided by the embodiment of the invention were scored jointly by three experts on five grades, from very poor to very good pronunciation rhythm; the agreement of the three annotators is about 0.73. On 1000 voice data samples, the correlation between the machine prosody score and the expert scores, defined by formula (6), reaches 0.87. Experiments thus prove that the extracted prosodic features have a strong correlation with expert scoring.
ρ(X, Y) = Σ_{i=1..n} (x_i - μ_x)(y_i - μ_y) / (sqrt(Σ_{i=1..n} (x_i - μ_x)^2) · sqrt(Σ_{i=1..n} (y_i - μ_y)^2))    (6)
Formula (6) is the Pearson correlation coefficient, reflecting the correlation between two annotators' scores, where X denotes the scoring sequence of annotator x, Y denotes the scoring sequence of annotator y, x_i denotes annotator x's score for the ith sample, y_i denotes annotator y's score for the ith sample, μ_x is the average score of annotator x, μ_y is the average score of annotator y, and n is the number of scored samples.
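Formula (6) can be reproduced in a few lines; the two score sequences below are made up, and numpy's corrcoef yields the same Pearson coefficient as the direct formula:

```python
# Formula (6) in code; the score sequences are hypothetical.
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

machine = [75, 60, 88, 92, 55]   # hypothetical machine prosody scores
expert  = [72, 65, 85, 95, 50]   # hypothetical expert scores
print(pearson(machine, expert))            # formula (6) directly
print(np.corrcoef(machine, expert)[0, 1])  # numpy gives the same value
```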
Because finding prosody-related features requires professional linguistic knowledge, in other embodiments an end-to-end model from audio to prosody score may be built based on a large amount of prosody-labeled data, with the relevant features of the voice data extracted automatically by the network.
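A loose sketch of such an end-to-end model, assuming PyTorch and log-mel input features; the architecture and sizes are illustrative and are not specified by this embodiment:

```python
# Illustrative architecture only; the embodiment does not fix one.
import torch
import torch.nn as nn

class AudioToProsodyScore(nn.Module):
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # regress a single prosody score

    def forward(self, mel):                    # mel: (batch, frames, n_mels)
        out, _ = self.encoder(mel)
        pooled = out.mean(dim=1)               # average over time frames
        return self.head(pooled).squeeze(-1)   # (batch,) prosody scores

model = AudioToProsodyScore()
mel = torch.randn(2, 300, 80)                  # two dummy utterances of log-mels
print(model(mel).shape)                        # torch.Size([2])
```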
Embodiments of the present invention provide a storage medium storing executable instructions, which when executed by a processor, will cause the processor to perform the artificial intelligence based voice prosody assessment method provided by embodiments of the present invention, for example, the method shown in fig. 3-5, 8.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides an artificial intelligence based speech prosody evaluation method that places no limitation on the voice data to be evaluated: the user can freely input any text data and read it aloud, which expands the application range of speech prosody evaluation. In addition, the prosody of the voice data to be evaluated is evaluated by comprehensively considering the pronunciation feature evaluation result and the rhythm feature evaluation result, so that an accurate prosody score of the voice data to be evaluated can be obtained.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A speech prosody assessment method based on artificial intelligence is characterized by comprising the following steps:
receiving voice data to be evaluated and text data corresponding to the voice data to be evaluated;
determining a prosodic standard of pronunciation corresponding to the text data;
performing prosody detection processing on the voice data to be evaluated to obtain pronunciation characteristics and rhythm characteristics of the voice data to be evaluated;
comparing the pronunciation features with corresponding standard pronunciation features in the prosodic standard to obtain pronunciation feature evaluation results, an
Comparing the rhythm characteristics with standard rhythm characteristics corresponding to the rhythm standard to obtain rhythm characteristic evaluation results;
and performing evaluation processing based on the pronunciation characteristic evaluation result and the rhythm characteristic evaluation result through a decision tree model to obtain the rhythm score of the voice data to be evaluated.
2. The method of claim 1, wherein determining prosodic criteria for the pronunciation corresponding to the text data comprises:
determining standard pronunciation characteristics and standard rhythm characteristics corresponding to the text data;
wherein the standard pronunciation characteristics include a standard re-reading position, a standard pause position, and a standard boundary tone type.
3. The method according to claim 1, wherein the comparing the pronunciation characteristics with the corresponding standard pronunciation characteristics in the prosodic standard to obtain a pronunciation characteristic evaluation result comprises:
comparing the re-reading position included in the pronunciation characteristics with the standard re-reading position included in the standard pronunciation characteristics to obtain the re-reading error rate of the pronunciation characteristics;
comparing the pause position included in the pronunciation characteristics with the standard pause position included in the standard pronunciation characteristics to obtain the pause error rate of the pronunciation characteristics;
and comparing the boundary tone type included in the pronunciation characteristics with the standard boundary tone type included in the standard pronunciation characteristics to obtain the boundary tone error rate of the pronunciation characteristics.
4. The method according to claim 1, wherein comparing the rhythm characteristic with a standard rhythm characteristic corresponding to the prosodic standard to obtain a rhythm characteristic evaluation result comprises:
determining a time length difference coefficient between the stressed syllables of the voice data to be evaluated, and determining a time length difference coefficient between the stressed syllables in the prosodic standard of the pronunciation corresponding to the text data;
and normalizing the time length difference coefficient between the stressed syllables of the voice data to be evaluated based on the time length difference coefficient between the stressed syllables in the prosodic standard, and determining the normalized coefficient as the rhythm characteristic evaluation result.
5. The method of claim 4, wherein determining the time length difference coefficient between the stressed syllables of the voice data to be evaluated comprises:
determining the time difference between each pair of adjacent stressed syllables in the voice data to be evaluated;
determining the standard deviation and the average of the time differences;
and determining the quotient of the standard deviation and the average value as a time length difference coefficient between the stressed syllables of the voice data to be evaluated.
6. The method according to claim 4, wherein the normalizing the time length difference coefficient between the stressed syllables of the voice data to be evaluated based on the time length difference coefficient between the stressed syllables in the prosodic standard comprises:
determining the standard deviation and the average of the time length difference coefficients between the stressed syllables in the prosodic standard;
and normalizing the time length difference coefficient between the stressed syllables of the voice data to be evaluated based on the standard deviation and the average.
7. The method according to claim 1, wherein the performing, by a decision tree model, an evaluation process based on the pronunciation feature evaluation result and the rhythm feature evaluation result to obtain a prosody score of the speech data to be evaluated comprises:
scoring the pronunciation characteristic evaluation result and the rhythm characteristic evaluation result through a node in a decision tree model;
and carrying out weighting processing on the obtained scores according to the weights of the nodes to obtain the prosodic scores of the voice data to be evaluated.
8. The method of claim 1, further comprising:
and marking the corresponding position of the text data according to the pronunciation characteristic evaluation result, and returning the marked text data to the user terminal.
9. The method according to any one of claims 1 to 8, wherein before the evaluation processing based on the pronunciation feature evaluation result and the rhythm feature evaluation result is performed by a decision tree model, the method further comprises:
acquiring a voice data sample and a corresponding prosody score, and performing prosody detection processing on the voice data sample to obtain a corresponding pronunciation characteristic and a corresponding rhythm characteristic;
selecting features with classification capability from pronunciation features and rhythm features of the voice data samples as nodes to construct an initial decision tree model;
and pruning the constructed initial decision tree model to obtain the decision tree model for prosody scoring.
10. An artificial intelligence based speech prosody evaluation apparatus, characterized in that the apparatus comprises:
a receiving module configured to receive voice data to be evaluated and text data corresponding to the voice data to be evaluated;
a prosody generation module configured to generate a prosodic standard of the pronunciation corresponding to the text data;
a prosody detection module configured to perform prosody detection processing on the voice data to be evaluated to obtain pronunciation characteristics and rhythm characteristics of the voice data to be evaluated;
a prosody evaluation module configured to compare the pronunciation characteristics with the standard pronunciation characteristics corresponding to the prosodic standard to obtain a pronunciation characteristic evaluation result, and
to compare the rhythm characteristics with the standard rhythm characteristics corresponding to the prosodic standard to obtain a rhythm characteristic evaluation result;
wherein the prosody evaluation module is further configured to perform evaluation processing based on the pronunciation characteristic evaluation result and the rhythm characteristic evaluation result to obtain the prosody score of the voice data to be evaluated.
CN201910969890.XA 2019-10-12 2019-10-12 Speech prosody assessment method and device based on artificial intelligence Active CN110782918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910969890.XA CN110782918B (en) 2019-10-12 2019-10-12 Speech prosody assessment method and device based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN110782918A true CN110782918A (en) 2020-02-11
CN110782918B CN110782918B (en) 2024-02-20

Family

ID=69385250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910969890.XA Active CN110782918B (en) 2019-10-12 2019-10-12 Speech prosody assessment method and device based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN110782918B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003017251A1 (en) * 2001-08-14 2003-02-27 Vox Generation Limited Prosodic boundary markup mechanism
CN1938756A (en) * 2004-03-05 2007-03-28 莱塞克技术公司 Prosodic speech text codes and their use in computerized speech systems
CN102237081A (en) * 2010-04-30 2011-11-09 国际商业机器公司 Method and system for estimating rhythm of voice
US20120245942A1 (en) * 2011-03-25 2012-09-27 Klaus Zechner Computer-Implemented Systems and Methods for Evaluating Prosodic Features of Speech
CN109979486A (en) * 2017-12-28 2019-07-05 中国移动通信集团北京有限公司 A kind of speech quality assessment method and device
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724813A (en) * 2020-06-17 2020-09-29 东莞理工学院 LSTM-based piano playing automatic scoring method
EP4276827A4 (en) * 2021-02-07 2024-06-05 Lemon Inc Speech similarity determination method, device and program product
CN112951183A (en) * 2021-02-25 2021-06-11 西华大学 Music automatic generation and evaluation method based on deep learning
CN113192494A (en) * 2021-04-15 2021-07-30 辽宁石油化工大学 Intelligent English language identification and output system and method
CN113808579A (en) * 2021-11-22 2021-12-17 中国科学院自动化研究所 Detection method and device for generated voice, electronic equipment and storage medium
CN113808579B (en) * 2021-11-22 2022-03-08 中国科学院自动化研究所 Detection method and device for generated voice, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110782918B (en) 2024-02-20


Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40020982
Country of ref document: HK
SE01 Entry into force of request for substantive examination
GR01 Patent grant