CN113035237B - Voice evaluation method and device and computer equipment
- Publication number
- CN113035237B (application CN202110272416.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- specified
- text
- frame
- speech
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The application relates to the field of artificial intelligence and discloses a voice evaluation method, which comprises the following steps: acquiring a specified voice input by a user; inputting the specified voice into a voice recognition system, and acquiring acoustic features corresponding to the specified voice; inputting the specified voice into a text feature extraction model, translating the specified voice into a specified text, and extracting text features of the specified text; inputting the acoustic features and the text features into a forward neural network; acquiring the output values corresponding to the output nodes of the forward neural network; and taking the maximum output value as the evaluation result corresponding to the specified voice. The voice recognition system is trained with voice data accurately screened for the evaluation scene, which reduces the influence of environmental noise; by acquiring both the text features and the acoustic features of sentences, the analysis data for the forward neural network in the evaluation system is formed in multiple dimensions, achieving the effect of accurately evaluating speech.
Description
Technical Field
The application relates to the field of artificial intelligence, in particular to a voice evaluation method, a voice evaluation device and computer equipment.
Background
Because it is not limited by region, online teaching is increasingly popular, but it cannot provide the evaluation that offline teaching achieves and therefore cannot effectively support high-quality teaching. For example, the pronunciation accuracy and fluency of students learning a language cannot be evaluated, especially in the field of children's spoken English evaluation. Furthermore, due to the special environment of online teaching and the divergent characteristics of children's learning, mixed Chinese and English speech often occurs, and the noise of the online background environment prevents accurate recognition of children's non-native pronunciation and accurate localization of children's speech, which contains many pauses and incompletely pronounced words; as a result, the teaching goals of online language learning cannot be achieved.
Disclosure of Invention
The main purpose of the present application is to provide a voice evaluation method, aiming to solve the technical problem that the pronunciation of online speech cannot be recognized and evaluated in time.
The application provides a voice evaluation method, which comprises the following steps:
acquiring a specified voice input by a user;
inputting the specified voice into a voice recognition system, and acquiring acoustic features corresponding to the specified voice;
inputting the specified voice into a text feature extraction model, translating the specified voice into a specified text, and extracting text features of the specified text;
inputting the acoustic features and the text features into a forward neural network;
acquiring output values respectively corresponding to each output node of the forward neural network;
and taking the maximum output value as an evaluation result corresponding to the specified voice.
Preferably, before the step of inputting the specified voice into the voice recognition system and acquiring the acoustic features corresponding to the specified voice, the method includes:
training a time delay neural network through a voice data training set to obtain a basic model;
acquiring a first language data set and a second language data set corresponding to a specified language, wherein the first language data set is formed by collecting the voice of speakers whose native language is the specified language, and the second language data set is formed by collecting the specified-language voice of speakers in the evaluation region;
and fine-tuning the basic model through an objective function according to the first language data set to obtain a first acoustic model, and fine-tuning the basic model through the objective function according to the second language data set to obtain a second acoustic model.
Preferably, the step of inputting the specified voice into a voice recognition system and acquiring the acoustic feature corresponding to the specified voice includes:
inputting the specified speech into the first acoustic model and the second acoustic model, respectively;
acquiring first data corresponding to the specified voice output by the first acoustic model, and acquiring second data corresponding to the specified voice output by the second acoustic model, wherein the first data and the second data both comprise phonemes corresponding to each frame, and recognition probabilities corresponding to the phonemes of each frame;
and calculating the acoustic characteristics corresponding to the specified voice according to the first data and the second data.
Preferably, the step of calculating the acoustic feature corresponding to the specified voice according to the first data and the second data includes:
acquiring a first voice frame and a first mute frame corresponding to the specified voice in the first data, and acquiring a second voice frame and a second mute frame corresponding to the specified voice in the second data;
calculating the average value of the first voice frame and the second voice frame to be used as a total voice frame corresponding to the appointed voice, and calculating the average value of the first mute frame and the second mute frame to be used as a total mute frame corresponding to the appointed voice;
determining a first byte sequence corresponding to the specified voice according to the phonemes corresponding to the first voice frame and the first mute frame respectively, and determining a second byte sequence corresponding to the specified voice according to the phonemes corresponding to the second voice frame and the second mute frame respectively;
and respectively calculating the editing distance and the confidence coefficient difference corresponding to the specified voice according to the first byte sequence and the second byte sequence.
Preferably, the step of inputting the specified speech into the text feature extraction model, translating the specified speech into the specified text, and extracting the text feature of the specified text is preceded by the steps of:
acquiring a scene field corresponding to a current evaluation task;
screening a training set sample corresponding to the text feature extraction model according to the scene field;
and training a plurality of ngram language models on the training set samples to form the text feature extraction model.
Preferably, the step of inputting the specified speech into a text feature extraction model, translating the specified speech into a specified text, and extracting a text feature of the specified text includes:
inputting the appointed voice into an appointed ngram language model to obtain a first text corresponding to the appointed voice, wherein the appointed ngram language model belongs to any one of a plurality of ngram language models, and the first text is any one of the appointed texts;
acquiring a first word existing in a preset dictionary in the specified voice and a second word not existing in the preset dictionary in the specified voice according to the first text;
calculating a first text characteristic corresponding to the specified text according to the first word and the second word, wherein the first text characteristic is any one of the text characteristics corresponding to the specified text;
and according to the calculation process of the first text characteristic, calculating text characteristics obtained by respectively translating the specified voice by a plurality of ngram language models.
Preferably, the step of inputting the acoustic features and the text features into a forward neural network is preceded by:
calculating the sum of the feature quantity of the acoustic features and the feature quantity of the text features;
constructing neurons of each neural network layer in the forward neural network according to the sum;
acquiring training data with score labels;
and training the forward neural network under a linear rectification activation function through the training data marked by the scores.
The application also provides a voice evaluation device, including:
the first acquisition module is used for acquiring specified voice input by a user;
the second acquisition module is used for inputting the specified voice into a voice recognition system and acquiring acoustic characteristics corresponding to the specified voice;
the translation module is used for inputting the specified voice into the text feature extraction model, translating the specified voice into a specified text and extracting the text feature of the specified text;
an input module for inputting the acoustic features and the text features into a forward neural network;
a third obtaining module, configured to obtain output values corresponding to each output node of the forward neural network;
and a module for taking the maximum output value as the evaluation result corresponding to the specified voice.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as described above.
In this application, the voice recognition system is trained with voice data accurately screened for the evaluation scene, reducing the influence of environmental noise, and by acquiring the text features and acoustic features of sentences, the analysis data of the forward neural network in the evaluation system is formed in multiple dimensions, achieving the effect of accurately evaluating speech.
Drawings
FIG. 1 is a schematic flow chart of a speech assessment method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a voice evaluation system according to an embodiment of the present application;
fig. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, a voice evaluation method according to an embodiment of the present application includes:
s1: acquiring a specified voice input by a user;
s2: inputting the specified voice into a voice recognition system, and acquiring acoustic features corresponding to the specified voice;
s3: inputting the specified voice into a text feature extraction model, translating the specified voice into a specified text, and extracting text features of the specified text;
s4: inputting the acoustic features and the text features into a forward neural network;
s5: acquiring output values respectively corresponding to each output node of the forward neural network;
s6: and taking the maximum output value as an evaluation result corresponding to the specified voice.
In the method and device of the present application, the specified voice is translated into text by the text feature extraction model arranged at the front end of the system and features are extracted from the translated text; the voice recognition system recognizes the voice to obtain acoustic features; finally, the text features and the acoustic features are evaluated by the forward neural network. The text features include, but are not limited to, the richness of the vocabulary, grammatical accuracy, pronunciation quality, sentence fluency, and the semantic relevance to the prompt question. The acoustic features include, but are not limited to, the number of speech frames and the number of silence frames, and are calculated from the probabilities output by the voice recognition system.
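As an illustration only, the following minimal Python sketch shows how this two-branch pipeline could be wired together; the objects and methods (recognizer, text_extractor, forward_nn) are hypothetical placeholders, not an API defined by this application.

```python
# Hypothetical sketch of the two-branch evaluation pipeline described above.
# The model objects and their methods are placeholders, not an actual API.
from dataclasses import dataclass
from typing import List

@dataclass
class EvaluationResult:
    grade: int     # index of the output node with the largest value
    score: float   # that node's output value

def evaluate(speech, recognizer, text_extractor, forward_nn) -> EvaluationResult:
    acoustic_features: List[float] = recognizer.acoustic_features(speech)
    text = text_extractor.translate(speech)            # speech -> specified text
    text_features: List[float] = text_extractor.features(text)
    outputs: List[float] = forward_nn(acoustic_features + text_features)
    grade = max(range(len(outputs)), key=lambda i: outputs[i])
    return EvaluationResult(grade=grade, score=outputs[grade])
```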
The voice recognition system of this embodiment is a multilingual voice recognition system comprising an acoustic model and a language model. This application takes the evaluation of the English pronunciation of speakers whose native language is Chinese as an example. The voice recognition system is a Chinese-English bilingual voice recognition system; two such systems, used in parallel, are obtained by training on different training data: one is trained with English data spoken by native Chinese speakers, the other is trained with English data spoken by native English speakers, and the acoustic features are calculated from the outputs of the two Chinese-English bilingual voice recognition systems.
In the training data of the speech recognition system, the training text for the language model includes English text and Chinese text. The English text includes English textbook text, English proficiency test text and English speech annotation data; the Chinese text includes Chinese textbook text, Chinese proficiency test text and Chinese speech annotation data. The language model of the present application is a 3-gram model. In the data for training the acoustic model, substandard data is specially marked and treated as noise during training to improve the training effect. The marking includes, but is not limited to: transcribing only the main speaker; marking speech from other speakers that can be heard clearly as [&]; marking low-volume speech that cannot be heard clearly as [#]; marking words whose pronunciation is poor at every pronunciation unit; and marking the positions where English appears within Chinese. The acoustic model is trained with data from the evaluation scene, which effectively addresses interference from environmental or human noise; such in-scene data better matches the noisy conditions of real scenes, making the model more robust to noise.
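At training time, the annotation scheme above can be exploited by separating flagged tokens from clean ones. The sketch below assumes a transcript convention in which a noise marker such as [&] or [#] is prefixed to the affected token; the exact marker syntax is an assumption, since the original description of it is garbled.

```python
import re

# Assumed convention: tokens carrying a [&] (other speaker) or [#] (unclear,
# low-volume) marker are treated as noise rather than ordinary training text.
NOISE_MARKER = re.compile(r"\[[&#]\]")

def split_transcript(line: str):
    """Separate clean tokens from noise-flagged tokens in a labeled transcript."""
    clean, noisy = [], []
    for token in line.split():
        (noisy if NOISE_MARKER.search(token) else clean).append(token)
    return clean, noisy

clean, noisy = split_transcript("HELLO [#]uh WORLD [&]background")
# clean == ['HELLO', 'WORLD'], noisy == ['[#]uh', '[&]background']
```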
In this application, the voice recognition system is trained with voice data accurately screened for the evaluation scene, reducing the influence of ambient noise, and by acquiring the text features and acoustic features of sentences, the analysis data for the forward neural network in the evaluation system is formed in multiple dimensions, achieving the effect of accurately evaluating speech.
Further, the voice recognition system includes an acoustic model, and before the step S2 of inputting the specified voice into the voice recognition system and acquiring the acoustic feature corresponding to the specified voice, the method includes:
s201: training a time delay neural network through a voice data training set to obtain a basic model;
s202: acquiring a first language data set and a second language data set corresponding to a specified language, wherein the first language data set is formed by collecting the voice of speakers whose native language is the specified language, and the second language data set is formed by collecting the specified-language voice of speakers in the evaluation region;
s203: fine-tuning the basic model through an objective function according to the first language data set to obtain a first acoustic model, and fine-tuning the basic model through the objective function according to the second language data set to obtain a second acoustic model.
The acoustic model of the present application adopts a TDNN (Time Delay Neural Network) structure; the input features are 40-dimensional MFCCs (Mel-Frequency Cepstral Coefficients) concatenated with a 100-dimensional vector; speed perturbation is applied to the training data; and the training criterion is LF-MMI (Lattice-Free Maximum Mutual Information). Training is first performed on open-source Chinese and English data to generate a basic model, and the basic model is then fine-tuned with the sMBR (state-level Minimum Bayes Risk) objective function on English speech from speakers whose native language is Chinese to obtain an acoustic model.
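For orientation, a TDNN can be approximated with dilated 1-D convolutions over time, as in the PyTorch sketch below; the hidden sizes, temporal contexts and phone inventory are illustrative assumptions, and the speed perturbation, LF-MMI and sMBR parts of the recipe are not reproduced here.

```python
import torch
import torch.nn as nn

# Illustrative TDNN-style acoustic model: stacked 1-D convolutions with growing
# dilation approximate the spliced temporal contexts of a time delay network.
# Input dim 140 = 40 MFCCs + a 100-dim auxiliary vector, per the description;
# the layer sizes and the 200-phone output are assumptions, not the exact recipe.
class TDNN(nn.Module):
    def __init__(self, in_dim: int = 140, hidden: int = 512, num_phones: int = 200):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(hidden, num_phones, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)  # (batch, num_phones, frames reduced by the contexts)

frames = torch.randn(1, 140, 100)  # one utterance: 100 frames of stacked features
logits = TDNN()(frames)            # per-frame phone scores
```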
Further, the step S2 of inputting the specified voice into the voice recognition system and acquiring the acoustic feature corresponding to the specified voice includes:
s21: inputting the specified speech into the first acoustic model and the second acoustic model, respectively;
s22: acquiring first data corresponding to the specified voice output by the first acoustic model, and acquiring second data corresponding to the specified voice output by the second acoustic model, wherein the first data and the second data respectively comprise phonemes corresponding to each frame, and recognition probabilities corresponding to the phonemes of each frame;
s23: and calculating the acoustic characteristics corresponding to the specified voice according to the first data and the second data.
In this application, the acoustic model trained on English spoken by speakers whose native language is Chinese and the acoustic model trained on English spoken by speakers whose native language is English jointly process the speech under test. Both acoustic models output the phoneme corresponding to each frame of the speech under test and the probability value with which each frame is recognized as that phoneme, and the acoustics-related features of the speech under test are calculated from the probability values output by the two acoustic models.
Further, the step S23 of calculating the acoustic feature corresponding to the specified voice according to the first data and the second data includes:
s231: acquiring a first voice frame and a first mute frame corresponding to the specified voice in the first data, and acquiring a second voice frame and a second mute frame corresponding to the specified voice in the second data;
s232: calculating the average value of the first voice frame and the second voice frame to be used as a total voice frame corresponding to the appointed voice, and calculating the average value of the first mute frame and the second mute frame to be used as a total mute frame corresponding to the appointed voice;
s233: determining a first byte sequence corresponding to the specified voice according to the phonemes corresponding to the first voice frame and the first mute frame respectively, and determining a second byte sequence corresponding to the specified voice according to the phonemes corresponding to the second voice frame and the second mute frame respectively;
s234: and respectively calculating the editing distance and the confidence difference corresponding to the specified voice according to the first byte sequence and the second byte sequence.
The acoustic features of the present application include "the total number of speech frames excluding silence frames", "the number of silence frames", "the phoneme-level edit distance between the two acoustic models' outputs" and "the difference between the two acoustic models' confidences in the speech". The total number of speech frames excluding silence frames represents the length of the sentence. The number of silence frames represents speaking fluency; for example, non-fluent spoken English contains longer or more pauses, and the pauses correspond to silence frames. The phoneme-level edit distance between the two acoustic models' outputs and the difference between their confidences in the speech indirectly reflect how standard the pronunciation is: if the phoneme strings recognized by the two systems are the same, the pronunciation is standard; if not, it is non-standard. The edit distance is obtained by computing the Hamming distance between the character strings output by the two models, and the confidence difference is obtained from the confidence values of the two output character strings.
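A minimal sketch of combining the two models' outputs into these four acoustic features follows; the dictionary-based input format, the confidence being a single per-utterance value, and the padding of the shorter phoneme string before the Hamming comparison are all assumptions.

```python
def acoustic_features(first: dict, second: dict) -> list:
    """Combine the per-utterance outputs of the two acoustic models into the
    four acoustic features described above (input format is hypothetical)."""
    total_speech = (first["speech_frames"] + second["speech_frames"]) / 2
    total_silence = (first["silence_frames"] + second["silence_frames"]) / 2
    # Per the description, the phoneme-level edit distance is computed as a
    # Hamming distance over the two output strings (padded here, an assumption).
    a, b = first["phones"], second["phones"]
    width = max(len(a), len(b))
    hamming = sum(x != y for x, y in zip(a.ljust(width), b.ljust(width)))
    conf_diff = abs(first["confidence"] - second["confidence"])
    return [total_speech, total_silence, hamming, conf_diff]

feats = acoustic_features(
    {"speech_frames": 120, "silence_frames": 30, "phones": "HH AH L OW", "confidence": 0.91},
    {"speech_frames": 118, "silence_frames": 34, "phones": "HH AH L UW", "confidence": 0.83},
)
# feats == [119.0, 32.0, 1, 0.08] (up to floating-point rounding)
```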
Further, before the step S3 of inputting the specified speech into the text feature extraction model, translating the specified speech into the specified text, and extracting the text feature of the specified text, the method includes:
s31: acquiring a scene field corresponding to a current evaluation task;
s32: screening a training set sample corresponding to the text feature extraction model according to the scene field;
s33: and training a plurality of ngram language models on the training set samples to form the text feature extraction model.
The text feature extraction model of this embodiment consists of a plurality of ngram language models. The training set of the text feature extraction model is composed of text outside the scene domain and text inside the scene domain. Taking children's English learning as an example of the scene domain, the text outside the scene domain includes English speeches and English news reports, and the text inside the scene domain includes the highest-scoring written and read-aloud English texts. The plurality of ngram language models provided in this embodiment are 4 ngrams, namely language models with n = 1, 2, 3 and 4, and the training set of the text feature extraction model includes 3 training sets: one formed from text outside the scene domain, one formed from text inside the scene domain, and one formed from texts of the same question within the scene domain. This embodiment thus obtains 3 × 4 = 12 language models.
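A sketch of building the 3 × 4 = 12 models with NLTK follows; the toy corpora are placeholders, and NLTK's Laplace-smoothed model stands in for the back-off language model described here.

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

# Three corpora (out-of-domain, in-domain, same-question) x orders n = 1..4
# give the 12 language models; the sentences below are toy placeholders.
corpora = {
    "out_of_domain": [["an", "example", "news", "sentence"]],
    "in_domain":     [["a", "high", "scoring", "essay", "sentence"]],
    "same_question": [["an", "answer", "to", "this", "question"]],
}

models = {}
for name, sentences in corpora.items():
    for n in (1, 2, 3, 4):
        train, vocab = padded_everygram_pipeline(n, sentences)
        lm = Laplace(n)          # stand-in for a back-off n-gram model
        lm.fit(train, vocab)
        models[(name, n)] = lm   # 12 models in total
```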
Further, the step S3 of inputting the specified voice into the text feature extraction model, translating the specified voice into a specified text, and extracting a text feature of the specified text includes:
s301: inputting the appointed voice into an appointed ngram language model to obtain a first text corresponding to the appointed voice, wherein the appointed ngram language model belongs to any one of a plurality of ngram language models, and the first text is any one of the appointed texts;
s302: acquiring a first word existing in a preset dictionary in the specified voice and a second word not existing in the preset dictionary in the specified voice according to the first text;
s303: calculating a first text characteristic corresponding to the specified text according to the first word and the second word, wherein the first text characteristic is any one of the text characteristics corresponding to the specified text;
s304: and according to the calculation process of the first text characteristic, calculating text characteristics obtained by respectively translating the specified voice by a plurality of ngram language models.
In the embodiment of the application, each ngram language model separately translates the specified speech to obtain a group of text features. Each group of text features includes, but is not limited to, P/N, Poov/M, (P - Poov)/(N - M), N_bo and M, where N represents the number of words in the speech, P represents the total log probability of all words in the speech, Poov represents the log probability of the words in the speech that are not in the dictionary, N_bo represents the number of back-offs, and M represents the number of words in the speech that are not in the dictionary. P/N, the mean log probability of the sentence, represents the fluency of the sentence: a low probability value means poor fluency. Poov/M is the mean contribution of out-of-dictionary words to the log probability of the whole sentence, generally the probability impact of mispronounced words. (P - Poov)/(N - M), the mean of the remaining log probabilities, indicates whether the words other than the mispronounced ones are fluent. N_bo is the number of back-offs the language model performs on the input sentence and reflects how many word combinations in the input test sentence also appear in the training set; because the ngram language model is a back-off language model, the parameter N_bo is kept so that word combinations that do not appear in the training set still receive a corresponding probability. With comprehensive training data, the more standard the answer sentence is, the more of its word combinations appear in the training set and the lower N_bo is; otherwise N_bo is high. M represents the number of words not in the dictionary: when the spoken language is not standard, many words cannot be correctly recognized, and the recognition system forms the most probable word from the English phonemes, a word that is not in the dictionary. A higher value of M therefore indicates less standard spoken language.
For example, each ngram model yields 5 text features; training with 3 data sets and 4 ngram model structures yields 12 ngram language models, so each sentence obtains 60 text features, which widens the dimensional range of the data analysis and improves the accuracy of the voice evaluation.
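To make the per-model feature computation concrete, the following sketch computes the five features as reconstructed above (P/N, Poov/M, (P - Poov)/(N - M), N_bo, M); the input format, per-word log probabilities with OOV flags plus a back-off count from one ngram model, is a hypothetical convention rather than the system's actual interface.

```python
def text_features(word_logprobs: list, oov_flags: list, n_backoffs: int) -> list:
    """Five per-model text features from one n-gram model's output (hypothetical input format)."""
    N = len(word_logprobs)                        # words in the sentence
    P = sum(word_logprobs)                        # total log probability
    Poov = sum(lp for lp, oov in zip(word_logprobs, oov_flags) if oov)
    M = sum(oov_flags)                            # words not in the dictionary
    f1 = P / N                                    # sentence fluency
    f2 = Poov / M if M else 0.0                   # impact of out-of-dictionary words
    f3 = (P - Poov) / (N - M) if N > M else 0.0   # fluency of the remaining words
    return [f1, f2, f3, n_backoffs, M]

feats = text_features(
    word_logprobs=[-1.2, -0.8, -4.5, -1.0],  # one OOV word with low probability
    oov_flags=[0, 0, 1, 0],
    n_backoffs=2,
)
# feats == [-1.875, -4.5, -1.0, 2, 1]
```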
In a further embodiment of the present application, the text features also include "the total number of words N in the sentence", "the proportion of out-of-dictionary words in the total word count", "the number of occurrences of English words" and "the number of occurrences of Chinese words", so as to further improve the accuracy of the voice evaluation.
Further, before the step S4 of inputting the acoustic feature and the text feature into the forward neural network, the method includes:
s41: calculating the sum of the feature quantity of the acoustic features and the feature quantity of the text features;
s42: constructing neurons of each neural network layer in the forward neural network according to the sum;
s43: acquiring training data with score labels;
s44: and training the forward neural network under a linear rectification activation function through the training data marked by the scores.
The training data with score labels in this embodiment is formed by experts labeling the data: the experts grade each sentence into one of three levels {0, 1, 2}, where 0 represents fail, 1 represents a poor pass and 2 represents pass. The forward neural network has 3 output nodes matching the three levels, one per level; whichever output node has the largest value, the level represented by that node is the result.
The forward neural network of the present application is composed of 3 layers, and the number of neurons in each layer equals the feature dimension; since 68 text and acoustic features are obtained in this embodiment, each layer has 68 neurons. The activation function is the ReLU (Rectified Linear Unit), the optimizers are the SGD (Stochastic Gradient Descent) algorithm and the AdaGrad adaptive gradient algorithm, and the learning rate is 0.05.
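A PyTorch sketch of this scoring network and one training step follows; only SGD is shown (the description also mentions AdaGrad), and reading the three 68-neuron layers as two hidden layers plus a 3-node output is our assumption.

```python
import torch
import torch.nn as nn

# Forward (feed-forward) scoring network: 68-dim feature vector in, ReLU
# activations, three output nodes for the grades {0, 1, 2}, SGD at lr = 0.05.
model = nn.Sequential(
    nn.Linear(68, 68), nn.ReLU(),
    nn.Linear(68, 68), nn.ReLU(),
    nn.Linear(68, 3),               # one output node per grade
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(4, 68)          # a batch of 4 feature vectors
grades = torch.tensor([0, 2, 1, 2])    # expert score labels
loss = loss_fn(model(features), grades)
optimizer.zero_grad()
loss.backward()
optimizer.step()

predicted = model(features).argmax(dim=1)  # grade = node with the largest output
```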
Referring to fig. 2, a voice evaluation apparatus according to an embodiment of the present application includes:
the first acquisition module 1 is used for acquiring the specified voice input by the user;
the second obtaining module 2 is used for inputting the specified voice into a voice recognition system and obtaining acoustic features corresponding to the specified voice;
the translation module 3 is used for inputting the specified voice into the text feature extraction model, translating the specified voice into a specified text and extracting the text feature of the specified text;
an input module 4, configured to input the acoustic features and the text features into a forward neural network;
a third obtaining module 5, configured to obtain output values corresponding to each output node of the forward neural network;
and module 6, which is used for taking the maximum output value as the evaluation result corresponding to the specified voice.
For the relevant explanations of this apparatus embodiment, refer to the corresponding parts of the method described above; they are not repeated here.
Further, the voice recognition system includes an acoustic model, and the voice evaluation device further includes:
the first training module is used for training the time delay neural network through a voice data training set to obtain a basic model;
the fourth acquisition module is used for acquiring a first language data set and a second language data set corresponding to a specified language, wherein the first language data set is formed by collecting the voice of speakers whose native language is the specified language, and the second language data set is formed by collecting the specified-language voice of speakers in the region to be evaluated;
and the fine tuning module is used for fine tuning the basic model through an objective function according to the first language data set to obtain a first acoustic model, and fine tuning the basic model through the objective function according to the second language data set to obtain a second acoustic model.
Further, the second obtaining module 2 includes:
a first input unit configured to input the specified speech into the first acoustic model and the second acoustic model, respectively;
a first obtaining unit, configured to obtain first data corresponding to the specified voice output by the first acoustic model, and obtain second data corresponding to the specified voice output by the second acoustic model, where the first data and the second data each include a phoneme corresponding to each frame and a recognition probability corresponding to each frame of phonemes;
and the first calculating unit is used for calculating the acoustic characteristics corresponding to the specified voice according to the first data and the second data.
Further, the first calculation unit includes:
an obtaining subunit, configured to obtain a first speech frame and a first silence frame corresponding to the specified speech in the first data, and obtain a second speech frame and a second silence frame corresponding to the specified speech in the second data;
a first calculating subunit, configured to calculate an average value of the first speech frame and the second speech frame, as a total speech frame corresponding to the specified speech, and calculate an average value of the first silence frame and the second silence frame, as a total silence frame corresponding to the specified speech;
a determining subunit, configured to determine, according to phonemes corresponding to the first speech frame and the first silence frame, a first byte sequence corresponding to the specified speech, and determine, according to phonemes corresponding to the second speech frame and the second silence frame, a second byte sequence corresponding to the specified speech;
and the second calculating subunit is used for respectively calculating the editing distance and the confidence difference corresponding to the specified voice according to the first byte sequence and the second byte sequence.
Further, the voice evaluation device includes:
the fifth acquisition module is used for acquiring the scene field corresponding to the current evaluation task;
the screening module is used for screening a training set sample corresponding to the text feature extraction model according to the scene field;
and the composition module is used for training a plurality of ngram language models on the training set sample to form the text feature extraction model.
Further, the translation module 3 includes:
a second input unit, configured to input the specified speech into a specified ngram language model, so as to obtain a first text corresponding to the specified speech, where the specified ngram language model belongs to any one of multiple ngram language models, and the first text is any one of the specified texts;
a second obtaining unit, configured to obtain, according to the first text, a first word in the specified voice that exists in a preset dictionary, and a second word in the specified voice that does not exist in the preset dictionary;
a second calculating unit, configured to calculate, according to the first word and the second word, a first text feature corresponding to the specified text, where the first text feature is any one of text features corresponding to the specified text;
and the third calculating unit is used for calculating text characteristics obtained by respectively translating the specified voice by a plurality of ngram language models according to the calculation process of the first text characteristics.
Further, the voice evaluation device includes:
a calculation module for calculating a sum of the feature quantity of the acoustic features and the feature quantity of the text features;
a construction module for constructing neurons of each neural network layer in the forward neural network according to the sum;
the sixth acquisition module is used for acquiring the training data with score labels;
and the second training module is used for training the forward neural network under a linear rectification activation function through the training data marked by the grades.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface and a database connected by a system bus, wherein the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all the data required by the voice evaluation process. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the voice evaluation method.
The processor executes the voice evaluation method, which comprises the following steps: acquiring a specified voice input by a user; inputting the specified voice into a voice recognition system, and acquiring acoustic features corresponding to the specified voice; inputting the specified voice into a text feature extraction model, translating the specified voice into a specified text, and extracting text features of the specified text; inputting the acoustic features and the text features into a forward neural network; acquiring the output values respectively corresponding to the output nodes of the forward neural network; and taking the maximum output value as the evaluation result corresponding to the specified voice.
With this computer device, the voice recognition system is trained with voice data accurately screened for the evaluation scene, reducing the influence of environmental noise, and by acquiring the text features and acoustic features of sentences, the analysis data of the forward neural network in the evaluation system is formed in multiple dimensions, achieving the effect of accurately evaluating speech.
In one embodiment, before the step in which the processor inputs the specified voice into the voice recognition system to obtain the acoustic features corresponding to the specified voice, the method includes: training a time delay neural network through a voice data training set to obtain a basic model; acquiring a first language data set and a second language data set corresponding to a specified language, wherein the first language data set is formed by collecting the voice of speakers whose native language is the specified language, and the second language data set is formed by collecting the specified-language voice of speakers in the evaluation region; and fine-tuning the basic model through an objective function according to the first language data set to obtain a first acoustic model, and fine-tuning the basic model through the objective function according to the second language data set to obtain a second acoustic model.
In an embodiment, the step of inputting the specified speech into a speech recognition system by the processor and acquiring the acoustic feature corresponding to the specified speech includes: inputting the specified speech into the first acoustic model and the second acoustic model, respectively; acquiring first data corresponding to the specified voice output by the first acoustic model, and acquiring second data corresponding to the specified voice output by the second acoustic model, wherein the first data and the second data both comprise phonemes corresponding to each frame, and recognition probabilities corresponding to the phonemes of each frame; and calculating the acoustic characteristics corresponding to the specified voice according to the first data and the second data.
In an embodiment, the step of calculating, by the processor, the acoustic feature corresponding to the specified voice according to the first data and the second data includes: acquiring a first voice frame and a first mute frame corresponding to the specified voice in the first data, and acquiring a second voice frame and a second mute frame corresponding to the specified voice in the second data; calculating the average value of the first voice frame and the second voice frame to be used as a total voice frame corresponding to the appointed voice, and calculating the average value of the first mute frame and the second mute frame to be used as a total mute frame corresponding to the appointed voice; determining a first byte sequence corresponding to the specified voice according to the phonemes corresponding to the first voice frame and the first mute frame respectively, and determining a second byte sequence corresponding to the specified voice according to the phonemes corresponding to the second voice frame and the second mute frame respectively; and respectively calculating the editing distance and the confidence difference corresponding to the specified voice according to the first byte sequence and the second byte sequence.
In one embodiment, before the processor performs the steps of inputting the specified voice into a text feature extraction model, translating the specified voice into a specified text, and extracting the text features of the specified text, the method includes: acquiring the scene field corresponding to the current evaluation task; screening the training set samples corresponding to the text feature extraction model according to the scene field; and training a plurality of ngram language models on the training set samples to form the text feature extraction model.
In one embodiment, the step of inputting the specified speech into a text feature extraction model, translating the specified speech into a specified text, and extracting a text feature of the specified text by the processor includes: inputting the appointed voice into an appointed ngram language model to obtain a first text corresponding to the appointed voice, wherein the appointed ngram language model belongs to any one of a plurality of ngram language models, and the first text is any one of the appointed texts; acquiring a first word existing in a preset dictionary in the specified voice and a second word not existing in the preset dictionary in the specified voice according to the first text; calculating a first text characteristic corresponding to the specified text according to the first word and the second word, wherein the first text characteristic is any one of the text characteristics corresponding to the specified text; and according to the calculation process of the first text characteristic, calculating text characteristics obtained by respectively translating the specified voice by a plurality of ngram language models.
In one embodiment, the step of inputting the acoustic feature and the text feature into the forward neural network by the processor comprises: calculating the sum of the feature quantity of the acoustic features and the feature quantity of the text features; constructing neurons of each neural network layer in the forward neural network according to the sum; acquiring training data of the grading labels; and training the forward neural network under a linear rectification activation function through the training data marked by the scores.
It will be understood by those skilled in the art that the structure shown in fig. 3 is only a block diagram of a part of the structure related to the present application, and does not constitute a limitation to the computer device to which the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements a voice evaluation method comprising: acquiring a specified voice input by a user; inputting the specified voice into a voice recognition system, and acquiring acoustic features corresponding to the specified voice; inputting the specified voice into a text feature extraction model, translating the specified voice into a specified text, and extracting text features of the specified text; inputting the acoustic features and the text features into a forward neural network; acquiring the output values respectively corresponding to the output nodes of the forward neural network; and taking the maximum output value as the evaluation result corresponding to the specified voice.
With this computer-readable storage medium, the voice recognition system is trained with voice data accurately screened for the evaluation scene, reducing the influence of environmental noise, and by acquiring the text features and acoustic features of sentences, the analysis data of the forward neural network in the evaluation system is formed in multiple dimensions, achieving the effect of accurately evaluating speech.
In one embodiment, before the step in which the processor inputs the specified voice into the voice recognition system to obtain the acoustic features corresponding to the specified voice, the method includes: training a time delay neural network through a voice data training set to obtain a basic model; acquiring a first language data set and a second language data set corresponding to a specified language, wherein the first language data set is formed by collecting the voice of speakers whose native language is the specified language, and the second language data set is formed by collecting the specified-language voice of speakers in the evaluation region; and fine-tuning the basic model through an objective function according to the first language data set to obtain a first acoustic model, and fine-tuning the basic model through the objective function according to the second language data set to obtain a second acoustic model.
In an embodiment, the step of inputting the specified speech into a speech recognition system by the processor and acquiring the acoustic feature corresponding to the specified speech includes: inputting the specified speech into the first acoustic model and the second acoustic model, respectively; acquiring first data corresponding to the specified voice output by the first acoustic model, and acquiring second data corresponding to the specified voice output by the second acoustic model, wherein the first data and the second data both comprise phonemes corresponding to each frame, and recognition probabilities corresponding to the phonemes of each frame; and calculating the acoustic characteristics corresponding to the specified voice according to the first data and the second data.
In one embodiment, the step of calculating, by the processor, the acoustic feature corresponding to the specified voice according to the first data and the second data includes: acquiring a first voice frame and a first mute frame corresponding to the specified voice in the first data, and acquiring a second voice frame and a second mute frame corresponding to the specified voice in the second data; calculating the average value of the first voice frame and the second voice frame to be used as a total voice frame corresponding to the appointed voice, and calculating the average value of the first mute frame and the second mute frame to be used as a total mute frame corresponding to the appointed voice; determining a first byte sequence corresponding to the specified voice according to the phonemes corresponding to the first voice frame and the first mute frame respectively, and determining a second byte sequence corresponding to the specified voice according to the phonemes corresponding to the second voice frame and the second mute frame respectively; and respectively calculating the editing distance and the confidence difference corresponding to the specified voice according to the first byte sequence and the second byte sequence.
In one embodiment, before the processor performs the steps of inputting the specified voice into a text feature extraction model, translating the specified voice into a specified text, and extracting the text features of the specified text, the method includes: acquiring the scene field corresponding to the current evaluation task; screening the training set samples corresponding to the text feature extraction model according to the scene field; and training a plurality of ngram language models on the training set samples to form the text feature extraction model.
In one embodiment, the step of inputting the specified speech into a text feature extraction model, translating the specified speech into a specified text, and extracting a text feature of the specified text by the processor includes: inputting the appointed voice into an appointed ngram language model to obtain a first text corresponding to the appointed voice, wherein the appointed ngram language model belongs to any one of a plurality of ngram language models, and the first text is any one of the appointed texts; acquiring a first word existing in a preset dictionary in the specified voice and a second word not existing in the preset dictionary in the specified voice according to the first text; calculating a first text characteristic corresponding to the specified text according to the first word and the second word, wherein the first text characteristic is any one of the text characteristics corresponding to the specified text; and according to the calculation process of the first text characteristic, calculating text characteristics obtained by respectively translating the specified voice by a plurality of ngram language models.
In one embodiment, the step of inputting the acoustic features and the text features into the forward neural network by the processor comprises: calculating the sum of the feature quantity of the acoustic features and the feature quantity of the text features; constructing neurons of each neural network layer in the forward neural network according to the sum; acquiring training data of the grading labels; and training the forward neural network under a linear rectification activation function through the training data marked by the scores.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, apparatus, article or method that comprises the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
Claims (7)
1. A speech assessment method, comprising:
acquiring a specified voice input by a user;
inputting the specified voice into a voice recognition system, and acquiring acoustic features corresponding to the specified voice; the acoustic features comprise the number of voice frames and the number of mute frames;
inputting the specified voice into a text feature extraction model, translating the specified voice into a specified text calculated from the probabilities output by the voice recognition system, and extracting the text features of the specified text; the text features comprise the richness of the vocabulary, grammatical accuracy, pronunciation quality, sentence fluency, and the semantic relevance to the prompt question;
inputting the acoustic features and the text features into a forward neural network;
acquiring output values respectively corresponding to each output node of the forward neural network;
taking the maximum output value as an evaluation result corresponding to the specified voice;
the voice recognition system comprises an acoustic model, and before the step of inputting the specified voice into the voice recognition system and acquiring the acoustic features corresponding to the specified voice, the method comprises the following steps:
training a time delay neural network through a voice data training set to obtain a basic model;
acquiring a first language data set and a second language data set corresponding to a specified language, wherein the first language data set is formed by collecting speech of speakers whose native language is the specified language, and the second language data set is formed by collecting speech in the specified language from speakers in the region to be evaluated;
fine-tuning the basic model through an objective function according to the first language data set to obtain a first acoustic model, and fine-tuning the basic model through the objective function according to the second language data set to obtain a second acoustic model;
the step of inputting the specified voice into a voice recognition system and acquiring the acoustic features corresponding to the specified voice comprises:
inputting the specified speech into the first acoustic model and the second acoustic model, respectively;
acquiring first data corresponding to the specified voice output by the first acoustic model, and acquiring second data corresponding to the specified voice output by the second acoustic model, wherein the first data and the second data both comprise phonemes corresponding to each frame, and recognition probabilities corresponding to the phonemes of each frame;
calculating acoustic features corresponding to the specified voice according to the first data and the second data;
the step of calculating the acoustic feature corresponding to the specified voice according to the first data and the second data includes:
acquiring a first voice frame and a first mute frame corresponding to the specified voice in the first data, and acquiring a second voice frame and a second mute frame corresponding to the specified voice in the second data;
calculating the average of the first voice frames and the second voice frames as the total voice frames corresponding to the specified voice, and calculating the average of the first mute frames and the second mute frames as the total mute frames corresponding to the specified voice; wherein the average of the first and second voice frames is the total number of voice frames excluding mute frames, and the average of the first and second mute frames is the number of mute frames; the total number of voice frames excluding mute frames represents the length of a sentence, and the number of mute frames represents speaking fluency;
determining a first byte sequence corresponding to the specified voice according to the phonemes corresponding to the first voice frame and the first mute frame respectively, and determining a second byte sequence corresponding to the specified voice according to the phonemes corresponding to the second voice frame and the second mute frame respectively;
and calculating the edit distance and the confidence difference corresponding to the specified voice according to the first byte sequence and the second byte sequence, respectively.
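Purely as a hedged illustration of the acoustic-feature steps of claim 1 (frame averaging, edit distance, and confidence difference), the sketch below assumes each model's output is a list of per-frame tuples; that data layout, and all names in the code, are assumptions of this sketch rather than the patent's format:

```python
# Sketch of the acoustic-feature computation over the two acoustic models'
# outputs. Each frame is assumed to be a (phoneme, probability, is_mute) tuple.

def edit_distance(a, b):
    """Classic Levenshtein distance between two phoneme/byte sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (x != y))   # substitution
    return dp[-1]

def acoustic_features(first_frames, second_frames):
    def counts(frames):
        voice = sum(1 for f in frames if not f[2])
        mute = sum(1 for f in frames if f[2])
        return voice, mute

    v1, m1 = counts(first_frames)
    v2, m2 = counts(second_frames)
    total_voice = (v1 + v2) / 2     # sentence-length cue (mute frames excluded)
    total_mute = (m1 + m2) / 2      # speaking-fluency cue

    seq1 = [phoneme for phoneme, _, _ in first_frames]
    seq2 = [phoneme for phoneme, _, _ in second_frames]
    distance = edit_distance(seq1, seq2)

    conf1 = sum(p for _, p, _ in first_frames) / max(1, len(first_frames))
    conf2 = sum(p for _, p, _ in second_frames) / max(1, len(second_frames))
    return total_voice, total_mute, distance, abs(conf1 - conf2)
```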
2. The speech evaluation method according to claim 1, wherein the step of inputting the specified voice into the text feature extraction model, translating the specified voice into the specified text, and extracting the text features of the specified text is preceded by:
acquiring a scene field corresponding to a current evaluation task;
screening training set samples corresponding to the text feature extraction model according to the scene field;
and training a plurality of ngram language models on the training set samples to form the text feature extraction model.
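A toy sketch of this screening-and-training flow follows. The corpus layout, smoothing, and model orders are assumptions of the sketch; a production system would typically build the ngram models with an LM toolkit rather than by hand:

```python
# Illustrative only: screen training sentences by scene field, then build
# several count-based ngram models with add-one smoothing.
from collections import Counter

def screen_corpus(corpus, scene_field):
    """corpus: iterable of (scene_field, sentence) pairs; keep matching samples."""
    return [sentence for field, sentence in corpus if field == scene_field]

def train_ngram(sentences, n):
    gram_counts, context_counts = Counter(), Counter()
    vocabulary = set()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        vocabulary.update(tokens)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            gram_counts[gram] += 1
            context_counts[gram[:-1]] += 1

    def probability(gram):
        # Add-one smoothed conditional probability of the gram's last token.
        return (gram_counts[gram] + 1) / (context_counts[gram[:-1]] + len(vocabulary))
    return probability

corpus = [("oral_exam", "please describe your hometown"),
          ("oral_exam", "please describe your favorite book"),
          ("daily_chat", "what did you eat today")]
samples = screen_corpus(corpus, "oral_exam")
models = [train_ngram(samples, n) for n in (1, 2, 3)]  # "a plurality of ngram models"
```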
3. The speech evaluation method according to claim 2, wherein the step of inputting the specified speech into the text feature extraction model, translating the specified speech into the specified text, and extracting the text feature of the specified text comprises:
inputting the specified voice into a specified ngram language model to obtain a first text corresponding to the specified voice, wherein the specified ngram language model is any one of the plurality of ngram language models, and the first text is any one of the specified texts;
acquiring, according to the first text, first words of the specified voice that exist in a preset dictionary and second words of the specified voice that do not exist in the preset dictionary;
calculating a first text feature corresponding to the specified text according to the first words and the second words, wherein the first text feature is any one of the text features corresponding to the specified text; wherein each set of text features includes N, P, Poov, Nbo, and M, where N represents the number of words in the speech; P represents the probability value of all the words in the speech; Poov represents the probability value of the words in the speech that are not in the dictionary; Nbo represents the number of backoffs; and M represents the number of words in the speech that are not in the dictionary;
and calculating, according to the calculation process of the first text feature, the text features obtained by translating the specified voice with each of the plurality of ngram language models respectively.
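As a hedged sketch of these per-model features, the function below computes N, P, Poov, Nbo, and M for one model's translation of the speech. The exact definitions of P and the backoff count depend on the decoder and LM toolkit; both callbacks here are stand-ins assumed for illustration:

```python
import math

def text_features(words, dictionary, unigram_logprob, bigram_seen):
    """Compute N, P, Poov, Nbo, M for one ngram model's hypothesis text."""
    oov = [w for w in words if w not in dictionary]      # the "second words"

    N = len(words)                                       # words in the speech
    P = sum(unigram_logprob(w) for w in words)           # log probability of all words
    Poov = sum(unigram_logprob(w) for w in oov)          # log probability of OOV words
    Nbo = sum(1 for a, b in zip(words, words[1:])        # backoffs: unseen bigrams
              if not bigram_seen(a, b))
    M = len(oov)                                         # OOV word count
    return N, P, Poov, Nbo, M

# Toy usage with stand-in language-model callbacks:
features = text_features(
    "i goed to school".split(),
    dictionary={"i", "to", "school"},
    unigram_logprob=lambda w: math.log(1e-3),
    bigram_seen=lambda a, b: (a, b) == ("to", "school"),
)
```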
4. The speech evaluation method according to claim 1, wherein the step of inputting the acoustic features and the text features into the forward neural network is preceded by:
calculating the sum of the feature quantity of the acoustic features and the feature quantity of the text features;
constructing neurons of each neural network layer in the forward neural network according to the sum;
acquiring training data labeled with scores;
and training the forward neural network under a linear rectification activation function using the score-labeled training data.
5. A speech evaluation device characterized by comprising:
the first acquisition module is used for acquiring the specified voice input by the user;
the second acquisition module is used for inputting the specified voice into a voice recognition system and acquiring acoustic characteristics corresponding to the specified voice; the acoustic features comprise the number of voice frames and the number of mute frames;
the translation module is used for inputting the specified voice into the text feature extraction model, translating the specified voice into a specified text according to the probabilities calculated and output by the voice recognition system, and extracting the text features of the specified text; the text features comprise word richness, grammatical accuracy, pronunciation quality, sentence fluency, semantics, and the degree of relevance to the prompt question;
an input module, configured to input the acoustic features and the text features into a forward neural network;
a third obtaining module, configured to obtain output values corresponding to output nodes of the forward neural network;
the evaluation module is used for taking the maximum output value as the evaluation result corresponding to the specified voice;
the first training module is used for training the time delay neural network through a voice data training set to obtain a basic model;
the fourth acquisition module is used for acquiring a first language data set and a second language data set corresponding to the specified language, wherein the first language data set is formed by collecting speech of speakers whose native language is the specified language, and the second language data set is formed by collecting speech in the specified language from speakers in the region to be evaluated;
the fine tuning module is used for fine tuning the basic model through an objective function according to the first language data set to obtain a first acoustic model, and fine tuning the basic model through the objective function according to the second language data set to obtain a second acoustic model;
a first input unit configured to input the specified speech into the first acoustic model and the second acoustic model, respectively;
a first obtaining unit, configured to obtain first data corresponding to the specified speech output by the first acoustic model, and obtain second data corresponding to the specified speech output by the second acoustic model, where the first data and the second data each include a phoneme corresponding to each frame and a recognition probability corresponding to each frame;
the first calculation unit is used for calculating the acoustic features corresponding to the specified voice according to the first data and the second data;
an obtaining subunit, configured to obtain a first speech frame and a first silence frame corresponding to the specified speech in the first data, and obtain a second speech frame and a second silence frame corresponding to the specified speech in the second data;
a first calculating subunit, configured to calculate the average of the first speech frames and the second speech frames as the total speech frames corresponding to the specified speech, and to calculate the average of the first silence frames and the second silence frames as the total silence frames corresponding to the specified speech; wherein the average of the first and second speech frames is the total number of speech frames excluding silence frames, and the average of the first and second silence frames is the number of silence frames; the total number of speech frames excluding silence frames represents the length of a sentence, and the number of silence frames represents speaking fluency;
a determining subunit, configured to determine, according to phonemes corresponding to the first speech frame and the first silence frame, a first byte sequence corresponding to the specified speech, and determine, according to phonemes corresponding to the second speech frame and the second silence frame, a second byte sequence corresponding to the specified speech;
and the second calculating subunit is used for calculating the edit distance and the confidence difference corresponding to the specified speech according to the first byte sequence and the second byte sequence, respectively.
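For orientation only, the modules of claim 5 could be composed along these lines; every class and attribute name below is a stand-in assumed by this sketch, not the patent's implementation:

```python
class SpeechEvaluationDevice:
    """Minimal composition of the claimed modules (illustrative stand-ins)."""

    def __init__(self, acoustic_frontend, text_frontend, forward_net):
        self.acoustic_frontend = acoustic_frontend   # second acquisition module
        self.text_frontend = text_frontend           # translation module
        self.forward_net = forward_net               # input + third acquisition modules

    def evaluate(self, specified_speech):
        acoustic = self.acoustic_frontend(specified_speech)  # frame counts, distances
        textual = self.text_frontend(specified_speech)       # N, P, Poov, Nbo, M, ...
        outputs = self.forward_net(acoustic + textual)       # one value per grade
        # The maximum output value is taken as the evaluation result.
        return max(range(len(outputs)), key=outputs.__getitem__)
```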
6. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 4 when executing the computer program.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110272416.9A CN113035237B (en) | 2021-03-12 | 2021-03-12 | Voice evaluation method and device and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113035237A CN113035237A (en) | 2021-06-25 |
CN113035237B true CN113035237B (en) | 2023-03-28 |
Family
ID=76468756
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |