CN112908359A - Voice evaluation method and device, electronic equipment and computer readable medium

Voice evaluation method and device, electronic equipment and computer readable medium

Info

Publication number
CN112908359A
Authority
CN
China
Prior art keywords: data, acoustic, quality score, pronunciation quality, pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110132082.5A
Other languages
Chinese (zh)
Inventor
郭伟
李轶杰
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110132082.5A priority Critical patent/CN112908359A/en
Publication of CN112908359A publication Critical patent/CN112908359A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a voice evaluation method, a voice evaluation device, electronic equipment and a computer readable medium. The method comprises the following steps: acquiring voice data to be evaluated and text data corresponding to the voice data; inputting the voice data into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data; inputting the text data into a text network to generate auxiliary data; decoding based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score; and determining an evaluation result of the voice data to be evaluated based on the first pronunciation quality score and the second pronunciation quality score. The voice evaluation method, device, electronic equipment and computer readable medium can characterize phoneme-level accuracy more precisely, so that the output evaluation information at the word, sentence and paragraph levels is more accurate.

Description

Voice evaluation method and device, electronic equipment and computer readable medium
Technical Field
The invention relates to the field of computer information processing, in particular to a voice evaluation method and device, electronic equipment and a computer readable medium.
Background
Pronunciation quality evaluation (pronunciation scoring) makes a machine automatically assess the pronunciation quality of target-language speech, and can be widely applied in spoken-language teaching and spoken-language examination systems. With globalization and China's growing international engagement, the demand of Chinese learners for English is increasing rapidly. However, limited by the domestic English learning environment and teaching conditions, English learners in China commonly find spoken-language learning difficult. With the development of computer science and technology and the progress of language teaching methodology, computer-aided language learning has made it possible to address this problem, and speech recognition and evaluation technology is its core and key. Because speech pronunciation varies in complex ways, speech signals carry large volumes of data, speech feature parameters are high-dimensional, and speech recognition and evaluation are computationally intensive, large-scale speech signal processing demands more capable software, hardware and algorithms. The traditional speech recognition approaches, namely the dynamic time warping algorithm, the hidden Markov model and artificial neural networks, each have advantages and disadvantages, have met bottlenecks, and are difficult to improve further in accuracy and speed.
Therefore, a new voice evaluation method, apparatus, electronic device and computer readable medium are needed.
The above information disclosed in this Background section is only for enhancement of understanding of the background of the invention, and therefore it may contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present invention provides a speech evaluation method, apparatus, electronic device and computer readable medium, which can effectively take into account both the slowly varying information of phonemes required in speech evaluation and the discriminative information between different phonemes, and can characterize phoneme-level accuracy more precisely, so that the output evaluation information at the word, sentence and paragraph levels is more accurate.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the present invention, there is provided a voice evaluation method, including: acquiring voice data to be evaluated and text data corresponding to the voice data; inputting the voice data into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data; inputting the text data into a text network to generate auxiliary data; decoding based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score; determining an evaluation result of the voice data to be evaluated based on the first pronunciation quality score and the second pronunciation quality score.
In an exemplary embodiment of the present invention, further comprising: training a deep neural network model through speech training data to generate the first acoustic model; wherein the first acoustic model is used for evaluating the change of phonemes in the voice data.
In an exemplary embodiment of the invention, the deep neural network model is trained based on cross-entropy criteria.
In an exemplary embodiment of the present invention, further comprising: training a time delay deep neural network model through voice training data to generate the second acoustic model; and the second acoustic model is used for evaluating the discrimination of phonemes in the voice data.
In an exemplary embodiment of the present invention, further comprising: and training the time delay deep neural network model based on the maximum mutual information criterion.
In an exemplary embodiment of the present invention, inputting the speech data into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data includes: performing feature extraction on the voice data to generate voice feature data; inputting the voice feature data into the first acoustic model and the second acoustic model respectively to obtain the first acoustic data and the second acoustic data.
In an exemplary embodiment of the present invention, decoding based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score includes: acquiring a preset pronunciation dictionary; generating the first pronunciation quality score and the second pronunciation quality score based on the pronunciation dictionary, the first acoustic data, the second acoustic data, the auxiliary data, and a pronunciation quality algorithm.
In an exemplary embodiment of the present invention, generating the first pronunciation quality score and the second pronunciation quality score based on the pronunciation dictionary, the first acoustic data, the second acoustic data, the auxiliary data, and a pronunciation quality algorithm includes: decoding the first acoustic data based on the pronunciation quality algorithm, the pronunciation dictionary, and the auxiliary data to obtain the first pronunciation quality score; and decoding the second acoustic data based on the pronunciation quality algorithm, the pronunciation dictionary and the auxiliary data to obtain the second pronunciation quality score.
In an exemplary embodiment of the present invention, determining an evaluation result of the speech data to be evaluated based on the first pronunciation quality score and the second pronunciation quality score includes: and fusing the first pronunciation quality score and the second pronunciation quality score to determine the evaluation result of the voice data to be evaluated.
In an exemplary embodiment of the present invention, fusing the first pronunciation quality score and the second pronunciation quality score to determine an evaluation result of the speech data to be evaluated, includes: linearly weighting the first pronunciation quality score and the second pronunciation quality score to determine an evaluation result of the voice data to be evaluated; or judging the first pronunciation quality score and the second pronunciation quality score based on a threshold value to determine the evaluation result of the voice data to be evaluated.
According to an aspect of the present invention, there is provided a voice evaluation apparatus including: the data module is used for acquiring voice data to be evaluated and text data corresponding to the voice data; the model calculation module is used for inputting the voice data into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data; the network computing module is used for inputting the text data into a text network to generate auxiliary data; a decoding module, configured to decode based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score; and the evaluation module is used for determining an evaluation result of the voice data to be evaluated based on the first pronunciation quality score and the second pronunciation quality score.
According to an aspect of the present invention, there is provided an electronic apparatus including: one or more processors; and storage means for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as above.
According to an aspect of the invention, there is proposed a computer-readable medium on which a computer program is stored which, when executed by a processor, carries out the method as above.
According to the voice evaluation method, the voice evaluation device, the electronic equipment and the computer readable medium, voice data to be evaluated and text data corresponding to the voice data are obtained; the voice data is input into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data; the text data is input into a text network to generate auxiliary data; decoding is performed based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score; and the evaluation result of the speech data to be evaluated is determined based on the first pronunciation quality score and the second pronunciation quality score. This manner of evaluation can effectively take into account both the slowly varying information of phonemes required in speech evaluation and the discriminative information between different phonemes, and can characterize phoneme-level accuracy more precisely, so that the output evaluation information at the word, sentence and paragraph levels is more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are only some embodiments of the invention and other drawings may be derived from those drawings by a person skilled in the art without inventive effort.
Fig. 1 is a system block diagram illustrating a speech assessment method and apparatus according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of speech assessment according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating a voice assessment method according to another exemplary embodiment.
Fig. 4 is a block diagram illustrating a voice evaluation device according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 6 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below could be termed a second component without departing from the teachings of the present concepts. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or flow charts in the drawings are not necessarily required to practice the present invention and are, therefore, not intended to limit the scope of the present invention.
The inventor of the invention finds that the existing speech evaluation technology usually collects a large amount of speech data with high pronunciation quality to train an acoustic model, then constructs a recognition network from the reference text, the pronunciation dictionary and the acoustic model, and finally uses the GOP (Goodness of Pronunciation) algorithm to produce a posterior probability that measures how well the learner pronounced a given phoneme. To obtain scores at more levels of the segmental hierarchy, multiple features must be combined in bottom-up order, yielding scores for phonemes, words, sentences, paragraphs and chapters in turn by averaging or weighted averaging.
Existing voice evaluation technology basically outputs evaluation scores through a single speech evaluation system (GMM, NN, and the like); because a single system is used for evaluation, the complementarity between different systems goes unexploited.
In view of the technical bottleneck in the prior art, the present invention provides a speech evaluation method and apparatus, and the following describes the content of the present invention in detail with reference to specific embodiments.
Fig. 1 is a system block diagram illustrating a speech assessment method and apparatus according to an exemplary embodiment. As shown in fig. 1, in the system fusion framework, the speech to be evaluated first undergoes feature extraction and is then fed into a first acoustic model (DNN) and a second acoustic model (FTDNN) in parallel; combined with the input evaluation text, the model outputs are decoded and passed to a first and a second pronunciation quality scoring model to obtain GOP1 and GOP2, and the outputs of the two systems are then fused.
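To make the data flow concrete, the following minimal Python sketch mirrors the fusion framework of fig. 1. It is only a sketch under stated assumptions: the two acoustic models and the GOP scorer are hypothetical stand-in callables supplied by the caller, and the fusion weight alpha is illustrative.

```python
# A minimal sketch of the fig. 1 fusion framework; the model objects and
# the gop_score callable are hypothetical stand-ins, not patent components.
def evaluate_utterance(feats, dnn_model, tdnn_model, gop_score, alpha=0.5):
    post_ce = dnn_model(feats)    # first acoustic model (CE-trained DNN)
    post_mmi = tdnn_model(feats)  # second acoustic model (MMI-trained TDNN)
    gop1 = gop_score(post_ce)     # first pronunciation quality score (GOP1)
    gop2 = gop_score(post_mmi)    # second pronunciation quality score (GOP2)
    return alpha * gop1 + (1.0 - alpha) * gop2  # fuse the two system outputs

# Toy usage with dummy callables standing in for the real models:
score = evaluate_utterance(
    feats=[0.1, 0.2],
    dnn_model=lambda f: sum(f),
    tdnn_model=lambda f: max(f),
    gop_score=lambda p: p,
)
print(score)  # ~0.25
```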
The score fusion is not limited to linear weighting, threshold determination, and the like. In a specific embodiment, the fused score may be calculated by linear weighting:

$$S = \alpha \cdot \mathrm{GOP}_1 + (1 - \alpha) \cdot \mathrm{GOP}_2$$

where $\mathrm{GOP}_1$ and $\mathrm{GOP}_2$ are the scores output by the two systems and $\alpha$ is a weighting coefficient satisfying $0 \le \alpha \le 1$.
An existing voice evaluation system adopts either the NN-CE criterion, which describes the slowly varying information of a phoneme well, or the NN-MMI criterion, which describes the discriminative information between different phonemes well; each criterion has advantages and disadvantages, and a single system can hardly cover both.
In the invention, systems trained under different neural network criteria are used for speech evaluation and fused at the GOP level. This effectively takes into account both the slowly varying information of phonemes required in speech evaluation and the discriminative information between different phonemes, and characterizes phoneme-level accuracy more precisely, so that the output evaluation information at the word, sentence and paragraph levels is more accurate; the method thereby makes full use of the complementarity of different systems.
FIG. 2 is a flow diagram illustrating a method of speech assessment according to an exemplary embodiment. The voice evaluation method 20 includes at least steps S202 to S210.
As shown in fig. 2, in S202, voice data to be evaluated and text data corresponding thereto are acquired. The voice data to be evaluated may be speech in any language. The method can be applied to spoken-language test scenarios, where the test questions are the text data and the test recordings are the voice data to be evaluated.
In S204, the voice data is respectively input into the first acoustic model and the second acoustic model, so as to obtain first acoustic data and second acoustic data. For example, feature extraction is performed on the voice data to generate voice feature data; inputting the voice feature data into the first acoustic model and the second acoustic model respectively to obtain the first acoustic data and the second acoustic data.
In one embodiment, the speech data may first be processed by front-end signal processing, endpoint detection, and the like, after which features are extracted frame by frame; conventional feature types include MFCC, PLP, FBANK, etc. The extracted features are sent to the first acoustic model and the second acoustic model.
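As an illustration of this step, the snippet below extracts MFCC features frame by frame, assuming the third-party librosa library and a 16 kHz recording; the file name and the 25 ms / 10 ms window settings are illustrative, and PLP or FBANK features could be substituted.

```python
import librosa

# Load a mono utterance at 16 kHz (file name is illustrative).
y, sr = librosa.load("utterance.wav", sr=16000)

# Frame-level MFCCs with a 25 ms window and 10 ms hop, common speech defaults.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
feats = mfcc.T  # shape (num_frames, 13): one feature vector per frame
```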
In S206, the text data is input into a text network, and auxiliary data is generated. The text network may also be a pre-trained speech model that provides reference speech data for decoding the voice data to be evaluated.
In S208, decoding is performed based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score. For example, a preset pronunciation dictionary may be acquired, and the first pronunciation quality score and the second pronunciation quality score generated based on the pronunciation dictionary, the first acoustic data, the second acoustic data, the auxiliary data, and a pronunciation quality algorithm.
Under the joint computation of the acoustic model, the language model and the pronunciation dictionary, the best-matching word sequence is found and output as the recognition result. The acoustic model mainly describes the likelihood of the features under each pronunciation model; the language model mainly describes the transition probabilities between words; and the pronunciation dictionary converts between words and phonemes. The acoustic modeling unit may be a triphone model.
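As a toy illustration of the pronunciation dictionary's role in decoding, the sketch below expands a reference text into the phoneme sequence the decoder would align against; the entries and the helper function are hypothetical, not taken from the patent.

```python
# Toy pronunciation dictionary mapping words to phoneme sequences.
PRON_DICT = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text):
    """Expand a reference text into the phoneme sequence to decode against."""
    phones = []
    for word in text.lower().split():
        phones.extend(PRON_DICT[word])  # a real system must also handle OOVs
    return phones

print(text_to_phonemes("hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```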
In S210, an evaluation result of the voice data to be evaluated is determined based on the first pronunciation quality score and the second pronunciation quality score. The first pronunciation quality score and the second pronunciation quality score may be fused, for example, to determine an evaluation result of the speech data to be evaluated.
Fusing the first pronunciation quality score and the second pronunciation quality score to determine the evaluation result of the voice data to be evaluated includes: linearly weighting the first pronunciation quality score and the second pronunciation quality score to determine the evaluation result of the voice data to be evaluated; or judging the first pronunciation quality score and the second pronunciation quality score against a threshold value to determine the evaluation result of the voice data to be evaluated.
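A minimal sketch of the two fusion strategies just described, assuming both GOP scores lie on a comparable scale; the weight and threshold values are illustrative.

```python
def fuse_linear(gop1, gop2, alpha=0.5):
    """Linear weighting of the two pronunciation quality scores."""
    return alpha * gop1 + (1.0 - alpha) * gop2

def fuse_threshold(gop1, gop2, threshold=0.6):
    """Threshold judgment: one plausible reading is to accept the utterance
    only when both systems score at or above the threshold."""
    return gop1 >= threshold and gop2 >= threshold

print(fuse_linear(0.8, 0.6))     # ~0.7
print(fuse_threshold(0.8, 0.6))  # True
```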
According to the voice evaluation method, voice data to be evaluated and text data corresponding to the voice data are obtained; the voice data is input into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data; the text data is input into a text network to generate auxiliary data; decoding is performed based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score; and the evaluation result of the speech data to be evaluated is determined based on the first pronunciation quality score and the second pronunciation quality score. This manner of evaluation can effectively take into account both the slowly varying information of phonemes required in speech evaluation and the discriminative information between different phonemes, and can characterize phoneme-level accuracy more precisely, so that the output evaluation information at the word, sentence and paragraph levels is more accurate.
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Fig. 3 is a flowchart illustrating a voice assessment method according to another exemplary embodiment. The process 30 shown in fig. 3 is a further description of the process shown in fig. 2.
As shown in fig. 3, in S302, a deep neural network model is trained on speech training data to generate the first acoustic model, where the first acoustic model is used for evaluating the variation of phonemes in the voice data. The powerful feature learning capability of deep neural networks greatly simplifies feature extraction and reduces modeling's dependence on expert experience, so modeling has gradually shifted from the earlier cumbersome multi-step pipelines to simple end-to-end modeling; one consequence is that the modeling unit has gradually evolved from states and triphone models to larger units such as syllables and words.
In one embodiment, the method further comprises training the deep neural network model based on a cross-entropy criterion. Cross entropy measures the distance between two probability distributions: the smaller the cross entropy, the closer the two distributions. Minimizing the CE criterion is equivalent to minimizing the KL divergence between the empirical probability distribution and the probability distribution estimated by the DNN.
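In standard notation (a textbook identity rather than a formula from the patent), the equivalence follows from the decomposition of cross entropy:

```latex
% Cross entropy between the empirical distribution \tilde{p} and the DNN
% output distribution q decomposes into entropy plus KL divergence:
\mathrm{CE}(\tilde{p}, q)
  = -\sum_{k} \tilde{p}(k)\,\log q(k)
  = H(\tilde{p}) + \mathrm{KL}\!\left(\tilde{p} \,\|\, q\right)
% H(\tilde{p}) does not depend on the model parameters, so minimizing the
% CE criterion is equivalent to minimizing KL(\tilde{p} || q).
```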
The deep neural network (DNN) comprises multiple hidden layers and is trained with the CE criterion, with HMM states as the output layer; scores for phonemes, words and sentences are then obtained through the GOP algorithm. The CE criterion can describe the slowly varying information of phonemes well but cannot describe the differences between phonemes. In the embodiment of the invention, the GOP posteriors output by the DNN (Deep Neural Network) system describe the slowly varying information of phonemes well.
In S304, the time delay deep neural network model is trained by the speech training data to generate the second acoustic model. And the second acoustic model is used for evaluating the discrimination of phonemes in the voice data.
In one embodiment, the method further comprises training the time delay deep neural network model based on the maximum mutual information (MMI) criterion, which aims to maximize the mutual information between the word-sequence distribution and the observation-sequence distribution.
The time-delay deep neural network has two advantages: first, the added delays let the network see features over a larger time range; second, by adopting the lattice-free MMI training criterion, it can describe the discriminative information between phonemes more accurately. In the embodiment of the invention, the GOP posteriors output by the TDNN (Time-Delay Neural Network) system are strongly able to distinguish good phoneme pronunciations from bad ones.
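As a rough sketch of how a time-delay layer widens temporal context, the PyTorch snippet below realizes one TDNN layer as a dilated 1-D convolution over frames, which is a common way to implement time-delay layers; the dimensions and context settings are illustrative, and this is not the patent's exact network.

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """One time-delay layer: a 1-D convolution over frames whose dilation
    widens the temporal context seen by deeper layers."""
    def __init__(self, in_dim, out_dim, context=3, dilation=2):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context,
                              dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, in_dim, num_frames)
        return self.act(self.conv(x))

feats = torch.randn(1, 40, 200)  # e.g. 40-dim FBANK features over 200 frames
layer = TDNNLayer(40, 256)
print(layer(feats).shape)        # torch.Size([1, 256, 196])
```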
In S306, feature extraction is performed on the voice data to generate voice feature data.
In S308, the speech feature data is respectively input into the first acoustic model and the second acoustic model to obtain the first acoustic data and the second acoustic data.
In S310, decoding is performed based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score. The pronunciation quality score may be at one or more of the phoneme, syllable, and word levels.
In one embodiment, the first acoustic data may be decoded to obtain the first pronunciation quality score, for example, based on the pronunciation quality algorithm, the pronunciation dictionary, and the auxiliary data.
After features are extracted from a segment of the learner's speech, the phoneme-level GOP posterior probability is computed through the recognition network:

$$\mathrm{GOP}(p_i) = \frac{1}{t_e - t_s}\,\log\frac{P(O_i \mid p_i;\, t_s, t_e)}{\max_{q \in Q} P(O_i \mid q;\, t_s, t_e)}$$

where $t_s$ and $t_e$ represent the start and end times of the phoneme, respectively; the numerator $P(O_i \mid p_i; t_s, t_e)$ represents the likelihood score of the observation vector $O_i$ under model $p_i$ and, via forced alignment, may be obtained from the decoding path; and the denominator $\max_{q \in Q} P(O_i \mid q; t_s, t_e)$ is approximated by a text-related phoneme-loop network, with $Q$ representing the set of all phoneme models in the reference text.

The phoneme-level confidence score $p_i^{cm}$ can be obtained from the recognition network through the forward-backward algorithm.

The word-level posterior probability $W_i$ and confidence score $W_i^{cm}$ can be obtained by averaging the posterior probabilities and confidence scores of the word's phonemes, respectively:

$$W_i = \frac{1}{N}\sum_{n=1}^{N} p_n, \qquad W_i^{cm} = \frac{1}{N}\sum_{n=1}^{N} p_n^{cm}$$

where $N$ represents the number of phonemes contained in the word.

Finally, the sentence-level score $S_r$ may be obtained by a weighted average of the word posterior probabilities and confidence scores:

$$S_r = \frac{1}{M}\sum_{i=1}^{M}\left(\alpha\, W_i + \beta\, W_i^{cm}\right)$$

where $M$ is the number of words in the sentence, and $\alpha$ and $\beta$ are weighting coefficients satisfying $0 \le \alpha \le 1$, $0 \le \beta \le 1$, and $\alpha + \beta = 1$.
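The formulas above translate into a small numerical sketch; the inputs (segment log-likelihoods from forced alignment, competing-model log-likelihoods, and the phoneme posteriors and confidences from the forward-backward pass) are assumed to be pre-computed by the recognition network.

```python
import numpy as np

def gop(loglik_aligned, logliks_competing, t_s, t_e):
    """Phoneme-level GOP: duration-normalized log ratio of the forced-aligned
    likelihood to the best competing phoneme model q in Q."""
    return (loglik_aligned - np.max(logliks_competing)) / (t_e - t_s)

def word_level(phone_values):
    """Word-level posterior W_i or confidence W_i^cm: mean over N phonemes."""
    return float(np.mean(phone_values))

def sentence_score(word_posts, word_confs, alpha=0.5, beta=0.5):
    """Sentence-level score S_r: weighted average with alpha + beta = 1."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return float(np.mean(alpha * np.asarray(word_posts)
                         + beta * np.asarray(word_confs)))

# Toy numbers: one phoneme spanning frames 10..20, three competing models.
print(gop(-42.0, np.array([-42.0, -50.5, -61.2]), t_s=10, t_e=20))  # 0.0
print(sentence_score([0.8, 0.6], [0.7, 0.9]))                       # ~0.75
```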
in one embodiment, the second pronunciation quality score may be decoded from the second acoustic data, for example, based on the pronunciation quality algorithm, the pronunciation dictionary, and the auxiliary data.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. The computer program, when executed by the CPU, performs the functions defined by the method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Fig. 4 is a block diagram illustrating a voice evaluation device according to an exemplary embodiment. As shown in fig. 4, the voice evaluation device 40 includes: a data module 402, a model calculation module 404, a network calculation module 406, a decoding module 408, and an evaluation module 410.
The data module 402 is used for acquiring voice data to be evaluated and text data corresponding to the voice data;
the model calculation module 404 is configured to input the speech data into a first acoustic model and a second acoustic model, respectively, to obtain first acoustic data and second acoustic data;
the network computing module 406 is configured to input the text data into a text network, and generate auxiliary data;
the decoding module 408 is configured to decode based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score;
the evaluation module 410 is configured to determine an evaluation result of the voice data to be evaluated based on the first pronunciation quality score and the second pronunciation quality score.
According to the voice evaluation device, voice data to be evaluated and text data corresponding to the voice data are obtained; the voice data is input into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data; the text data is input into a text network to generate auxiliary data; decoding is performed based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score; and the evaluation result of the speech data to be evaluated is determined based on the first pronunciation quality score and the second pronunciation quality score. This manner of evaluation can effectively take into account both the slowly varying information of phonemes required in speech evaluation and the discriminative information between different phonemes, and can characterize phoneme-level accuracy more precisely, so that the output evaluation information at the word, sentence and paragraph levels is more accurate.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 500 according to this embodiment of the invention is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: at least one processing unit 510, at least one memory unit 520, a bus 530 that couples various system components including the memory unit 520 and the processing unit 510, a display unit 540, and the like.
Wherein the storage unit stores program code that is executable by the processing unit 510 to cause the processing unit 510 to perform steps according to various exemplary embodiments of the present invention described in this specification. For example, the processing unit 510 may perform the steps as shown in fig. 2, fig. 3.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read-only memory unit (ROM) 5203.
The memory unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 500' (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) via the network adapter 560. The network adapter 560 may communicate with other modules of the electronic device 500 via the bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, as shown in fig. 6, the technical solution according to the embodiment of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present invention.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic or optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: acquiring voice data to be evaluated and text data corresponding to the voice data; inputting the voice data into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data; inputting the text data into a text network to generate auxiliary data; decoding based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score; and determining an evaluation result of the voice data to be evaluated based on the first pronunciation quality score and the second pronunciation quality score.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly in one or more apparatuses unique from the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. A speech assessment method, comprising:
acquiring voice data to be evaluated and text data corresponding to the voice data;
inputting the voice data into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data;
inputting the text data into a text network to generate auxiliary data;
decoding based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score;
determining an evaluation result of the voice data to be evaluated based on the first pronunciation quality score and the second pronunciation quality score.
2. The method of claim 1, further comprising:
training a deep neural network model through speech training data to generate the first acoustic model;
wherein the first acoustic model is used for evaluating the change of phonemes in the voice data.
3. The method of claim 2, further comprising:
and training the deep neural network model based on a cross entropy criterion.
4. The method of claim 1, further comprising:
training a time delay deep neural network model through voice training data to generate the second acoustic model;
and the second acoustic model is used for evaluating the discrimination of phonemes in the voice data.
5. The method of claim 4, further comprising:
and training the time delay deep neural network model based on the maximum mutual information criterion.
6. The method of claim 1, wherein inputting the speech data into a first acoustic model and a second acoustic model, respectively, resulting in first acoustic data and second acoustic data, comprises:
performing feature extraction on the voice data to generate voice feature data;
inputting the voice feature data into the first acoustic model and the second acoustic model respectively to obtain the first acoustic data and the second acoustic data.
7. The method of claim 1, wherein decoding based on the first acoustic data, the second acoustic data, and the auxiliary data results in a first pronunciation quality score and a second pronunciation quality score, comprising:
acquiring a preset pronunciation dictionary;
generating the first pronunciation quality score and the second pronunciation quality score based on the pronunciation dictionary, the first acoustic data, the second acoustic data, the auxiliary data, and a pronunciation quality algorithm.
8. The method of claim 7, wherein generating the first pronunciation quality score and the second pronunciation quality score based on the pronunciation dictionary, the first acoustic data, the second acoustic data, the auxiliary data, and a pronunciation quality algorithm comprises:
decoding the first acoustic data based on the pronunciation quality algorithm, the pronunciation dictionary, and the auxiliary data to obtain the first pronunciation quality score;
and decoding the second acoustic data based on the pronunciation quality algorithm, the pronunciation dictionary and the auxiliary data to obtain the second pronunciation quality score.
9. The method of claim 1, wherein determining an assessment result for the speech data to be assessed based on the first pronunciation quality score and the second pronunciation quality score comprises:
and fusing the first pronunciation quality score and the second pronunciation quality score to determine the evaluation result of the voice data to be evaluated.
10. The method of claim 9, wherein fusing the first pronunciation quality score and the second pronunciation quality score to determine an assessment result for the speech data to be assessed comprises:
linearly weighting the first pronunciation quality score and the second pronunciation quality score to determine an evaluation result of the voice data to be evaluated; or
judging the first pronunciation quality score and the second pronunciation quality score based on a threshold value to determine the evaluation result of the voice data to be evaluated.
11. A speech evaluation device characterized by comprising:
the data module is used for acquiring voice data to be evaluated and text data corresponding to the voice data;
the model calculation module is used for inputting the voice data into a first acoustic model and a second acoustic model respectively to obtain first acoustic data and second acoustic data;
the network computing module is used for inputting the text data into a text network to generate auxiliary data;
a decoding module, configured to decode based on the first acoustic data, the second acoustic data, and the auxiliary data to obtain a first pronunciation quality score and a second pronunciation quality score;
and the evaluation module is used for determining an evaluation result of the voice data to be evaluated based on the first pronunciation quality score and the second pronunciation quality score.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
13. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-10.
CN202110132082.5A 2021-01-31 2021-01-31 Voice evaluation method and device, electronic equipment and computer readable medium Pending CN112908359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110132082.5A CN112908359A (en) 2021-01-31 2021-01-31 Voice evaluation method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110132082.5A CN112908359A (en) 2021-01-31 2021-01-31 Voice evaluation method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN112908359A true CN112908359A (en) 2021-06-04

Family

ID=76122002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110132082.5A Pending CN112908359A (en) 2021-01-31 2021-01-31 Voice evaluation method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN112908359A (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006337667A (en) * 2005-06-01 2006-12-14 Ntt Communications Kk Pronunciation evaluating method, phoneme series model learning method, device using their methods, program and recording medium
JP2015102806A (en) * 2013-11-27 2015-06-04 国立研究開発法人情報通信研究機構 Statistical acoustic model adaptation method, acoustic model learning method suited for statistical acoustic model adaptation, storage medium storing parameters for constructing deep neural network, and computer program for statistical acoustic model adaptation
CN104464757A (en) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 Voice evaluation method and device
CN107886968A (en) * 2017-12-28 2018-04-06 广州讯飞易听说网络科技有限公司 Speech evaluating method and system
CN108364634A (en) * 2018-03-05 2018-08-03 苏州声通信息科技有限公司 Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
CN109256152A (en) * 2018-11-08 2019-01-22 上海起作业信息科技有限公司 Speech assessment method and device, electronic equipment, storage medium
CN109741734A (en) * 2019-03-08 2019-05-10 北京猎户星空科技有限公司 A kind of speech evaluating method, device and readable medium
CN111951825A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, medium, device and computing equipment
CN109979482A (en) * 2019-05-21 2019-07-05 科大讯飞股份有限公司 A kind of evaluating method and device for audio
CN110600052A (en) * 2019-08-19 2019-12-20 天闻数媒科技(北京)有限公司 Voice evaluation method and device
CN111341346A (en) * 2020-02-17 2020-06-26 云南大学 Language expression capability evaluation method and system for fusion depth language generation model
CN112017694A (en) * 2020-08-25 2020-12-01 天津洪恩完美未来教育科技有限公司 Voice data evaluation method and device, storage medium and electronic device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115798519A (en) * 2023-02-10 2023-03-14 山东山大鸥玛软件股份有限公司 English multi-question spoken language pronunciation assessment method and system

Similar Documents

Publication Publication Date Title
CN112712804B (en) Speech recognition method, system, medium, computer device, terminal and application
KR101153078B1 (en) Hidden conditional random field models for phonetic classification and speech recognition
JP6777768B2 (en) Word vectorization model learning device, word vectorization device, speech synthesizer, their methods, and programs
Wei et al. A new method for mispronunciation detection using support vector machine based on pronunciation space models
US8548808B2 (en) Speech understanding apparatus using multiple language models and multiple language understanding models
JP3933750B2 (en) Speech recognition method and apparatus using continuous density Hidden Markov model
US8751226B2 (en) Learning a verification model for speech recognition based on extracted recognition and language feature information
EP1447792B1 (en) Method and apparatus for modeling a speech recognition system and for predicting word error rates from text
CN109979432B (en) Dialect translation method and device
US20070219777A1 (en) Identifying language origin of words
CN108766415B (en) Voice evaluation method
CN112466279B (en) Automatic correction method and device for spoken English pronunciation
CN111179916A (en) Re-scoring model training method, voice recognition method and related device
CN112397056B (en) Voice evaluation method and computer storage medium
JP6031316B2 (en) Speech recognition apparatus, error correction model learning method, and program
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
JP3660512B2 (en) Voice recognition method, apparatus and program recording medium
CN112908359A (en) Voice evaluation method and device, electronic equipment and computer readable medium
JP2013117683A (en) Voice recognizer, error tendency learning method and program
CN113066510B (en) Vowel weak reading detection method and device
JP6199994B2 (en) False alarm reduction in speech recognition systems using contextual information
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN113421587B (en) Voice evaluation method, device, computing equipment and storage medium
WO2022074760A1 (en) Data processing device, data processing method, and data processing program

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20210604)