CN110322895B - Voice evaluation method and computer storage medium - Google Patents

Voice evaluation method and computer storage medium

Info

Publication number
CN110322895B
CN110322895B (application CN201810259445.XA)
Authority
CN
China
Prior art keywords
evaluated
vector
data
similarity
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810259445.XA
Other languages
Chinese (zh)
Other versions
CN110322895A (en)
Inventor
吴介圣 (Wu Jiesheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yidu Huida Education Technology Co ltd
Original Assignee
Beijing Yidu Huida Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yidu Huida Education Technology Co ltd filed Critical Beijing Yidu Huida Education Technology Co ltd
Priority to CN201810259445.XA priority Critical patent/CN110322895B/en
Publication of CN110322895A publication Critical patent/CN110322895A/en
Application granted granted Critical
Publication of CN110322895B publication Critical patent/CN110322895B/en
Legal status: Active (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a voice evaluation method and a computer storage medium. The voice evaluation method comprises the following steps: generating a vector to be evaluated of the voice data to be evaluated according to text data corresponding to the voice data to be evaluated; calculating the similarity between a preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model calculates the similarity from the cosine value of the two vectors and makes the calculated similarity a value greater than 0; and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule. The voice evaluation method can evaluate learning achievement in an expression training course.

Description

Voice evaluation method and computer storage medium
Technical Field
Embodiments of the present invention relate to the field of computer technology, and in particular to a voice evaluation method and a computer storage medium.
Background
With the development of computer and internet technologies, learning and teaching by means of computers and the internet has become a trend. Through the computer and the internet, students can study anytime and anywhere, without being limited by environmental factors such as location and class size. This is especially valuable for the education of young children, where computer- and internet-based instruction fills a gap left by existing approaches.
Taking the language expression training of children aged 3-8 by means of computers and the internet as an example, the existing training process is as follows: a group of interesting pictures is displayed to the student through a computer or mobile terminal device, and the student practices expression by describing the content of the pictures.
The existing expression training process has no feedback or judgment mechanism, so students cannot gauge how much they have improved after finishing training and learn little about their progress, which does not motivate them to continue learning and improving.
Disclosure of Invention
In view of this, embodiments of the present invention provide a speech evaluation method and a computer storage medium to solve the problem that a student's learning outcome cannot be evaluated in existing expression training courses.
According to a first aspect of the embodiments of the present invention, there is provided a speech evaluation method, including: generating a to-be-evaluated vector of the to-be-evaluated voice data according to text data corresponding to the to-be-evaluated voice data; calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated and enabling the calculated similarity to be a numerical value larger than 0; and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
According to a second aspect of embodiments of the present invention, there is provided a computer storage medium storing: instructions for generating a vector to be evaluated of the voice data to be evaluated according to the text data corresponding to the voice data to be evaluated; instructions for calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine value of the preset standard content vector and the vector to be evaluated and making the calculated similarity a value greater than 0; and instructions for generating and outputting the evaluation result data of the voice data to be evaluated according to the similarity and the preset voice scoring rule.
According to the scheme provided by the embodiments of the invention, the voice evaluation method can be applied to an expression training course to evaluate the voice input by a user: text data corresponding to the voice data to be evaluated is converted into a vector to be evaluated, and the similarity between this vector and a preset standard content vector is evaluated to obtain evaluation result data. The evaluation result data thus represents the user's expression ability, so the user can clearly and conveniently know his or her expression level and is stimulated to continue learning and making progress.
When evaluating voice data, the method converts the corresponding text data into a vector to be evaluated and uses a similarity calculation model to compute the similarity from the cosine value between that vector and a preset standard content vector, making the calculated similarity a value greater than 0. This addresses the low evaluation accuracy of existing similarity calculation methods in speech evaluation and, by eliminating negative similarity values, makes the evaluation result more accurate.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person skilled in the art can obtain other drawings based on them.
FIG. 1 is a flow chart illustrating steps of a speech evaluation method according to a first embodiment of the present invention;
FIG. 2 is a flow chart illustrating steps of a speech evaluation method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a doc2vec model used in the speech evaluation method in the embodiment shown in fig. 2.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention shall fall within the scope of the protection of the embodiments of the present invention.
Example one
Referring to fig. 1, a flowchart illustrating steps of a speech evaluation method according to a first embodiment of the present invention is shown.
The voice evaluation method of the embodiment comprises the following steps:
step S101: and generating a vector to be evaluated of the voice data to be evaluated according to the text data corresponding to the voice data to be evaluated.
The text data corresponding to the speech data to be evaluated can be obtained in any appropriate manner, for example by speech recognition: the speech data to be evaluated is recognized by a speech recognition model or algorithm, and the corresponding text data is generated. This converts the speech to be evaluated into text automatically, with high conversion efficiency and low labor intensity.
Of course, in other embodiments, the speech data to be evaluated may be manually converted into corresponding text data by manual transcription.
Similarly, the vector to be evaluated corresponding to the voice data to be evaluated can be obtained in any appropriate manner, for example by generating it from the corresponding text data with a deep learning model such as a Word2vec or doc2vec model. These models convert text data into vectors at the semantic level, reflect the semantics of the speech data well, and thus greatly help ensure the accuracy of the subsequent semantic-level speech evaluation. In addition, converting the recognized text into a vector and evaluating on that basis mitigates the problem that similar-sounding or identical-sounding words are recognized as the wrong characters, which would otherwise reduce the accuracy of the speech evaluation.
Of course, in other embodiments, the text data may also be converted into the corresponding vector to be evaluated through a one-hot representation, latent semantic analysis (LSA), or the like.
Step S102: and calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated and enabling the calculated similarity to be a numerical value larger than 0.
The preset standard content vector is generated according to reference answer data. For example, the reference answer data is converted into a corresponding standard content vector through a text vectorization model, such as the aforementioned Word2vec model, and the standard content vector is pre-stored in the computer device and/or the server.
Of course, in other embodiments, the reference answer data may be preset in the computer device and/or the server, and converted into the standard content vector if necessary.
The similarity calculation model calculates the similarity between the preset standard content vector and the vector to be evaluated; this similarity represents how close the reference answer data (corresponding to the preset standard content vector) is to the text data (corresponding to the vector to be evaluated), and the speech data to be evaluated is scored accordingly. In this embodiment, the similarity calculation model calculates the similarity from the cosine value between the preset standard content vector and the vector to be evaluated and makes the calculated similarity a value greater than 0.
The similarity calculation model determines the similarity from the cosine value of the two vectors and makes the calculated similarity a value greater than 0. This determines the similarity accurately and efficiently, avoids the loss of accuracy that homophones with different written forms cause in existing methods that determine similarity through text keywords, and avoids the inaccurate evaluation results that arise when the cosine value, and hence the calculated similarity, is negative.
Step S103: and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
Wherein the evaluation result data is used for representing the expression ability level of the user. For example, in an expression training session, the evaluation result data may characterize the expression level of the user. Depending on the level of expressive power to be characterized, different evaluation parameters may be included in the evaluation result data. For example, in the aforementioned expression training course, the evaluation result data includes semantic score, dynamics score, tone score, and the like.
In a feasible manner, the preset voice scoring rule includes a rule that the score in the evaluation result data is positively correlated with the similarity: the higher the similarity, the higher the score and the better the indicated expression ability.
The speech evaluation method can be applied to an expression training course to evaluate speech input by a user: the speech data to be evaluated is converted into a vector to be evaluated, and the similarity between this vector and a preset standard content vector is evaluated to obtain evaluation result data. The evaluation result data represents the user's expression ability, so the user can clearly and conveniently know his or her expression level and is stimulated to continue learning and making progress.
When evaluating voice data, the method converts the corresponding text data into a vector to be evaluated and uses a similarity calculation model to compute the similarity from the cosine value between that vector and a preset standard content vector, making the calculated similarity a value greater than 0. This addresses the low evaluation accuracy of existing similarity calculation methods in speech evaluation and, by eliminating negative similarity values, makes the evaluation result more accurate.
The speech evaluation method of the embodiment can be implemented by any suitable device with a data processing function, including: various terminal devices, servers, and the like.
Example two
Referring to fig. 2, a flowchart illustrating steps of a speech evaluation method according to a second embodiment of the present invention is shown.
In the present embodiment, the speech evaluation method is described as applied to an expression training course, particularly one for young children (for example, children aged 3 to 8 years). Of course, in other embodiments the speech evaluation method may be applied to any other suitable scenario, for example a training evaluation scenario for artificial intelligence devices, which this embodiment does not limit.
The voice evaluation method of the embodiment comprises the following steps:
step S201: and generating a vector to be evaluated of the voice data to be evaluated according to the text data corresponding to the voice data to be evaluated.
In the expression training course, a group of pictures and/or videos can be displayed to the user, who describes their content in language, thereby training expression ability. To let users grasp their learning situation more accurately, the voice data formed by the user's spoken language can be evaluated, so that the user knows the learning situation more intuitively and clearly, is urged to continue learning, and becomes more enthusiastic about it.
When evaluating the user's voice data so as to reflect expression ability, a suitable parameter can be used to represent that ability. For example, a picture is displayed to the user, and the degree to which the semantics of the user's voice data match the content of the displayed picture is judged by calculating the similarity between the text data corresponding to the voice data and the reference answer data corresponding to the picture.
In one possible way, when performing speech assessment, step S201 includes the following sub-steps:
substep 1: and recognizing the voice data to be evaluated by using the voice recognition model to generate text data corresponding to the voice data to be evaluated.
Optionally, the voice data to be evaluated may be obtained first; transcoding the voice data to be evaluated and generating transcoded voice data to be evaluated; and taking the transcoded voice data to be evaluated as the input of a voice recognition model, and generating text data corresponding to the voice data to be evaluated through the voice recognition model.
The speech data to be evaluated can be the user's voice collected by a recording device, a recording input by the user, or user speech data extracted from a database.
After the voice data to be evaluated is obtained, if the voice data to be evaluated meets the format required by the voice recognition model, the voice data to be evaluated can be directly used as the input of the voice recognition model, and the text data corresponding to the voice data to be evaluated is generated through the voice recognition model.
If the voice data to be evaluated does not meet the format required by the voice recognition model, it is transcoded to generate transcoded voice data to be evaluated. For example, suppose the speech data to be evaluated is in mp3 format with a sampling rate of 44100 Hz and 2 audio channels, and this does not conform to the input format required by the speech recognition model. The speech data is then transcoded, which may be done in any appropriate manner, for example with ffmpeg; the converted speech format is: wav, sampling rate 16000 Hz, 2 audio channels, 16 bit.
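As an illustration only, the following is a minimal Python sketch of such a transcoding step, calling ffmpeg through the standard subprocess module; the file names are hypothetical, and ffmpeg is assumed to be installed:

    import subprocess

    def transcode_for_asr(src_path, dst_path):
        # Transcode to the format assumed above for the speech recognition
        # model: wav, 16000 Hz sampling rate, 2 audio channels, 16 bit.
        subprocess.run(
            ["ffmpeg", "-y",          # overwrite the output file if present
             "-i", src_path,          # e.g. an mp3 at 44100 Hz, 2 channels
             "-ar", "16000",          # resample to 16000 Hz
             "-ac", "2",              # keep 2 audio channels
             "-sample_fmt", "s16",    # 16-bit samples
             dst_path],               # e.g. "recording.wav"
            check=True)

    # Hypothetical usage: transcode_for_asr("recording.mp3", "recording.wav")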
And the transcoded voice data to be evaluated is used as the input of a voice recognition model, and text data corresponding to the voice data to be evaluated is generated through the voice recognition model.
Those skilled in the art can select an appropriate speech recognition model as needed to recognize the speech data to be evaluated, which this embodiment does not limit. For example, the speech recognition model may be based on an HMM (Hidden Markov Model) and an N-gram language model, or an existing speech recognition tool may be called directly.
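As a sketch of the "existing speech recognition tool" option, the open-source SpeechRecognition package can be called as below; the package choice, the free Google web API, and the file name are illustrative assumptions, not part of the patent:

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("recording.wav") as source:   # transcoded wav from above
        audio = recognizer.record(source)           # read the whole file

    # Recognize Mandarin speech; returns the text data to be vectorized.
    text_data = recognizer.recognize_google(audio, language="zh-CN")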
It is inevitable in speech recognition that syllables with the same or similar pronunciation are recognized as different characters; in Mandarin, for example, a single syllable such as ma can correspond to several distinct characters ('mother', 'hemp', 'horse'). After the speech data to be evaluated is recognized and the text data is generated, the text data may therefore contain characters whose pronunciation matches the speech but whose written form is wrong. For the voice data of young children in particular, unclear articulation, pauses, and similar issues inevitably reduce the accuracy of speech recognition.
If the recognized text data and the reference answer data are fed directly into an existing text similarity calculation model, the inaccuracy of the recognized characters makes the similarity calculation inaccurate and degrades the accuracy of the subsequent speech evaluation. This is because existing similarity calculation models such as TF-IDF (term frequency-inverse document frequency), LSI (latent semantic indexing), and LDA (latent Dirichlet allocation) all compute similarity by keyword matching, and the accuracy of keyword matching depends on the accuracy of the recognized characters.
To overcome these defects and improve speech recognition accuracy, a speech recognition model can be trained on young children's speech using a machine learning model, and the trained model used to recognize the speech data to be evaluated, yielding more accurate text data.
Substep 2: and vectorizing the text data through a text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated voice data.
The text vector calculation model is used for converting the text data into corresponding vectors to be evaluated so as to perform subsequent text similarity calculation.
Optionally, a manner of implementing this step may include:
preprocessing the text data, and generating result data according to a preprocessing result, wherein the result data comprises word segmentation data used for indicating a plurality of words in the text data; and taking the word segmentation data of each word segmentation as the input of a text vector calculation model, and generating a to-be-evaluated vector of the to-be-evaluated voice data through the text vector calculation model.
In one possible approach, the pre-processing includes dirty data removal processing, word segmentation processing, and stop word removal processing.
Based on this, preprocessing the text data, and generating result data according to the preprocessing result includes:
pretreatment step 1: and performing dirty data removal processing on the text data, and obtaining effective text data.
Dirty data includes text that is empty and text that carries too little useful information for training, for example text whose word count is below a preset threshold or whose sentences lack both subject and predicate.
For example, text 1 consists only of punctuation (",.") and is empty data, and text 2, "There is a little bear", carries very little useful information; both are dirty data. Dirty data may be removed in any suitable manner, for example using existing dirty-data removal methods.
A pretreatment step 2: and performing word segmentation processing on the effective text data, and obtaining word segmentation data of a plurality of words in the effective text data.
The word segmentation process may be performed in any suitable manner, for example, using a Hidden Markov Model (HMM) based machine learning word segmentation Model.
As "text 3: for example, there is a bear on the picture, and after word segmentation processing, word segmentation data is as follows: picture/upper/present/one/bear.
A pretreatment step 3: and performing stop word removal processing on the word segmentation data of the plurality of word segmentations to obtain result data.
A stop word is a word that is removed during information and/or text processing in order to save storage space and improve processing efficiency. Stop words may be determined as needed; they are typically auxiliary or modal particles, but may also be other words, such as the "present" (there is) and "upper" (on) in the example above.
Taking the word segmentation data of text 3 as an example, after removing the stop words the result is: picture / one / bear.
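Putting the three preprocessing steps together, a minimal sketch is shown below; it assumes the jieba segmentation library, a toy stop-word set, and a toy dirty-data threshold, none of which the patent prescribes:

    import jieba  # third-party Chinese word segmentation library

    STOP_WORDS = {"上", "有", "的", "地", "啊"}   # hypothetical stop-word set
    MIN_WORDS = 3   # assumed threshold below which text counts as dirty data

    def preprocess(text):
        # Word segmentation, dropping whitespace and punctuation tokens.
        tokens = [t for t in jieba.lcut(text) if t.strip() and t not in "，。,."]
        if len(tokens) < MIN_WORDS:
            return None   # dirty data: empty or too little useful information
        # Stop-word removal yields the result data.
        return [t for t in tokens if t not in STOP_WORDS]

    # e.g. preprocess("图片上有一只熊") -> ["图片", "一只", "熊"] (picture / one / bear)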
And after the result data is obtained, the word segmentation data of each word segmentation is used as the input of a text vector calculation model, and a to-be-evaluated vector of the to-be-evaluated voice data is generated through the text vector calculation model.
Wherein, the text vector calculation model may be a deep learning model, such as: word2vec model, doc2vec model, etc. In this embodiment, a text vector calculation model is taken as doc2vec for example, a structure diagram of the doc2vec model is shown in fig. 3, and the doc2vec model includes an input layer (input layer), a hidden layer (hidden layer), and an output layer (output layer). The input layer is used for acquiring training sample data; the hidden layer is used for vectorizing the training sample data; the output layer is used for outputting the result.
To better adapt to the voice data of young children, the doc2vec model can be trained on young children's voice data, so that the trained model generates better vectors to be evaluated for the voice data to be evaluated.
The training of the doc2vec model may be implemented by a conventional training method, for example, by the following training process:
First, each word in each piece of training text data is initialized to an N-dimensional vector, where N may be set to an appropriate value as needed; preferably N ranges from 30 to 200 and can be adjusted as required.
For example, the word segmentation data of the text data is: {picture, one, bear, playing}; with the word initialization vector dimension set to 4, the initialized vector of each word is:
picture (x1): [1,0,0,0]; x1 indicates the word "picture" as the first input (x1 in FIG. 3).
one (x2): [0,1,0,0]; x2 indicates the word "one" as the second input (x2 in FIG. 3).
bear (x3): [0,0,1,0]; x3 indicates the word "bear" as the third input.
playing (x4): [0,0,0,1]; x4 indicates the word "playing" as the fourth input.
When training the doc2vec model, x1, x2 and x4 can be used as the input of the model to be trained, i.e. the input layer in fig. 3 receives x1, x2 and x4, while x3 serves as the correction reference.
During training, the result of the hidden layer is calculated through a training parameter matrix w of the doc2vec model; the matrix w is depicted in fig. 3 as the box between the input layer and the hidden layer. In one possible approach, w is a 4xN matrix whose rows are the word vectors (the concrete example values appear only in the original figure and are not reproduced here).
The hidden-layer result v2 is calculated from x2 and the matrix w as follows (since x2 is one-hot, this picks out the row of w for the second word):

v2 = w^T x2
Similarly, x1 and x4 are multiplied with the matrix w to obtain the hidden-layer results v1 and v4, and the hidden layer h is determined by averaging v1, v2 and v4:

h = (v1 + v2 + v4) / 3
Then a matrix o between the hidden layer and the output layer is established; the hidden layer is multiplied by o, and softmax converts the result into probabilities. For example, if the output layer yields [0.23, 0.03, 0.62, 0.12], the third value, 0.62, is the largest, so the output is close to the true expectation [0,0,1,0]. Parameter optimization is then performed according to the output of the output layer and the true expectation for x3, for example by adjusting the matrices, until an optimal model is obtained and the training of the doc2vec model is complete.
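The forward pass described above can be sketched in a few lines of numpy; the vocabulary size of 4, the dimension N, and the random initial values of w and o are illustrative assumptions (the original figure's concrete numbers are not reproduced):

    import numpy as np

    rng = np.random.default_rng(0)
    V, N = 4, 30                  # vocabulary size and word-vector dimension
    w = rng.normal(size=(V, N))   # input-to-hidden training parameter matrix
    o = rng.normal(size=(N, V))   # hidden-to-output matrix

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    eye = np.eye(V)
    x1, x2, x4 = eye[0], eye[1], eye[3]   # one-hot context words

    # Hidden layer: average the projections v1, v2, v4 of the context words.
    h = (w.T @ x1 + w.T @ x2 + w.T @ x4) / 3

    # Output layer: probabilities over the vocabulary; training adjusts w and o
    # so this distribution approaches the true expectation for x3, [0,0,1,0].
    p = softmax(o.T @ h)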
After the doc2vec model is trained, the word segmentation data of each word is used as the input of the text vector calculation model (the trained doc2vec model), which generates the vector to be evaluated for the voice data to be evaluated. Because the text vector calculation model has been trained and its parameters optimized for the young-child use scenario, the vectors to be evaluated that it generates are more accurate.
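For reference, the same train-then-infer flow with the gensim library's Doc2Vec looks roughly as follows; the toy corpus, the vector size of 100 (within the 30-200 range suggested above), and the other parameters are assumptions:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Assumed corpus of segmented, stop-word-free training texts.
    corpus = [
        TaggedDocument(words=["图片", "一只", "熊", "玩耍"], tags=[0]),
        TaggedDocument(words=["小熊", "草地", "玩耍"], tags=[1]),
    ]

    model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

    # Infer the vector to be evaluated from the recognized, preprocessed text.
    vec_to_evaluate = model.infer_vector(["图片", "一只", "熊"])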
Step S202: and calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated, and enabling the calculated similarity to be a numerical value larger than 0.
The preset standard content vector is generated according to reference answer data. For example, the reference answer data is converted into a corresponding standard content vector through a text vectorization model, such as the aforementioned Word2vec model, and the standard content vector is pre-stored in the computer device and/or the server.
The similarity calculated by the similarity calculation model can be regarded as the similarity between the reference answer data and the text data corresponding to the speech data to be evaluated, so that the speech data to be evaluated is scored according to the similarity. In this embodiment, the similarity calculation model is configured to calculate the similarity according to a cosine value between a preset standard content vector and a vector to be evaluated, and make the calculated similarity a value greater than 0.
The similarity calculation model determines the similarity from the cosine value of the two vectors and makes the calculated similarity a value greater than 0. This determines the similarity accurately and efficiently, avoids the loss of accuracy that homophones with different written forms cause in existing methods that determine similarity through text keywords, and avoids the inaccurate evaluation results that arise when the cosine value, and hence the calculated similarity, is negative.
In one feasible mode, a preset standard content vector and a vector to be evaluated are used as input of a similarity calculation model, and the similarity is calculated through the similarity calculation model.
Optionally, to improve evaluation accuracy, the similarity is a value greater than 0 and less than or equal to 1. This avoids the negative values that arise when the cosine itself indicates the similarity, which do not reflect expression ability accurately, and makes subsequent scoring based on the similarity simpler and more convenient.
Optionally, the similarity calculation model includes:
score = (cos(x_i, x_j) + 1) / 2
wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity. The similarity calculated by this model varies linearly with the cosine value: the larger the cosine value, the closer the converted similarity is to 1.0, and the smaller the cosine value, the closer it is to 0.0. The similarity therefore ranges over 0.0-1.0 and can conveniently be used to generate subsequent evaluation result data, such as evaluation scores.
Of course, in other embodiments the similarity calculation model may instead be score = e^cos(x_i, x_j), wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity.
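Both variants, as reconstructed above, can be sketched with numpy; the input vectors are whatever the text vector calculation model produces, and nothing here beyond the two formulas comes from the patent:

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def similarity_linear(x_i, x_j):
        # Maps the cosine in [-1, 1] linearly onto [0, 1].
        return (cosine(x_i, x_j) + 1.0) / 2.0

    def similarity_exp(x_i, x_j):
        # Alternative model: e^cos is always a positive value.
        return float(np.exp(cosine(x_i, x_j)))

    # Hypothetical usage with a vector to be evaluated and a standard content vector:
    # score = similarity_linear(vec_to_evaluate, standard_content_vec)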
Step S203: and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
Wherein the evaluation result data is used for representing the expression ability level of the user. For example, in an expression training session, the evaluation result data may characterize the expression level of the user. Depending on the level of expressive power to be characterized, different evaluation parameters may be included in the evaluation result data. For example, in the aforementioned expression training course, the evaluation result data includes semantic score, dynamics score, tone score, and the like.
In a feasible manner, the preset voice scoring rule includes a rule that the score in the evaluation result data is positively correlated with the similarity: the higher the similarity, the higher the score and the better the indicated expression ability.
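A minimal sketch of such a positively correlated scoring rule follows; the 100-point scale and the rounding are illustrative assumptions, not prescribed by the patent:

    def semantic_score(similarity):
        # Positively correlated rule: the higher the similarity, the higher the score.
        return round(similarity * 100)

    # e.g. a similarity of 0.87 yields a semantic score of 87 in the evaluation result data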
The speech evaluation method can be applied to an expression training course to evaluate speech input by a user: text data corresponding to the speech data to be evaluated is converted into a vector to be evaluated, and the similarity between this vector and a preset standard content vector is evaluated to obtain evaluation result data. The evaluation result data represents the user's expression ability, so the user can clearly and conveniently know his or her expression level and is stimulated to continue learning and making progress.
When evaluating speech data, the method converts the possibly inconsistent text recognized from the speech into numerical vectors through semantic understanding to reduce errors, and uses a similarity calculation model to compute the similarity from the cosine value between the vector to be evaluated and a preset standard content vector, making the calculated similarity a value greater than 0. This addresses the low evaluation accuracy of existing similarity calculation methods in speech evaluation and, by eliminating negative similarity values, makes the evaluation result more accurate.
The speech evaluation method of the embodiment can be implemented by any suitable device with a data processing function, including: various terminal devices, servers, and the like.
EXAMPLE III
According to an embodiment of the present invention, there is provided a computer storage medium storing: the instruction is used for generating a to-be-evaluated vector of the to-be-evaluated voice data according to the text data corresponding to the to-be-evaluated voice data; the command is used for calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated, and the calculated similarity is a numerical value larger than 0; and the instruction is used for generating and outputting the evaluation result data of the voice data to be evaluated according to the similarity and the preset voice scoring rule.
Optionally, the instruction for calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and the similarity calculation model includes: and the instruction is used for calculating the similarity through the similarity calculation model by taking the preset standard content vector and the vector to be evaluated as the input of the similarity calculation model.
Alternatively, the similarity is a numerical value greater than 0 and equal to or less than 1.
Optionally, the similarity calculation model includes:
score = (cos(x_i, x_j) + 1) / 2

wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity.
Optionally, the instruction for generating a to-be-evaluated vector of the to-be-evaluated voice data according to the text data corresponding to the to-be-evaluated voice data includes: the instruction is used for identifying the voice data to be evaluated by using the voice recognition model and generating text data corresponding to the voice data to be evaluated; and the instruction is used for vectorizing the text data through the text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated voice data.
Optionally, the instruction for recognizing the speech data to be evaluated by using the speech recognition model and generating text data corresponding to the speech data to be evaluated includes: the instruction is used for acquiring the voice data to be evaluated; the instruction is used for transcoding the voice data to be evaluated and generating transcoded voice data to be evaluated; and the instruction is used for taking the transcoded voice data to be evaluated as the input of the voice recognition model and generating text data corresponding to the voice data to be evaluated through the voice recognition model.
Optionally, the instruction for performing vectorization processing on the text data through the text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated speech data includes: instructions for preprocessing the text data and generating result data according to a preprocessing result, wherein the result data includes participle data for indicating a plurality of participles in the text data; and the command is used for taking the word segmentation data of each word segmentation as the input of the text vector calculation model and generating the to-be-evaluated vector of the to-be-evaluated voice data through the text vector calculation model.
Optionally, the preprocessing comprises dirty data removal processing, word segmentation processing and stop word removal processing; the instruction for preprocessing the text data and generating result data according to the preprocessing result comprises the following steps: instructions for performing dirty data removal processing on the text data and obtaining valid text data; the instruction is used for carrying out word segmentation processing on the effective text data and obtaining word segmentation data of a plurality of words in the effective text data; and the instruction is used for removing stop words from the participle data of the participles to obtain result data.
When evaluating speech data to be evaluated, the instructions stored in the computer storage medium convert the possibly inconsistent text recognized from speech into numerical vectors through semantic understanding to reduce errors, and use a similarity calculation model to compute the similarity from the cosine value between the vector to be evaluated and a preset standard content vector, making the calculated similarity a value greater than 0. This addresses the low evaluation accuracy of existing similarity calculation methods and, by eliminating negative similarity values, makes the evaluation result more accurate.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product stored on a computer-readable storage medium, which includes any mechanism for storing or transmitting information in a form readable by a machine such as a computer. For example, a machine-readable medium includes read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory storage media, and electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals). The computer software product includes instructions for causing a computing device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device), or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (9)

1. A speech evaluation method, comprising:
acquiring, as voice data to be evaluated, the user's spoken description of the content of displayed pictures and/or videos;
generating a to-be-evaluated vector of the to-be-evaluated voice data according to text data corresponding to the to-be-evaluated voice data;
calculating the similarity between a preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated, and enabling the calculated similarity to be a numerical value which is greater than 0 and less than or equal to 1, and the similarity calculation model comprises:
score = (cos(x_i, x_j) + 1) / 2

wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity;
and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
2. The method according to claim 1, wherein calculating the similarity between the preset standard content vector and the vector to be evaluated according to a preset standard content vector, the vector to be evaluated and a similarity calculation model comprises:
and taking a preset standard content vector and the vector to be evaluated as the input of the similarity calculation model, and calculating the similarity through the similarity calculation model.
3. The method according to claim 1, wherein the similarity is a numerical value greater than 0 and equal to or less than 1.
4. The method according to any one of claims 1 to 3, wherein the generating a to-be-evaluated vector of the to-be-evaluated voice data according to text data corresponding to the to-be-evaluated voice data comprises:
using a voice recognition model to recognize the voice data to be evaluated, and generating the text data corresponding to the voice data to be evaluated;
and vectorizing the text data through a text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated voice data.
5. The method according to claim 4, wherein the recognizing the speech data to be evaluated by using the speech recognition model to generate the text data corresponding to the speech data to be evaluated comprises:
acquiring the voice data to be evaluated;
transcoding the voice data to be evaluated, and generating transcoded voice data to be evaluated;
and taking the transcoded voice data to be evaluated as the input of the voice recognition model, and generating the text data corresponding to the voice data to be evaluated through the voice recognition model.
6. The method according to claim 4, wherein the vectorizing the text data through a text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated speech data includes:
preprocessing the text data and generating result data according to a preprocessing result, wherein the result data comprise word segmentation data used for indicating a plurality of word segmentations in the text data;
and taking the word segmentation data of each word segmentation as the input of a text vector calculation model, and generating a to-be-evaluated vector of the to-be-evaluated voice data through the text vector calculation model.
7. The method of claim 6, wherein the pre-processing comprises removing dirty data processing, participle processing, and removing stop word processing;
the preprocessing the text data and generating result data according to a preprocessing result comprises the following steps:
performing dirty data removal processing on the text data, and obtaining effective text data;
performing word segmentation processing on the effective text data, and obtaining word segmentation data of a plurality of words in the effective text data;
and performing stop word removal processing on the word segmentation data of the plurality of word segmentations to obtain the result data.
8. A computer storage medium, the computer storage medium having stored thereon: instructions for acquiring, as voice data to be evaluated, the user's spoken description of the content of displayed pictures and/or videos; instructions for generating a vector to be evaluated of the voice data to be evaluated according to text data corresponding to the voice data to be evaluated; instructions for calculating the similarity between a preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine value of the preset standard content vector and the vector to be evaluated and making the calculated similarity a value greater than 0 and less than or equal to 1, the similarity calculation model comprising:

score = (cos(x_i, x_j) + 1) / 2

wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity; and instructions for generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
9. The computer storage medium of claim 8, wherein the instructions for calculating the similarity between the preset standard content vector and the vector to be evaluated according to a preset standard content vector, the vector to be evaluated and a similarity calculation model comprise: and the instruction is used for calculating the similarity through the similarity calculation model by taking a preset standard content vector and the vector to be evaluated as the input of the similarity calculation model.
CN201810259445.XA 2018-03-27 2018-03-27 Voice evaluation method and computer storage medium Active CN110322895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810259445.XA CN110322895B (en) 2018-03-27 2018-03-27 Voice evaluation method and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810259445.XA CN110322895B (en) 2018-03-27 2018-03-27 Voice evaluation method and computer storage medium

Publications (2)

Publication Number Publication Date
CN110322895A CN110322895A (en) 2019-10-11
CN110322895B true CN110322895B (en) 2021-07-09

Family

ID=68109770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810259445.XA Active CN110322895B (en) 2018-03-27 2018-03-27 Voice evaluation method and computer storage medium

Country Status (1)

Country Link
CN (1) CN110322895B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827794B (en) * 2019-12-06 2022-06-07 科大讯飞股份有限公司 Method and device for evaluating quality of voice recognition intermediate result
CN111199750B (en) * 2019-12-18 2022-10-28 北京葡萄智学科技有限公司 Pronunciation evaluation method and device, electronic equipment and storage medium
CN111639219A (en) * 2020-05-12 2020-09-08 广东小天才科技有限公司 Method for acquiring spoken language evaluation sticker, terminal device and storage medium
CN112435512B (en) * 2020-11-12 2023-01-24 郑州大学 Voice behavior assessment and evaluation method for rail transit simulation training
CN112562736B (en) * 2020-12-11 2024-06-21 中国信息通信研究院 Voice data set quality assessment method and device

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008056570A1 (en) * 2006-11-09 2008-05-15 Panasonic Corporation Content search apparatus
CN103823896B (en) * 2014-03-13 2017-02-15 蚌埠医学院 Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm
US9542948B2 (en) * 2014-04-09 2017-01-10 Google Inc. Text-dependent speaker identification
CN104240706B (en) * 2014-09-12 2017-08-15 浙江大学 It is a kind of that the method for distinguishing speek person that similarity corrects score is matched based on GMM Token
CN104464757B (en) * 2014-10-28 2019-01-18 科大讯飞股份有限公司 Speech evaluating method and speech evaluating device
CN104505103B (en) * 2014-12-04 2018-07-03 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104867489B (en) * 2015-04-27 2019-04-26 苏州大学张家港工业技术研究院 A kind of simulation true man read aloud the method and system of pronunciation
CN104778161B (en) * 2015-04-30 2017-07-07 车智互联(北京)科技有限公司 Based on Word2Vec and Query log extracting keywords methods
CN105989849B (en) * 2015-06-03 2019-12-03 乐融致新电子科技(天津)有限公司 A kind of sound enhancement method, audio recognition method, clustering method and device
CN105261362B (en) * 2015-09-07 2019-07-05 科大讯飞股份有限公司 A kind of call voice monitoring method and system
CN105608180A (en) * 2015-12-22 2016-05-25 北京奇虎科技有限公司 Application recommendation method and system
CN105898713A (en) * 2016-06-17 2016-08-24 东华大学 WiFi fingerprint indoor positioning method based on weighted cosine similarity
CN106503805B (en) * 2016-11-14 2019-01-29 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis method
CN106776559B (en) * 2016-12-14 2020-08-11 东软集团股份有限公司 Text semantic similarity calculation method and device
CN106802956B (en) * 2017-01-19 2020-06-05 山东大学 Movie recommendation method based on weighted heterogeneous information network
CN106847288B (en) * 2017-02-17 2020-12-25 上海创米科技有限公司 Error correction method and device for voice recognition text
CN107316638A (en) * 2017-06-28 2017-11-03 北京粉笔未来科技有限公司 A kind of poem recites evaluating method and system, a kind of terminal and storage medium
CN107355342B (en) * 2017-06-30 2019-04-23 北京金风科创风电设备有限公司 The recognition methods of wind generating set pitch control exception and device
CN107346340A (en) * 2017-07-04 2017-11-14 北京奇艺世纪科技有限公司 A kind of user view recognition methods and system
CN107766426B (en) * 2017-09-14 2020-05-22 北京百分点信息科技有限公司 Text classification method and device and electronic equipment
CN107773982B (en) * 2017-10-20 2021-08-13 科大讯飞股份有限公司 Game voice interaction method and device
CN107729322B (en) * 2017-11-06 2021-01-12 广州杰赛科技股份有限公司 Word segmentation method and device and sentence vector generation model establishment method and device
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 A kind of text duplicate checking method and system based on similarity
CN110136721A (en) * 2019-04-09 2019-08-16 北京大米科技有限公司 A kind of scoring generation method, device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110322895A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110322895B (en) Voice evaluation method and computer storage medium
CN105741832B (en) Spoken language evaluation method and system based on deep learning
US10621975B2 (en) Machine training for native language and fluency identification
CN108766415B (en) Voice evaluation method
US20120221339A1 (en) Method, apparatus for synthesizing speech and acoustic model training method for speech synthesis
CN112966106A (en) Text emotion recognition method, device and equipment and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN112331180A (en) Spoken language evaluation method and device
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN112489688A (en) Neural network-based emotion recognition method, device and medium
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN117711444B (en) Interaction method, device, equipment and storage medium based on talent expression
CN113486970B (en) Reading capability evaluation method and device
CN113853651B (en) Apparatus and method for speech-emotion recognition with quantized emotion state
CN115132174A (en) Voice data processing method and device, computer equipment and storage medium
US11615787B2 (en) Dialogue system and method of controlling the same
Coto‐Solano Computational sociophonetics using automatic speech recognition
JP6082657B2 (en) Pose assignment model selection device, pose assignment device, method and program thereof
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN116597809A (en) Multi-tone word disambiguation method, device, electronic equipment and readable storage medium
Huang et al. English mispronunciation detection based on improved GOP methods for Chinese students
Knill et al. Use of graphemic lexicons for spoken language assessment
US9928754B2 (en) Systems and methods for generating recitation items
CN115116474A (en) Spoken language scoring model training method, scoring method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant