CN110322895B - Voice evaluation method and computer storage medium - Google Patents

Voice evaluation method and computer storage medium

Info

Publication number
CN110322895B
CN110322895B (application CN201810259445.XA)
Authority
CN
China
Prior art keywords
evaluated
vector
data
similarity
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810259445.XA
Other languages
Chinese (zh)
Other versions
CN110322895A (en)
Inventor
吴介圣 (Wu Jiesheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yidu Huida Education Technology Co ltd
Original Assignee
Beijing Yidu Huida Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yidu Huida Education Technology Co ltd filed Critical Beijing Yidu Huida Education Technology Co ltd
Priority to CN201810259445.XA priority Critical patent/CN110322895B/en
Publication of CN110322895A publication Critical patent/CN110322895A/en
Application granted granted Critical
Publication of CN110322895B publication Critical patent/CN110322895B/en
Legal status: Active (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a voice evaluation method and a computer storage medium. The voice evaluation method comprises the following steps: generating a vector to be evaluated of the voice data to be evaluated according to text data corresponding to the voice data to be evaluated; calculating the similarity between a preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model calculates the similarity from the cosine value of the two vectors and makes the calculated similarity a value greater than 0; and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule. The voice evaluation method can evaluate learning achievement in an expression training course.

Description

Voice evaluation method and computer storage medium
Technical Field
Embodiments of the present invention relate to the field of computer technology, and in particular to a voice evaluation method and a computer storage medium.
Background
With the development of computer and internet technologies, learning and teaching by means of computers and the internet has become a trend. Through the computer and the internet, students can study anytime and anywhere, without being limited by environmental factors such as location and class size. This is especially valuable for the education of young children, where computer- and internet-based instruction fills a gap left by existing approaches.
Taking the language expression training of children aged 3-8 by means of computers and the internet as an example, the existing training process is as follows: a group of interesting pictures is displayed to the student through a computer or mobile terminal device, and the student practices expression by describing the content of the pictures.
The existing expression training process has no feedback or judgment mechanism, so students cannot gauge how much they have improved after finishing training and learn little about their progress, which does not motivate them to continue learning and improving.
Disclosure of Invention
In view of this, embodiments of the present invention provide a speech evaluation method and a computer storage medium to solve the problem that a student's learning outcome cannot be evaluated in existing expression training courses.
According to a first aspect of the embodiments of the present invention, there is provided a speech evaluation method, including: generating a to-be-evaluated vector of the to-be-evaluated voice data according to text data corresponding to the to-be-evaluated voice data; calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated and enabling the calculated similarity to be a numerical value larger than 0; and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
According to a second aspect of embodiments of the present invention, there is provided a computer storage medium storing: instructions for generating a vector to be evaluated of the voice data to be evaluated according to the text data corresponding to the voice data to be evaluated; instructions for calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine value of the preset standard content vector and the vector to be evaluated and making the calculated similarity a value greater than 0; and instructions for generating and outputting the evaluation result data of the voice data to be evaluated according to the similarity and the preset voice scoring rule.
According to the scheme provided by the embodiments of the invention, the voice evaluation method can be applied to an expression training course to evaluate the voice input by a user: text data corresponding to the voice data to be evaluated is converted into a vector to be evaluated, and the similarity between this vector and a preset standard content vector is evaluated to obtain evaluation result data. The evaluation result data thus represents the user's expression ability, so the user can clearly and conveniently know his or her expression level and is stimulated to continue learning and making progress.
When evaluating voice data, the method converts the corresponding text data into a vector to be evaluated and uses a similarity calculation model to compute the similarity from the cosine value between that vector and a preset standard content vector, making the calculated similarity a value greater than 0. This addresses the low evaluation accuracy of existing similarity calculation methods in speech evaluation and, by eliminating negative similarity values, makes the evaluation result more accurate.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person skilled in the art can obtain other drawings based on them.
FIG. 1 is a flow chart illustrating steps of a speech evaluation method according to a first embodiment of the present invention;
FIG. 2 is a flow chart illustrating steps of a speech evaluation method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a doc2vec model used in the speech evaluation method in the embodiment shown in fig. 2.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention shall fall within the scope of the protection of the embodiments of the present invention.
Example one
Referring to fig. 1, a flowchart illustrating steps of a speech evaluation method according to a first embodiment of the present invention is shown.
The voice evaluation method of the embodiment comprises the following steps:
step S101: and generating a vector to be evaluated of the voice data to be evaluated according to the text data corresponding to the voice data to be evaluated.
The text data corresponding to the speech data to be evaluated can be obtained in any appropriate manner, for example by speech recognition: the speech data to be evaluated is recognized by a speech recognition model or algorithm, and the corresponding text data is generated. This converts the speech to be evaluated into text automatically, with high conversion efficiency and low labor intensity.
Of course, in other embodiments, the speech data to be evaluated may be manually converted into corresponding text data by manual transcription.
Similarly, the vector to be evaluated corresponding to the voice data to be evaluated can be obtained in any appropriate manner, for example by generating it from the corresponding text data with a deep learning model such as a Word2vec or doc2vec model. These models convert text data into vectors at the semantic level, reflect the semantics of the speech data well, and thus greatly help ensure the accuracy of the subsequent semantic-level speech evaluation. In addition, converting the recognized text into a vector and evaluating on that basis mitigates the problem that similar-sounding or identical-sounding words are recognized as the wrong characters, which would otherwise reduce the accuracy of the speech evaluation.
Of course, in other embodiments, the text data may also be converted into the corresponding vector to be evaluated through a one-hot representation, latent semantic analysis (LSA), or the like.
Step S102: and calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated and enabling the calculated similarity to be a numerical value larger than 0.
The preset standard content vector is generated according to reference answer data. For example, the reference answer data is converted into a corresponding standard content vector through a text vectorization model, such as the aforementioned Word2vec model, and the standard content vector is pre-stored in the computer device and/or the server.
Of course, in other embodiments, the reference answer data may be preset in the computer device and/or the server, and converted into the standard content vector if necessary.
The similarity calculation model calculates the similarity between the preset standard content vector and the vector to be evaluated; this similarity represents how close the reference answer data (corresponding to the preset standard content vector) is to the text data (corresponding to the vector to be evaluated), and the speech data to be evaluated is scored accordingly. In this embodiment, the similarity calculation model calculates the similarity from the cosine value between the preset standard content vector and the vector to be evaluated and makes the calculated similarity a value greater than 0.
The similarity calculation model determines the similarity from the cosine value of the two vectors and makes the calculated similarity a value greater than 0. This determines the similarity accurately and efficiently, avoids the loss of accuracy that homophones with different written forms cause in existing methods that determine similarity through text keywords, and avoids the inaccurate evaluation results that arise when the cosine value, and hence the calculated similarity, is negative.
Step S103: and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
Wherein the evaluation result data is used for representing the expression ability level of the user. For example, in an expression training session, the evaluation result data may characterize the expression level of the user. Depending on the level of expressive power to be characterized, different evaluation parameters may be included in the evaluation result data. For example, in the aforementioned expression training course, the evaluation result data includes semantic score, dynamics score, tone score, and the like.
In a feasible manner, the preset voice scoring rule includes a rule that the score in the evaluation result data is positively correlated with the similarity: the higher the similarity, the higher the score and the better the indicated expression ability.
The speech evaluation method can be applied to an expression training course to evaluate speech input by a user: the speech data to be evaluated is converted into a vector to be evaluated, and the similarity between this vector and a preset standard content vector is evaluated to obtain evaluation result data. The evaluation result data represents the user's expression ability, so the user can clearly and conveniently know his or her expression level and is stimulated to continue learning and making progress.
When evaluating voice data, the method converts the corresponding text data into a vector to be evaluated and uses a similarity calculation model to compute the similarity from the cosine value between that vector and a preset standard content vector, making the calculated similarity a value greater than 0. This addresses the low evaluation accuracy of existing similarity calculation methods in speech evaluation and, by eliminating negative similarity values, makes the evaluation result more accurate.
The speech evaluation method of the embodiment can be implemented by any suitable device with a data processing function, including: various terminal devices, servers, and the like.
Example two
Referring to fig. 2, a flowchart illustrating steps of a speech evaluation method according to a second embodiment of the present invention is shown.
In the present embodiment, the speech evaluation method is described as applied to an expression training course, particularly one for young children (for example, children aged 3 to 8 years). Of course, in other embodiments the speech evaluation method may be applied to any other suitable scenario, for example a training evaluation scenario for artificial intelligence devices, which this embodiment does not limit.
The voice evaluation method of the embodiment comprises the following steps:
step S201: and generating a vector to be evaluated of the voice data to be evaluated according to the text data corresponding to the voice data to be evaluated.
In the expression training course, a group of pictures and/or videos can be displayed to the user, who describes their content in language, thereby training expression ability. To let users grasp their learning situation more accurately, the voice data formed by the user's spoken language can be evaluated, so that the user knows the learning situation more intuitively and clearly, is urged to continue learning, and becomes more enthusiastic about it.
When evaluating the user's voice data so as to reflect expression ability, a suitable parameter can be used to represent that ability. For example, a picture is displayed to the user, and the degree to which the semantics of the user's voice data match the content of the displayed picture is judged by calculating the similarity between the text data corresponding to the voice data and the reference answer data corresponding to the picture.
In one possible way, when performing speech assessment, step S201 includes the following sub-steps:
substep 1: and recognizing the voice data to be evaluated by using the voice recognition model to generate text data corresponding to the voice data to be evaluated.
Optionally, the voice data to be evaluated may be obtained first; transcoding the voice data to be evaluated and generating transcoded voice data to be evaluated; and taking the transcoded voice data to be evaluated as the input of a voice recognition model, and generating text data corresponding to the voice data to be evaluated through the voice recognition model.
The speech data to be evaluated can be the user's voice collected by a recording device, a recording input by the user, or user speech data extracted from a database.
After the voice data to be evaluated is obtained, if the voice data to be evaluated meets the format required by the voice recognition model, the voice data to be evaluated can be directly used as the input of the voice recognition model, and the text data corresponding to the voice data to be evaluated is generated through the voice recognition model.
If the voice data to be evaluated does not meet the format required by the voice recognition model, it is transcoded to generate transcoded voice data to be evaluated. For example, suppose the speech data to be evaluated is in mp3 format with a sampling rate of 44100 Hz and 2 audio channels, and this does not conform to the input format required by the speech recognition model. The speech data is then transcoded, which may be done in any appropriate manner, for example with ffmpeg; the converted speech format is: wav, sampling rate 16000 Hz, 2 audio channels, 16 bit.
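As an illustration only, the following is a minimal Python sketch of such a transcoding step, calling ffmpeg through the standard subprocess module; the file names are hypothetical, and ffmpeg is assumed to be installed:

    import subprocess

    def transcode_for_asr(src_path, dst_path):
        # Transcode to the format assumed above for the speech recognition
        # model: wav, 16000 Hz sampling rate, 2 audio channels, 16 bit.
        subprocess.run(
            ["ffmpeg", "-y",          # overwrite the output file if present
             "-i", src_path,          # e.g. an mp3 at 44100 Hz, 2 channels
             "-ar", "16000",          # resample to 16000 Hz
             "-ac", "2",              # keep 2 audio channels
             "-sample_fmt", "s16",    # 16-bit samples
             dst_path],               # e.g. "recording.wav"
            check=True)

    # Hypothetical usage: transcode_for_asr("recording.mp3", "recording.wav")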
And the transcoded voice data to be evaluated is used as the input of a voice recognition model, and text data corresponding to the voice data to be evaluated is generated through the voice recognition model.
Those skilled in the art can select an appropriate speech recognition model as needed to recognize the speech data to be evaluated, which this embodiment does not limit. For example, the speech recognition model may be based on an HMM (Hidden Markov Model) and an N-gram language model, or an existing speech recognition tool may be called directly.
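As a sketch of the "existing speech recognition tool" option, the open-source SpeechRecognition package can be called as below; the package choice, the free Google web API, and the file name are illustrative assumptions, not part of the patent:

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("recording.wav") as source:   # transcoded wav from above
        audio = recognizer.record(source)           # read the whole file

    # Recognize Mandarin speech; returns the text data to be vectorized.
    text_data = recognizer.recognize_google(audio, language="zh-CN")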
It is inevitable in speech recognition that syllables with the same or similar pronunciation are recognized as different characters; in Mandarin, for example, a single syllable such as ma can correspond to several distinct characters ('mother', 'hemp', 'horse'). After the speech data to be evaluated is recognized and the text data is generated, the text data may therefore contain characters whose pronunciation matches the speech but whose written form is wrong. For the voice data of young children in particular, unclear articulation, pauses, and similar issues inevitably reduce the accuracy of speech recognition.
If the recognized text data and the reference answer data are fed directly into an existing text similarity calculation model, the inaccuracy of the recognized characters makes the similarity calculation inaccurate and degrades the accuracy of the subsequent speech evaluation. This is because existing similarity calculation models such as TF-IDF (term frequency-inverse document frequency), LSI (latent semantic indexing), and LDA (latent Dirichlet allocation) all compute similarity by keyword matching, and the accuracy of keyword matching depends on the accuracy of the recognized characters.
To overcome these defects and improve speech recognition accuracy, a speech recognition model can be trained on young children's speech using a machine learning model, and the trained model used to recognize the speech data to be evaluated, yielding more accurate text data.
Substep 2: and vectorizing the text data through a text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated voice data.
The text vector calculation model is used for converting the text data into corresponding vectors to be evaluated so as to perform subsequent text similarity calculation.
Optionally, a manner of implementing this step may include:
preprocessing the text data, and generating result data according to a preprocessing result, wherein the result data comprises word segmentation data used for indicating a plurality of words in the text data; and taking the word segmentation data of each word segmentation as the input of a text vector calculation model, and generating a to-be-evaluated vector of the to-be-evaluated voice data through the text vector calculation model.
In one possible approach, the pre-processing includes dirty data removal processing, word segmentation processing, and stop word removal processing.
Based on this, preprocessing the text data, and generating result data according to the preprocessing result includes:
pretreatment step 1: and performing dirty data removal processing on the text data, and obtaining effective text data.
Dirty data includes text that is empty and text that carries too little useful information for training, for example text whose word count is below a preset threshold or whose sentences lack both subject and predicate.
For example, text 1 consists only of punctuation (",.") and is empty data, and text 2, "There is a little bear", carries very little useful information; both are dirty data. Dirty data may be removed in any suitable manner, for example using existing dirty-data removal methods.
A pretreatment step 2: and performing word segmentation processing on the effective text data, and obtaining word segmentation data of a plurality of words in the effective text data.
The word segmentation process may be performed in any suitable manner, for example, using a Hidden Markov Model (HMM) based machine learning word segmentation Model.
As "text 3: for example, there is a bear on the picture, and after word segmentation processing, word segmentation data is as follows: picture/upper/present/one/bear.
A pretreatment step 3: and performing stop word removal processing on the word segmentation data of the plurality of word segmentations to obtain result data.
A stop word is a word that is removed during information and/or text processing in order to save storage space and improve processing efficiency. Stop words may be determined as needed; they are typically auxiliary or modal particles, but may also be other words, such as the "present" (there is) and "upper" (on) in the example above.
Taking the word segmentation data of text 3 as an example, after removing the stop words the result is: picture / one / bear.
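Putting the three preprocessing steps together, a minimal sketch is shown below; it assumes the jieba segmentation library, a toy stop-word set, and a toy dirty-data threshold, none of which the patent prescribes:

    import jieba  # third-party Chinese word segmentation library

    STOP_WORDS = {"上", "有", "的", "地", "啊"}   # hypothetical stop-word set
    MIN_WORDS = 3   # assumed threshold below which text counts as dirty data

    def preprocess(text):
        # Word segmentation, dropping whitespace and punctuation tokens.
        tokens = [t for t in jieba.lcut(text) if t.strip() and t not in "，。,."]
        if len(tokens) < MIN_WORDS:
            return None   # dirty data: empty or too little useful information
        # Stop-word removal yields the result data.
        return [t for t in tokens if t not in STOP_WORDS]

    # e.g. preprocess("图片上有一只熊") -> ["图片", "一只", "熊"] (picture / one / bear)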
And after the result data is obtained, the word segmentation data of each word segmentation is used as the input of a text vector calculation model, and a to-be-evaluated vector of the to-be-evaluated voice data is generated through the text vector calculation model.
Wherein, the text vector calculation model may be a deep learning model, such as: word2vec model, doc2vec model, etc. In this embodiment, a text vector calculation model is taken as doc2vec for example, a structure diagram of the doc2vec model is shown in fig. 3, and the doc2vec model includes an input layer (input layer), a hidden layer (hidden layer), and an output layer (output layer). The input layer is used for acquiring training sample data; the hidden layer is used for vectorizing the training sample data; the output layer is used for outputting the result.
To better adapt to the voice data of young children, the doc2vec model can be trained on young children's voice data, so that the trained model generates better vectors to be evaluated for the voice data to be evaluated.
The training of the doc2vec model may be implemented by a conventional training method, for example, by the following training process:
First, each word in each piece of training text data is initialized to an N-dimensional vector, where N may be set to an appropriate value as needed; preferably N ranges from 30 to 200 and can be adjusted as required.
For example, the word segmentation data of the text data is: {picture, one, bear, playing}; with the word initialization vector dimension set to 4, the initialized vector of each word is:
picture (x1): [1,0,0,0]; x1 indicates the word "picture" as the first input (x1 in FIG. 3).
one (x2): [0,1,0,0]; x2 indicates the word "one" as the second input (x2 in FIG. 3).
bear (x3): [0,0,1,0]; x3 indicates the word "bear" as the third input.
playing (x4): [0,0,0,1]; x4 indicates the word "playing" as the fourth input.
When training the doc2vec model, x1, x2 and x4 can be used as the input of the model to be trained, i.e. the input layer in fig. 3 receives x1, x2 and x4, while x3 serves as the correction reference.
During training, the result of the hidden layer is calculated through a training parameter matrix w of the doc2vec model; the matrix w is depicted in fig. 3 as the box between the input layer and the hidden layer. In one possible approach, w is a 4xN matrix whose rows are the word vectors (the concrete example values appear only in the original figure and are not reproduced here).
The hidden-layer result v2 is calculated from x2 and the matrix w as follows (since x2 is one-hot, this picks out the row of w for the second word):

v2 = w^T x2
Similarly, x1 and x4 are multiplied with the matrix w to obtain the hidden-layer results v1 and v4, and the hidden layer h is determined by averaging v1, v2 and v4:

h = (v1 + v2 + v4) / 3
Then a matrix o between the hidden layer and the output layer is established; the hidden layer is multiplied by o, and softmax converts the result into probabilities. For example, if the output layer yields [0.23, 0.03, 0.62, 0.12], the third value, 0.62, is the largest, so the output is close to the true expectation [0,0,1,0]. Parameter optimization is then performed according to the output of the output layer and the true expectation for x3, for example by adjusting the matrices, until an optimal model is obtained and the training of the doc2vec model is complete.
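The forward pass described above can be sketched in a few lines of numpy; the vocabulary size of 4, the dimension N, and the random initial values of w and o are illustrative assumptions (the original figure's concrete numbers are not reproduced):

    import numpy as np

    rng = np.random.default_rng(0)
    V, N = 4, 30                  # vocabulary size and word-vector dimension
    w = rng.normal(size=(V, N))   # input-to-hidden training parameter matrix
    o = rng.normal(size=(N, V))   # hidden-to-output matrix

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    eye = np.eye(V)
    x1, x2, x4 = eye[0], eye[1], eye[3]   # one-hot context words

    # Hidden layer: average the projections v1, v2, v4 of the context words.
    h = (w.T @ x1 + w.T @ x2 + w.T @ x4) / 3

    # Output layer: probabilities over the vocabulary; training adjusts w and o
    # so this distribution approaches the true expectation for x3, [0,0,1,0].
    p = softmax(o.T @ h)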
After the doc2vec model is trained, the word segmentation data of each word is used as the input of the text vector calculation model (the trained doc2vec model), which generates the vector to be evaluated for the voice data to be evaluated. Because the text vector calculation model has been trained and its parameters optimized for the young-child use scenario, the vectors to be evaluated that it generates are more accurate.
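For reference, the same train-then-infer flow with the gensim library's Doc2Vec looks roughly as follows; the toy corpus, the vector size of 100 (within the 30-200 range suggested above), and the other parameters are assumptions:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Assumed corpus of segmented, stop-word-free training texts.
    corpus = [
        TaggedDocument(words=["图片", "一只", "熊", "玩耍"], tags=[0]),
        TaggedDocument(words=["小熊", "草地", "玩耍"], tags=[1]),
    ]

    model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

    # Infer the vector to be evaluated from the recognized, preprocessed text.
    vec_to_evaluate = model.infer_vector(["图片", "一只", "熊"])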
Step S202: and calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated, and enabling the calculated similarity to be a numerical value larger than 0.
The preset standard content vector is generated according to reference answer data. For example, the reference answer data is converted into a corresponding standard content vector through a text vectorization model, such as the aforementioned Word2vec model, and the standard content vector is pre-stored in the computer device and/or the server.
The similarity calculated by the similarity calculation model can be regarded as the similarity between the reference answer data and the text data corresponding to the speech data to be evaluated, so that the speech data to be evaluated is scored according to the similarity. In this embodiment, the similarity calculation model is configured to calculate the similarity according to a cosine value between a preset standard content vector and a vector to be evaluated, and make the calculated similarity a value greater than 0.
The similarity calculation model determines the similarity from the cosine value of the two vectors and makes the calculated similarity a value greater than 0. This determines the similarity accurately and efficiently, avoids the loss of accuracy that homophones with different written forms cause in existing methods that determine similarity through text keywords, and avoids the inaccurate evaluation results that arise when the cosine value, and hence the calculated similarity, is negative.
In one feasible mode, a preset standard content vector and a vector to be evaluated are used as input of a similarity calculation model, and the similarity is calculated through the similarity calculation model.
Optionally, to improve evaluation accuracy, the similarity is a value greater than 0 and less than or equal to 1. This avoids the negative values that arise when the cosine itself indicates the similarity, which do not reflect expression ability accurately, and makes subsequent scoring based on the similarity simpler and more convenient.
Optionally, the similarity calculation model includes:
score = (cos(x_i, x_j) + 1) / 2
wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity. The similarity calculated by this model varies linearly with the cosine value: the larger the cosine value, the closer the converted similarity is to 1.0, and the smaller the cosine value, the closer it is to 0.0. The similarity therefore ranges over 0.0-1.0 and can conveniently be used to generate subsequent evaluation result data, such as evaluation scores.
Of course, in other embodiments the similarity calculation model may instead be score = e^cos(x_i, x_j), wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity.
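Both variants, as reconstructed above, can be sketched with numpy; the input vectors are whatever the text vector calculation model produces, and nothing here beyond the two formulas comes from the patent:

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def similarity_linear(x_i, x_j):
        # Maps the cosine in [-1, 1] linearly onto [0, 1].
        return (cosine(x_i, x_j) + 1.0) / 2.0

    def similarity_exp(x_i, x_j):
        # Alternative model: e^cos is always a positive value.
        return float(np.exp(cosine(x_i, x_j)))

    # Hypothetical usage with a vector to be evaluated and a standard content vector:
    # score = similarity_linear(vec_to_evaluate, standard_content_vec)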
Step S203: and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
Wherein the evaluation result data is used for representing the expression ability level of the user. For example, in an expression training session, the evaluation result data may characterize the expression level of the user. Depending on the level of expressive power to be characterized, different evaluation parameters may be included in the evaluation result data. For example, in the aforementioned expression training course, the evaluation result data includes semantic score, dynamics score, tone score, and the like.
In a feasible manner, the preset voice scoring rule includes a rule that the score in the evaluation result data is positively correlated with the similarity: the higher the similarity, the higher the score and the better the indicated expression ability.
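A minimal sketch of such a positively correlated scoring rule follows; the 100-point scale and the rounding are illustrative assumptions, not prescribed by the patent:

    def semantic_score(similarity):
        # Positively correlated rule: the higher the similarity, the higher the score.
        return round(similarity * 100)

    # e.g. a similarity of 0.87 yields a semantic score of 87 in the evaluation result data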
The speech evaluation method can be applied to an expression training course to evaluate speech input by a user: text data corresponding to the speech data to be evaluated is converted into a vector to be evaluated, and the similarity between this vector and a preset standard content vector is evaluated to obtain evaluation result data. The evaluation result data represents the user's expression ability, so the user can clearly and conveniently know his or her expression level and is stimulated to continue learning and making progress.
When evaluating speech data, the method converts the possibly inconsistent text recognized from the speech into numerical vectors through semantic understanding to reduce errors, and uses a similarity calculation model to compute the similarity from the cosine value between the vector to be evaluated and a preset standard content vector, making the calculated similarity a value greater than 0. This addresses the low evaluation accuracy of existing similarity calculation methods in speech evaluation and, by eliminating negative similarity values, makes the evaluation result more accurate.
The speech evaluation method of the embodiment can be implemented by any suitable device with a data processing function, including: various terminal devices, servers, and the like.
EXAMPLE III
According to an embodiment of the present invention, there is provided a computer storage medium storing: the instruction is used for generating a to-be-evaluated vector of the to-be-evaluated voice data according to the text data corresponding to the to-be-evaluated voice data; the command is used for calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated, and the calculated similarity is a numerical value larger than 0; and the instruction is used for generating and outputting the evaluation result data of the voice data to be evaluated according to the similarity and the preset voice scoring rule.
Optionally, the instruction for calculating the similarity between the preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and the similarity calculation model includes: and the instruction is used for calculating the similarity through the similarity calculation model by taking the preset standard content vector and the vector to be evaluated as the input of the similarity calculation model.
Alternatively, the similarity is a numerical value greater than 0 and equal to or less than 1.
Optionally, the similarity calculation model includes:
score = (cos(x_i, x_j) + 1) / 2

wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity.
Optionally, the instruction for generating a to-be-evaluated vector of the to-be-evaluated voice data according to the text data corresponding to the to-be-evaluated voice data includes: the instruction is used for identifying the voice data to be evaluated by using the voice recognition model and generating text data corresponding to the voice data to be evaluated; and the instruction is used for vectorizing the text data through the text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated voice data.
Optionally, the instruction for recognizing the speech data to be evaluated by using the speech recognition model and generating text data corresponding to the speech data to be evaluated includes: the instruction is used for acquiring the voice data to be evaluated; the instruction is used for transcoding the voice data to be evaluated and generating transcoded voice data to be evaluated; and the instruction is used for taking the transcoded voice data to be evaluated as the input of the voice recognition model and generating text data corresponding to the voice data to be evaluated through the voice recognition model.
Optionally, the instruction for performing vectorization processing on the text data through the text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated speech data includes: instructions for preprocessing the text data and generating result data according to a preprocessing result, wherein the result data includes participle data for indicating a plurality of participles in the text data; and the command is used for taking the word segmentation data of each word segmentation as the input of the text vector calculation model and generating the to-be-evaluated vector of the to-be-evaluated voice data through the text vector calculation model.
Optionally, the preprocessing comprises dirty data removal processing, word segmentation processing and stop word removal processing; the instruction for preprocessing the text data and generating result data according to the preprocessing result comprises the following steps: instructions for performing dirty data removal processing on the text data and obtaining valid text data; the instruction is used for carrying out word segmentation processing on the effective text data and obtaining word segmentation data of a plurality of words in the effective text data; and the instruction is used for removing stop words from the participle data of the participles to obtain result data.
When evaluating speech data to be evaluated, the instructions stored in the computer storage medium convert the possibly inconsistent text recognized from speech into numerical vectors through semantic understanding to reduce errors, and use a similarity calculation model to compute the similarity from the cosine value between the vector to be evaluated and a preset standard content vector, making the calculated similarity a value greater than 0. This addresses the low evaluation accuracy of existing similarity calculation methods and, by eliminating negative similarity values, makes the evaluation result more accurate.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product stored on a computer-readable storage medium, which includes any mechanism for storing or transmitting information in a form readable by a machine such as a computer. For example, a machine-readable medium includes read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory storage media, and electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals). The computer software product includes instructions for causing a computing device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device), or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (9)

1. A speech evaluation method, comprising:
acquiring, as voice data to be evaluated, the user's spoken description of the content of displayed pictures and/or videos;
generating a to-be-evaluated vector of the to-be-evaluated voice data according to text data corresponding to the to-be-evaluated voice data;
calculating the similarity between a preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine values of the preset standard content vector and the vector to be evaluated, and enabling the calculated similarity to be a numerical value which is greater than 0 and less than or equal to 1, and the similarity calculation model comprises:
score = (cos(x_i, x_j) + 1) / 2

wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity;
and generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
2. The method according to claim 1, wherein calculating the similarity between the preset standard content vector and the vector to be evaluated according to a preset standard content vector, the vector to be evaluated and a similarity calculation model comprises:
and taking a preset standard content vector and the vector to be evaluated as the input of the similarity calculation model, and calculating the similarity through the similarity calculation model.
3. The method according to claim 1, wherein the similarity is a numerical value greater than 0 and equal to or less than 1.
4. The method according to any one of claims 1 to 3, wherein the generating a to-be-evaluated vector of the to-be-evaluated voice data according to text data corresponding to the to-be-evaluated voice data comprises:
using a voice recognition model to recognize the voice data to be evaluated, and generating the text data corresponding to the voice data to be evaluated;
and vectorizing the text data through a text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated voice data.
5. The method according to claim 4, wherein the recognizing the speech data to be evaluated by using the speech recognition model to generate the text data corresponding to the speech data to be evaluated comprises:
acquiring the voice data to be evaluated;
transcoding the voice data to be evaluated, and generating transcoded voice data to be evaluated;
and taking the transcoded voice data to be evaluated as the input of the voice recognition model, and generating the text data corresponding to the voice data to be evaluated through the voice recognition model.
6. The method according to claim 4, wherein the vectorizing the text data through a text vector calculation model to generate a to-be-evaluated vector of the to-be-evaluated speech data includes:
preprocessing the text data and generating result data according to a preprocessing result, wherein the result data comprise word segmentation data used for indicating a plurality of word segmentations in the text data;
and taking the word segmentation data of each word segmentation as the input of a text vector calculation model, and generating a to-be-evaluated vector of the to-be-evaluated voice data through the text vector calculation model.
7. The method of claim 6, wherein the pre-processing comprises removing dirty data processing, participle processing, and removing stop word processing;
the preprocessing the text data and generating result data according to a preprocessing result comprises the following steps:
performing dirty data removal processing on the text data, and obtaining effective text data;
performing word segmentation processing on the effective text data, and obtaining word segmentation data of a plurality of words in the effective text data;
and performing stop word removal processing on the word segmentation data of the plurality of word segmentations to obtain the result data.
8. A computer storage medium, the computer storage medium having stored thereon: instructions for acquiring, as voice data to be evaluated, the user's spoken description of the content of displayed pictures and/or videos; instructions for generating a vector to be evaluated of the voice data to be evaluated according to text data corresponding to the voice data to be evaluated; instructions for calculating the similarity between a preset standard content vector and the vector to be evaluated according to the preset standard content vector, the vector to be evaluated and a similarity calculation model, wherein the similarity calculation model is used for calculating the similarity according to the cosine value of the preset standard content vector and the vector to be evaluated and making the calculated similarity a value greater than 0 and less than or equal to 1, the similarity calculation model comprising:

score = (cos(x_i, x_j) + 1) / 2

wherein x_i indicates the vector to be evaluated, x_j indicates the standard content vector corresponding to the vector to be evaluated, and score indicates the similarity; and instructions for generating and outputting evaluation result data of the voice data to be evaluated according to the similarity and a preset voice scoring rule.
9. The computer storage medium of claim 8, wherein the instructions for calculating the similarity between the preset standard content vector and the vector to be evaluated according to a preset standard content vector, the vector to be evaluated and a similarity calculation model comprise: and the instruction is used for calculating the similarity through the similarity calculation model by taking a preset standard content vector and the vector to be evaluated as the input of the similarity calculation model.
CN201810259445.XA 2018-03-27 2018-03-27 Voice evaluation method and computer storage medium Active CN110322895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810259445.XA CN110322895B (en) 2018-03-27 2018-03-27 Voice evaluation method and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810259445.XA CN110322895B (en) 2018-03-27 2018-03-27 Voice evaluation method and computer storage medium

Publications (2)

Publication Number Publication Date
CN110322895A CN110322895A (en) 2019-10-11
CN110322895B true CN110322895B (en) 2021-07-09

Family

ID=68109770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810259445.XA Active CN110322895B (en) 2018-03-27 2018-03-27 Voice evaluation method and computer storage medium

Country Status (1)

Country Link
CN (1) CN110322895B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827794B (en) * 2019-12-06 2022-06-07 科大讯飞股份有限公司 Method and device for evaluating quality of voice recognition intermediate result
CN111199750B (en) * 2019-12-18 2022-10-28 北京葡萄智学科技有限公司 Pronunciation evaluation method and device, electronic equipment and storage medium
CN111639219A (en) * 2020-05-12 2020-09-08 广东小天才科技有限公司 Method for acquiring spoken language evaluation sticker, terminal device and storage medium
CN112435512B (en) * 2020-11-12 2023-01-24 郑州大学 Voice behavior assessment and evaluation method for rail transit simulation training
CN112562736B (en) * 2020-12-11 2024-06-21 中国信息通信研究院 Voice data set quality assessment method and device

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008056570A1 (en) * 2006-11-09 2008-05-15 Panasonic Corporation Content search apparatus
CN103823896B (en) * 2014-03-13 2017-02-15 蚌埠医学院 Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm
US9542948B2 (en) * 2014-04-09 2017-01-10 Google Inc. Text-dependent speaker identification
CN104240706B (en) * 2014-09-12 2017-08-15 浙江大学 It is a kind of that the method for distinguishing speek person that similarity corrects score is matched based on GMM Token
CN104464757B (en) * 2014-10-28 2019-01-18 科大讯飞股份有限公司 Speech evaluating method and speech evaluating device
CN104505103B (en) * 2014-12-04 2018-07-03 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104867489B (en) * 2015-04-27 2019-04-26 苏州大学张家港工业技术研究院 A kind of simulation true man read aloud the method and system of pronunciation
CN104778161B (en) * 2015-04-30 2017-07-07 车智互联(北京)科技有限公司 Based on Word2Vec and Query log extracting keywords methods
CN105989849B (en) * 2015-06-03 2019-12-03 乐融致新电子科技(天津)有限公司 A kind of sound enhancement method, audio recognition method, clustering method and device
CN105261362B (en) * 2015-09-07 2019-07-05 科大讯飞股份有限公司 A kind of call voice monitoring method and system
CN105608180A (en) * 2015-12-22 2016-05-25 北京奇虎科技有限公司 Application recommendation method and system
CN105898713A (en) * 2016-06-17 2016-08-24 东华大学 WiFi fingerprint indoor positioning method based on weighted cosine similarity
CN106503805B (en) * 2016-11-14 2019-01-29 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis method
CN106776559B (en) * 2016-12-14 2020-08-11 东软集团股份有限公司 Text semantic similarity calculation method and device
CN106802956B (en) * 2017-01-19 2020-06-05 山东大学 Movie recommendation method based on weighted heterogeneous information network
CN106847288B (en) * 2017-02-17 2020-12-25 上海创米科技有限公司 Error correction method and device for voice recognition text
CN107316638A (en) * 2017-06-28 2017-11-03 北京粉笔未来科技有限公司 A kind of poem recites evaluating method and system, a kind of terminal and storage medium
CN107355342B (en) * 2017-06-30 2019-04-23 北京金风科创风电设备有限公司 The recognition methods of wind generating set pitch control exception and device
CN107346340A (en) * 2017-07-04 2017-11-14 北京奇艺世纪科技有限公司 A kind of user view recognition methods and system
CN107766426B (en) * 2017-09-14 2020-05-22 北京百分点信息科技有限公司 Text classification method and device and electronic equipment
CN107773982B (en) * 2017-10-20 2021-08-13 科大讯飞股份有限公司 Game voice interaction method and device
CN107729322B (en) * 2017-11-06 2021-01-12 广州杰赛科技股份有限公司 Word segmentation method and device and sentence vector generation model establishment method and device
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 A kind of text duplicate checking method and system based on similarity
CN110136721A (en) * 2019-04-09 2019-08-16 北京大米科技有限公司 A kind of scoring generation method, device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110322895A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110322895B (en) Voice evaluation method and computer storage medium
CN105741832B (en) Spoken language evaluation method and system based on deep learning
US10621975B2 (en) Machine training for native language and fluency identification
CN108766415B (en) Voice evaluation method
US20120221339A1 (en) Method, apparatus for synthesizing speech and acoustic model training method for speech synthesis
CN112966106A (en) Text emotion recognition method, device and equipment and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN112331180A (en) Spoken language evaluation method and device
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN112489688A (en) Neural network-based emotion recognition method, device and medium
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN117711444B (en) Interaction method, device, equipment and storage medium based on talent expression
CN113486970B (en) Reading capability evaluation method and device
CN113853651B (en) Apparatus and method for speech-emotion recognition with quantized emotion state
CN115132174A (en) Voice data processing method and device, computer equipment and storage medium
US11615787B2 (en) Dialogue system and method of controlling the same
Coto‐Solano Computational sociophonetics using automatic speech recognition
JP6082657B2 (en) Pose assignment model selection device, pose assignment device, method and program thereof
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN116597809A (en) Multi-tone word disambiguation method, device, electronic equipment and readable storage medium
Huang et al. English mispronunciation detection based on improved GOP methods for Chinese students
Knill et al. Use of graphemic lexicons for spoken language assessment
US9928754B2 (en) Systems and methods for generating recitation items
CN115116474A (en) Spoken language scoring model training method, scoring method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant