WO2023279631A1 - Speech manuscript evaluation method and device - Google Patents

Speech manuscript evaluation method and device

Info

Publication number
WO2023279631A1
WO2023279631A1 (PCT/CN2021/133041, CN2021133041W)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
preset
subsections
data
neural network
Prior art date
Application number
PCT/CN2021/133041
Other languages
French (fr)
Chinese (zh)
Inventor
张�林
王晔
李东朔
Original Assignee
北京优幕科技有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京优幕科技有限责任公司 filed Critical 北京优幕科技有限责任公司
Publication of WO2023279631A1 publication Critical patent/WO2023279631A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • The embodiments of the present invention relate to the technical field of information processing, and in particular to a method and device for evaluating speech manuscripts.
  • A speech is a way of speaking and communicating in a special setting, and professional speech training requires repeated practice. Scoring and evaluating in time during training to find deficiencies can speed up progress. Speech scoring covers facial expressions and posture, speech speed, and speech content.
  • The most common scoring method for speech content is to build a regression model: a massive data set is established to collect speech content across different score ranges, features are designed manually or extracted automatically by machine, the contribution of each feature to the score is computed, and effective features are extracted to establish the relationship between features and scores.
  • Training the regression model extracts features from the speech manuscript data set, establishes the relationship between features and scores, and stores it in the form of a weight matrix.
  • However, this method relies on a large amount of data, and the samples must cover every score range, topic, and so on; otherwise the scoring results become randomly distributed, harming the validity and fairness of the scoring.
  • In practice, only a few excellent samples exist at the start of speech training, and low- and medium-score samples are extremely scarce. The same is true of other speech data sets among open data resources, which retain only the best speeches and therefore cannot be learned from directly through transfer learning.
  • The embodiments of the present invention provide a speech manuscript evaluation method and device, whose main purpose is to solve the problems of traditional evaluation methods: large required sample data, few sample types, and poor validity and fairness of evaluation results.
  • the embodiments of the present invention mainly provide the following technical solutions:
  • An embodiment of the present invention provides a speech manuscript evaluation method, the method comprising: acquiring a plurality of speech manuscripts; dividing each manuscript into several subsections; and using a neural network model to identify all the subsections and a plurality of different preset questions.
  • The evaluation result of each speech manuscript is determined according to the ranking information of all the subsections for each of the preset questions.
  • Before using the neural network model to identify all the subsections and the multiple different preset questions, the method further includes:
  • Training the neural network model with the plurality of training data: the network outputs ranking information from a plurality of sample answers and a preset question, and the model parameters are optimized according to the difference between the output ranking information and the ranking information in the training data.
  • obtaining multiple training data specifically includes:
  • Contexts related to the preset question and the corresponding answer contents are crawled from several specified webpages; the ranking information is obtained according to the ranking of each answer content within the webpages.
  • Determining the evaluation result of each speech manuscript according to the ranking information of all the subsections for each preset question specifically includes:
  • obtaining, for each preset question, the highest rank achieved by the subsections belonging to the same speech manuscript; and
  • obtaining the evaluation result for that speech manuscript according to its highest ranks.
  • the ranking information corresponds to preset scores; the evaluation result is a score obtained according to each preset score.
  • The speech manuscript is text data obtained by performing speech recognition on a recording of the speech, and the lengths of pauses are recorded during speech recognition; in the step of dividing each manuscript into several subsections, the manuscript is divided according to its semantics and the pause lengths in the recorded speech.
  • the ranking information includes empty ranking and/or tied ranking information.
  • While the neural network model extracts feature data from the input data, an attention mechanism is used to process the text data from the subsections and the text data from the preset question, and the ranking information is output based on the resulting feature data.
  • An embodiment of the present invention provides a speech manuscript evaluation device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the above speech evaluation method.
  • an embodiment of the present invention provides a computer program product containing instructions, which, when run on a computer, cause the computer to execute the above method for evaluating a speech.
  • The task of evaluating an entire speech manuscript is converted into evaluating the recognition of question-answer pairs, so there is no need to provide a large number of speech manuscripts of varying quality as learning samples for the neural network. It suffices to preset questions related to the topic of the speech and prepare corresponding answers with different degrees of recognition; the neural network model can then be trained and used to evaluate multiple speech manuscripts. This solves the difficulty of evaluating speech manuscripts caused by the lack of samples in the prior art, and the scheme has high accuracy.
  • Fig. 1 shows a flowchart of a speech manuscript evaluation method in an embodiment of the present invention;
  • Fig. 2 shows a flowchart of another speech evaluation method in an embodiment of the present invention;
  • Fig. 3 shows a schematic diagram of the per-paragraph evaluation results obtained in an embodiment of the present invention;
  • Fig. 4 shows a schematic diagram of the working process of the neural network model in an embodiment of the present invention.
  • The present invention provides a speech manuscript evaluation method, which can be executed by electronic devices such as computers and servers. As shown in Figure 1, the method includes the following steps:
  • A subsection is a passage expressing a particular argument; it may be one natural paragraph or several. For example, if multiple natural paragraphs all discuss the advantages of a product, they should be treated as one subsection.
  • C11...C1n represent the subsections of the first speech
  • C21...C2n represent the subsections of the second speech
  • Cn1...Cnn represent the n subsections of the nth speech.
  • The manuscript may be segmented in ways including but not limited to the following: using existing semantic recognition technology, the whole manuscript is divided into several subsections according to the semantics of the text; alternatively, subsections may be formed from the speech recognition results by combining pauses and semantics. The specific method is not limited.
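The pause-based variant above can be sketched as follows. This is a minimal illustration only: the input format (sentences paired with the pause length recorded after each one) and the 1.5-second threshold are assumptions, since the text does not fix a concrete algorithm.

```python
# Minimal sketch: merge recognized sentences into subsections using the
# pause length recorded after each sentence. The input format and the
# 1.5 s threshold are illustrative assumptions, not from the patent.

def segment_by_pauses(sentences, pause_threshold=1.5):
    """sentences: list of (text, pause_after_seconds) tuples."""
    subsections, current = [], []
    for text, pause in sentences:
        current.append(text)
        if pause >= pause_threshold:   # a long pause closes a subsection
            subsections.append(" ".join(current))
            current = []
    if current:                        # flush the trailing subsection
        subsections.append(" ".join(current))
    return subsections

recognized = [
    ("Our product saves time.", 0.4),
    ("It also reduces cost.", 2.0),    # long pause: subsection boundary
    ("Now let me tell a story.", 0.3),
    ("It happened last year.", 0.0),
]
print(segment_by_pauses(recognized))
```

In practice the pause signal would be combined with semantic similarity between adjacent sentences, as the text suggests, rather than used alone.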
  • Recognition, which can also be interpreted as popularity, is learned by the neural network from the training data. For example, when training the network, questions and answers can be prepared manually and the ranking of the answers to each question (that is, the recognition/popularity of each answer) can be assigned manually; alternatively, data can be migrated from other question-answer databases as training samples.
  • Fig. 4 shows a schematic diagram of the working process of the neural network.
  • the trained neural network is used to identify the segmented sections and output sorting information.
  • The preset question and the segmented subsections are used as the network input. For example, question 1 + C11…Cnn is used as input, and the ranking information of C11…Cnn for preset question 1 is output; similarly, question 2 + C11…Cnn is used as input, and the ranking information of C11…Cnn for preset question 2 is output.
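The pairing loop above can be sketched as below. The `model_rank` function is a hypothetical stand-in for the trained neural network (here it just ranks by word overlap with the question); only the shape of the loop, each question paired in turn with all subsections, reflects the text.

```python
# Sketch of the inference loop: each preset question is paired in turn
# with all subsections C11..Cnn and fed to the model, which returns
# ranking information for that question. `model_rank` is a hypothetical
# stand-in for the trained network, not the patent's actual model.

def model_rank(question, subsections):
    # Stand-in: rank subsections by word overlap with the question.
    q_words = set(question.lower().split())
    overlap = lambda s: len(q_words & set(s.lower().split()))
    order = sorted(subsections, key=overlap, reverse=True)
    return {sub: i + 1 for i, sub in enumerate(order)}

subsections = ["the product saves cost", "a story about last year"]
questions = ["What are the advantages of the product?"]

# One ranking per preset question, over all subsections.
rankings = {q: model_rank(q, subsections) for q in questions}
print(rankings)
```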
  • The preset questions are set according to the content of the speech; specifically, questions related to the theme of the speech can be set as preset questions. The higher a subsection ranks, the higher its recognition/popularity as an answer to the preset question.
  • the evaluation result is a score
  • the corresponding relationship between the ranking and the score needs to be set, for example, the first ranking corresponds to 10 points, the second ranking corresponds to 8 points, and so on.
  • Paragraphs C11…C1n each have a rank for these two questions, and the highest rank is taken here. For example, C11 ranks second (the highest) for question 1, so its score is 8; C14 ranks third (the highest) for question 2, with a score of 6.
  • The method of calculating the total score over the two questions of the speech may include but is not limited to the following: direct addition, weighted addition, weighted averaging, etc.
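The scoring step, take the best rank per question among a manuscript's subsections, map it to a preset score, and combine, can be sketched as below. The rank-to-score table and the example weights are illustrative assumptions; the text only fixes the pattern (rank 1 = 10 points, rank 2 = 8 points, and so on).

```python
# Sketch of the scoring step: for each preset question, take the best
# (lowest-numbered) rank achieved by any subsection of the manuscript,
# map it to a preset score, and combine the per-question scores.
# The rank->score table and the weights are illustrative assumptions.

RANK_SCORES = {1: 10, 2: 8, 3: 6}      # preset score for each rank

def evaluate(ranks_per_question, weights=None):
    """ranks_per_question: {question: {subsection: rank}}"""
    total = 0.0
    for q, sub_ranks in ranks_per_question.items():
        best = min(sub_ranks.values())          # highest rank achieved
        score = RANK_SCORES.get(best, 0)
        total += (weights or {}).get(q, 1.0) * score
    return total

ranks = {
    "q1": {"C11": 2, "C12": 5},   # best rank 2 -> 8 points
    "q2": {"C11": 4, "C14": 3},   # best rank 3 -> 6 points
}
print(evaluate(ranks))                                   # direct addition
print(evaluate(ranks, weights={"q1": 1.2, "q2": 0.8}))   # weighted addition
```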
  • The evaluation result can also be a classification result; for example, categories such as "excellent", "good", "medium", and "poor" can be preset, and the ranking information output by the neural network is classified to obtain the category the speech belongs to.
  • the evaluation task for the entire speech can be converted into the evaluation of the degree of acceptance of questions and answers, and there is no need to provide a large number of speeches with different qualities as learning samples for the neural network.
  • The embodiment of the present invention also provides a speech evaluation method; as shown in Figure 2, the method comprises:
  • a set of training data includes a preset question and corresponding multiple candidate answers.
  • The preset question is "What basic facts are described below?"
  • There are 40 corresponding candidate answers, and the labels are the rankings of the 40 answers, given based on people's subjective preferences. The higher the rank, the higher the quality of the answer, which can be interpreted as higher recognition and higher popularity: people prefer the sample answers ranked higher.
  • The contexts related to the preset question and the corresponding answer contents can be crawled from several specified webpages. It should be noted that when answer contents are duplicated, the sample answers must be obtained after merging; the ranking information is obtained according to the ranking of each answer content within the webpages.
  • For the preset questions of a speech on a certain topic, matching questions can be found on the Internet (such as in a question-and-answer system) along with their answers. One answer is selected by the questioner as the best answer, and the other answers are ranked after it, for example according to the interaction between the questioner and the answerers, such as popularity. This ranking can be used directly as the label of the training data.
  • The number of ranks need not equal the number of candidate answers; for example, there may be 40 candidate answers but only 10 ranks.
  • Ranks can be tied or empty. For example, there may be no answer at rank one or rank two, while two or three answers tie at rank three; therefore the ranking information includes empty ranks and/or tied ranks.
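One way to hold such labels, an assumed representation, since the patent does not specify one, is a mapping from each answer to its rank number; empty ranks then simply have no entries, and tied ranks have several:

```python
# Sketch: rank labels in which ranks may be empty or tied, as the text
# allows. Here no answer holds ranks 1, 2 or 5, and two answers tie at 3.
# The answer->rank mapping is an illustrative assumption.
from collections import defaultdict

labels = {"ans_a": 3, "ans_b": 3, "ans_c": 4, "ans_d": 6}

by_rank = defaultdict(list)
for ans, rank in labels.items():
    by_rank[rank].append(ans)

tied  = [r for r, answers in by_rank.items() if len(answers) > 1]
empty = [r for r in range(1, max(labels.values()) + 1) if r not in by_rank]
print(tied)    # tied ranks
print(empty)   # empty ranks
```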
  • the preset questions are set according to the content of the speech, and each assessment corresponds to a speech on the same topic.
  • Some questions can be set according to the theme of the speech, for example: "What basic facts are described below?", "What are the advantages compared with other products?", "What benefits can users get?", "What did the protagonist do?", "What advanced deeds does the protagonist have?", and so on.
  • The above example questions and candidate answers are illustrative only and are not intended to be limiting.
  • The neural network model is trained with the plurality of training data: the network outputs ranking information from multiple sample answers and a preset question, and the model parameters are optimized according to the difference between the output ranking information and the ranking information in the training data.
  • A two-layer feed-forward neural network is selected; the input is a preset question and several candidate answers, and the output is the ranking labels of those candidate answers.
  • The above multiple training data are used to train the network, and the difference between the ranking the network outputs and the label determines the loss, which is used to optimize the network parameters.
  • A ∈ R^{m×d} and B ∈ R^{1×m} are the optimized weight-matrix parameters;
  • b1 ∈ R^m and b2 ∈ R are linear bias vectors.
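A NumPy sketch of a two-layer feed-forward scorer with exactly the parameter shapes named above. The d-dimensional feature vector x for a question-answer pair, the ReLU nonlinearity, and the random toy parameters are assumptions, since the text does not spell out the featurization or activation.

```python
import numpy as np

# Two-layer feed-forward scorer with the shapes named in the text:
# A in R^{m x d}, b1 in R^m, B in R^{1 x m}, b2 in R.
# The d-dim feature vector x and the ReLU activation are assumptions.

rng = np.random.default_rng(0)
d, m = 8, 4
A  = rng.standard_normal((m, d))
b1 = rng.standard_normal(m)
B  = rng.standard_normal((1, m))
b2 = rng.standard_normal(1)

def score(x):
    h = np.maximum(0.0, A @ x + b1)   # hidden layer, in R^m
    return float(B @ h + b2)          # scalar relevance score

# Score each candidate answer's feature vector and rank by score.
candidates = rng.standard_normal((3, d))
scores = [score(x) for x in candidates]
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(ranking)
```

During training, A, b1, B, b2 would be optimized against the difference between the output ranking and the labeled ranking, as the text describes.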
  • The speech manuscript is text data obtained by speech recognition of the speech recording, and the lengths of pauses are recorded during speech recognition; in the step of dividing each manuscript into several subsections, the manuscript is divided according to its semantics and the pause lengths in the recording.
  • The neural network model is used to identify all the subsections and the multiple different preset questions: each preset question together with all the subsections is used in turn as input data, the model extracts feature data from the input data, and outputs ranking information of all the subsections for the preset question according to the feature data, indicating the recognition/popularity of each subsection as an answer to that question.
  • In the process of extracting feature data from the input data, an attention mechanism is used to process the text data from the subsections and the text data from the preset question, and the ranking information is output based on the feature data obtained after processing.
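A minimal sketch of such an attention step, using scaled dot-product attention where question token embeddings attend over subsection token embeddings. The embedding size, the mean pooling, and the random toy embeddings are assumptions; the patent only states that attention processes the two text inputs.

```python
import numpy as np

# Sketch of the attention step: question token embeddings attend over
# subsection token embeddings (scaled dot-product attention), and the
# attended features are pooled into one vector for the ranking head.
# Embedding size and the random toy embeddings are assumptions.

def attend(question_emb, section_emb):
    """question_emb: (Lq, d); section_emb: (Ls, d)."""
    d = question_emb.shape[-1]
    scores = question_emb @ section_emb.T / np.sqrt(d)     # (Lq, Ls)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row softmax
    attended = weights @ section_emb                       # (Lq, d)
    return attended.mean(axis=0)                           # pooled feature

rng = np.random.default_rng(1)
q = rng.standard_normal((5, 16))   # 5 question tokens, dim 16
s = rng.standard_normal((9, 16))   # 9 subsection tokens, dim 16
feat = attend(q, s)
print(feat.shape)
```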
  • Each rank corresponds to a preset score; for example, rank one corresponds to 10 points, rank two to 8 points, and so on. The correspondence between ranks and scores is set according to actual needs.
  • The most relevant answer to each question is found in the speech content to determine the score.
  • The algorithm for the comprehensive score can sum the scores of the individual questions directly; for example, as shown in Figure 1, the final score can be obtained by adding the scores of the answers: 35 + 30.
  • Weights can also be set according to the importance of each question. For example, if the weight of question 1 is 1.2 and the weight of question 2 is 0.8, the final score is 1.2*35 + 0.8*30.
  • the specific calculation method of comprehensive score and weight setting are not limited.
  • The embodiment of the present invention migrates the technology and data of a knowledge question-answering system to construct an evaluation method for speech manuscripts, a setting that lacks data and is difficult to score.
  • The evaluation model has high accuracy and strong interpretability: it can not only give scores but also provide similar reference cases for the corresponding scoring points.
  • An embodiment of the present invention also provides a speech evaluation device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the method described in the foregoing embodiments.
  • An embodiment of the present invention also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method described in the above-mentioned embodiments.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a speech manuscript evaluation method and device, which relate to the technical field of information processing, and which mainly aim to solve the problems in conventional evaluation methods of large required sample data, few sample types and the effectiveness and fairness of evaluation results being poor. The technical solution comprises: acquiring a plurality of speech manuscripts; segmenting each speech manuscript into a plurality of sections; using a neural network model to identify all of the sections and a plurality of different preset questions, sequentially using each preset question and all of the sections as input data, the neural network model extracting feature data from the input data, and outputting sorting information of all of the sections for a preset question according to the feature data, the sorting information being used for representing the degree of recognition of each section for answering the preset question; and determining an evaluation result of each speech manuscript according to the sorting information of all of the sections for each preset question.

Description

Speech Manuscript Evaluation Method and Device

Technical Field

The embodiments of the present invention relate to the technical field of information processing, and in particular to a method and device for evaluating speech manuscripts.

Background Art

A speech is a way of speaking and communicating in a special setting, and professional speech training requires repeated practice. Scoring and evaluating in time during training to find deficiencies can speed up progress. Speech scoring covers facial expressions and posture, speech speed, and speech content.

The most common scoring method for speech content is to build a regression model: a massive data set is established to collect speech content across different score ranges, features are designed manually or extracted automatically by machine, the contribution of each feature to the score is computed, and effective features are extracted to establish the relationship between features and scores. Training the regression model extracts features from the speech manuscript data set, establishes the feature-score relationship, and stores it in the form of a weight matrix. However, this method relies on a large amount of data, and the samples must cover every score range, topic, and so on; otherwise the scoring results become randomly distributed, harming the validity and fairness of the scoring. In practice, only a few excellent samples exist at the start of speech training, and low- and medium-score samples are extremely scarce. The same is true of other speech data sets among open data resources, which retain only the best speeches and therefore cannot be learned from directly through transfer learning.
Summary of the Invention

In view of this, the embodiments of the present invention provide a speech manuscript evaluation method and device, whose main purpose is to solve the problems of traditional evaluation methods: large required sample data, few sample types, and poor validity and fairness of evaluation results.

In order to solve the above problems, the embodiments of the present invention mainly provide the following technical solutions:

In a first aspect, an embodiment of the present invention provides a speech manuscript evaluation method, the method comprising:

acquiring a plurality of speech manuscripts;

dividing each of the speech manuscripts into several subsections;

using a neural network model to identify all the subsections and a plurality of different preset questions, wherein each preset question together with all the subsections is used in turn as input data, the neural network model extracts feature data from the input data, and outputs ranking information of all the subsections for the preset question according to the feature data, the ranking information indicating the recognition of each subsection as an answer to the preset question; and

determining the evaluation result of each speech manuscript according to the ranking information of all the subsections for each preset question.
Optionally, before using the neural network model to identify all the subsections and the plurality of different preset questions, the method further includes:

acquiring a plurality of training data, each of which includes a plurality of sample answers, a preset question, and ranking information of each sample answer for the preset question; and

training the neural network model with the plurality of training data, wherein the neural network outputs ranking information from the plurality of sample answers and the preset question, and the model parameters are optimized according to the difference between the output ranking information and the ranking information in the training data.

Optionally, acquiring the plurality of training data specifically includes:

crawling, from several specified webpages, contexts related to the preset question and the corresponding answer contents; and

obtaining the ranking information according to the ranking of each answer content within the webpages.

Optionally, determining the evaluation result of each speech manuscript according to the ranking information of all the subsections for each preset question specifically includes:

from the ranking information of all the subsections for each preset question, obtaining the highest rank achieved, for each preset question, by the subsections belonging to the same speech manuscript; and

obtaining the evaluation result for that speech manuscript according to its highest ranks.

Optionally, the rank information corresponds to preset scores, and the evaluation result is a score obtained from the preset scores.

Optionally, the speech manuscript is text data obtained by performing speech recognition on a recording of the speech, and the lengths of pauses are recorded during speech recognition; in the step of dividing each manuscript into several subsections, the manuscript is divided according to its semantics and the pause lengths in the recorded speech.

Optionally, the ranking information includes empty ranks and/or tied ranks.

Optionally, when the neural network model extracts feature data from the input data, an attention mechanism is applied to the text data from the subsections and the text data from the preset question, and the ranking information is output based on the resulting feature data.

In a second aspect, an embodiment of the present invention provides a speech manuscript evaluation device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the above speech manuscript evaluation method.

In a third aspect, an embodiment of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the above speech manuscript evaluation method.
According to the speech manuscript evaluation method and device provided by the present invention, the task of evaluating an entire speech manuscript is converted into evaluating the recognition of question-answer pairs. There is no need to provide a large number of speech manuscripts of varying quality as learning samples for the neural network; it suffices to preset questions related to the topic of the speech and prepare corresponding answers with different degrees of recognition. The neural network model can then be trained and used to evaluate multiple speech manuscripts, which solves the difficulty of evaluating speech manuscripts caused by the lack of samples in the prior art, and the scheme has high accuracy.
Brief Description of the Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting. Throughout the drawings, the same reference numerals designate the same components. In the drawings:

Fig. 1 shows a flowchart of a speech manuscript evaluation method in an embodiment of the present invention;

Fig. 2 shows a flowchart of another speech evaluation method in an embodiment of the present invention;

Fig. 3 shows a schematic diagram of the per-paragraph evaluation results obtained in an embodiment of the present invention; and

Fig. 4 shows a schematic diagram of the working process of the neural network model in an embodiment of the present invention.
具体实施方式detailed description
下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例,然而应当理解,可以以各种形式实现本公开 而不应被这里阐述的实施例所限制。相反,提供这些实施例是为了能够更透彻地理解本公开,并且能够将本公开的范围完整的传达给本领域的技术人员。Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided for more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.
The present invention provides a speech evaluation method that can be executed by an electronic device such as a computer or a server. As shown in Fig. 1, the method comprises the following steps:
101. Obtain multiple speeches.
102. Divide each of the speeches into several subsections.
In an embodiment of the present invention, a subsection refers to a passage expressing a particular argument, which may consist of one natural paragraph or of several. For example, if several consecutive paragraphs all discuss the advantages of a product, those paragraphs are treated as one subsection. Let C11...C1n denote the subsections of the first speech, C21...C2n those of the second speech, and Cn1...Cnn the n subsections of the n-th speech. Ways of segmenting a speech include, but are not limited to, the following: using existing semantic recognition techniques, the whole manuscript is divided into subsections according to the semantics of the text; alternatively, subsections may be formed from speech recognition results by combining pause information with semantics. The specific approach is not limited.
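Purely as an illustrative, non-limiting sketch of the pause-based segmentation variant, assuming the speech recognizer supplies each text segment together with the length of the pause that follows it (the function name and threshold below are hypothetical):

```python
def split_into_sections(segments, pause_threshold=1.5):
    """Merge ASR segments into subsections.

    segments: list of (text, pause_after_seconds) pairs; a pause of at
    least pause_threshold seconds is treated as a subsection boundary.
    """
    sections, current = [], []
    for text, pause in segments:
        current.append(text)
        if pause >= pause_threshold:
            sections.append(" ".join(current))
            current = []
    if current:  # flush the trailing subsection
        sections.append(" ".join(current))
    return sections
```

A semantics-based segmenter could replace or refine these boundaries, for example by merging adjacent sections that discuss the same argument.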
103. Use a neural network model to recognize all the subsections together with multiple different preset questions, wherein each preset question in turn is combined with all the subsections as input data; the neural network model extracts feature data from the input data and, based on the feature data, outputs ranking information of all the subsections with respect to the preset question, which indicates the degree of acceptance of each subsection with respect to that question.
The degree of acceptance, which can also be interpreted as popularity, is learned by the neural network from the training data. For example, when training the network, answers and questions can be prepared manually, with the ranking of the answers to each question - that is, the popularity or acceptance of each answer - also given manually; alternatively, data can be migrated from other question-answering databases as training samples. There are various methods for training the network and various ways of constructing and obtaining the sample data, which are described in subsequent embodiments. It follows that the neural network in this solution is not used to identify shallow semantic associations between answers and questions; rather, it ranks multiple answers based on learned knowledge so as to simulate human judgment. A higher-ranked answer indicates a higher degree of acceptance or popularity, i.e., one that people would be expected to prefer. From a purely semantic point of view, however, a higher-ranked answer is not necessarily more relevant to the preset question than a lower-ranked one.
Fig. 4 shows a schematic diagram of the working process of the neural network. In this solution, the trained neural network recognizes the segmented subsections and outputs ranking information. Specifically, a preset question together with the segmented subsections serves as the network input: for example, with question 1 plus C11...Cnn as input, the network outputs the ranking of C11...Cnn with respect to preset question 1; likewise, with question 2 plus C11...Cnn as input, it outputs the ranking of C11...Cnn with respect to preset question 2. The preset questions are set according to the content of the speeches; in particular, questions related to the topic of the speeches can be set as preset questions. The higher a subsection ranks, the higher its degree of acceptance or popularity as an answer to the preset question.
104. Determine the evaluation result of each speech according to the ranking information of all the subsections with respect to each preset question.
In one embodiment, the evaluation result is a score. Before a score is derived from the rankings, a correspondence between rank and score is set; for example, first place corresponds to 10 points, second place to 8 points, and so on. Suppose there are two preset questions; the subsections C11...C1n each have a ranking for both questions, and the highest ranking is taken for each question. For example, if C11 ranks second (the speech's best) for question 1, the score is 8; if C14 ranks third (the speech's best) for question 2, the score is 6. The total score of the speech over the two questions can be computed in ways that include, but are not limited to, direct addition, weighted addition, and weighted averaging.
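As a minimal sketch of the rank-to-score mapping described above (the mapping table and the direct-addition variant are illustrative assumptions, not fixed by the method):

```python
RANK_SCORES = {1: 10, 2: 8, 3: 6, 4: 4, 5: 2}  # hypothetical rank-to-score table

def question_score(ranks):
    """Score one preset question: take the speech's best (lowest) rank
    among its subsections and look up the corresponding score."""
    return RANK_SCORES.get(min(ranks), 0)

def speech_score(ranks_per_question):
    """Direct-addition variant: sum the per-question scores."""
    return sum(question_score(r) for r in ranks_per_question)
```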
In other embodiments, the evaluation result may also be a classification result: for example, categories such as "excellent", "good", "average" and "poor" are set in advance, and the ranking information output by the neural network is classified to obtain the category to which each speech belongs.
According to the speech evaluation method and device provided by the present invention, the task of evaluating an entire speech can be converted into evaluating the degree of acceptance of answers to questions, so there is no need to supply a large number of speeches of varying quality as learning samples for the neural network. It is only necessary to preset questions related to the topic of the speech and to prepare corresponding answers with different degrees of acceptance; the neural network model can then be trained and used to evaluate multiple speeches. This solves the problem in the prior art that speeches are difficult to evaluate for lack of samples, and the present solution achieves high accuracy.
An embodiment of the present invention further provides a speech evaluation method. As shown in Fig. 2, the method comprises:
201. Obtain multiple sets of training data, each of which includes multiple sample answers, one preset question, and ranking information of each sample answer with respect to the preset question.
In the embodiment of the present invention, the neural network model needs to be trained. One set of training data includes a preset question and multiple corresponding candidate answers. For example, the preset question may be "What basic facts are described below?" with 40 corresponding candidate answers; the label is the ranking of the 40 answers, given on the basis of human subjective preference. A higher rank indicates a higher-quality answer, which can be interpreted as a sample answer with a higher degree of acceptance and popularity, i.e., one that people like better.
In the embodiment of the present invention, multiple sets of training data may be obtained by crawling, from several specified web pages, the context related to the preset question and the corresponding answer content. It should be noted that when duplicate answers exist, they must first be merged to obtain the sample answers; the ranking information is then derived from the order of the answers on the web pages. For example, for the preset question of a speech on a given topic, a matching question can often be found on the Internet (for instance in a question-answering system), together with answers to that question, one of which may have been selected by the asker as the best answer, with the others ranked after it. The answers can also be ranked by asker-answerer interaction metrics such as popularity, and this ranking can be used directly as the label of the training data.
It should be noted that the number of ranks need not equal the number of candidate answers; for example, there may be 40 candidate answers but only 10 ranks. Ties and empty ranks can occur: for instance, zero answers may be ranked first, zero ranked second, and two or three ranked third. The ranking information may therefore include empty ranks and/or tied ranks.
The preset questions are set according to the content of the speeches; each evaluation corresponds to speeches on the same topic, and questions can be set for that topic, for example: "What basic facts are described below?", "What are the advantages over other products?", "What benefits can users obtain?", "What did the protagonist do?", "What notable deeds does the protagonist have?", and so on. Following the common practice of answer ranking in question-answering systems, k candidate answers are selected and placed alongside the speeches. A top-k procedure first retrieves the top-n (for example, n = 10) relevant documents for a given question, using an algorithm such as tf-idf or BM25. The n documents are then split into paragraphs, yielding a candidate answer pool much larger than n, from which the top k candidate answers (for example, k = 40) are selected. The numbers of relevant documents and candidate answers above are merely examples and are not intended to limit those quantities.
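The top-n retrieval step can be sketched with a simple tf-idf scorer (BM25 would be the more common choice in practice; this minimal standard-library version only illustrates the idea, and the whitespace tokenization and idf smoothing are assumptions):

```python
import math
from collections import Counter

def tfidf_rank(question, docs, top_n=2):
    """Return the indices of the top_n docs ranked by a simple
    tf-idf dot product against the question terms."""
    q_terms = question.lower().split()
    doc_tokens = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for toks in doc_tokens for t in set(toks))
    idf = {t: math.log((n + 1) / (df[t] + 1)) + 1 for t in df}

    def score(toks):
        tf = Counter(toks)
        return sum(tf[t] * idf.get(t, 0.0) for t in q_terms)

    ranked = sorted(range(n), key=lambda i: score(doc_tokens[i]), reverse=True)
    return ranked[:top_n]
```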
202. Train the neural network model with the multiple sets of training data; the neural network outputs ranking information from the multiple sample answers and one preset question, and the model parameters are optimized according to the difference between the output ranking information and the ranking information in the training data.
In the embodiment of the present invention, a two-layer feed-forward neural network is used; the input is a preset question and several candidate answers, and the output is the labels of the candidate answers. The network is trained with the training data described above, and the difference between the network's output ranking and the labels determines the loss, from which the network parameters are optimized.
The network is expressed as f(x_i) = ReLU(x_i A^T + b_1) B^T + b_2,
where x_i denotes the feature after attention, A ∈ R^{m×d} and B ∈ R^{1×m} are the weight matrix parameters to be optimized, and b_1 ∈ R^m and b_2 ∈ R are linear bias terms.
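A dependency-free sketch of this two-layer feed-forward scorer (dimensions, initialization, and the helper names are illustrative assumptions):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def ff_score(x, A, b1, B, b2):
    """f(x) = ReLU(x A^T + b1) B^T + b2, returning a scalar score
    for one attended feature vector x."""
    h = relu([u + b for u, b in zip(matvec(A, x), b1)])  # hidden layer, dim m
    return matvec(B, h)[0] + b2                          # scalar output
```

In practice the parameters A, b1, B, b2 would be learned by backpropagating a ranking loss, as described in step 202.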
203. Obtain multiple speeches.
204. Divide each of the speeches into several subsections.
In a specific implementation, the speech manuscript is text data obtained by performing speech recognition on a recording of the speech, with the lengths of pauses recorded during speech recognition. In the step of dividing each speech into subsections, the speech is segmented according to its semantics and the pause lengths in the recording.
205. Use the neural network model to recognize all the subsections together with multiple different preset questions, wherein each preset question in turn is combined with all the subsections as input data; the neural network model extracts feature data from the input data and, based on the feature data, outputs ranking information of all the subsections with respect to the preset question, which indicates the degree of acceptance or popularity of each subsection as an answer to that question.
The evaluation process of the neural network model is shown in Fig. 3. In a preferred embodiment, when the neural network model extracts feature data from the input data, an attention mechanism is applied to the text data from the subsections and the text data from the preset question, and the ranking information is output based on the resulting feature data.
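The attention step can be sketched as plain dot-product attention between question-token vectors and subsection-token vectors (the embedding lookup is assumed to happen elsewhere; this sketch is illustrative only):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(question_vecs, section_vecs):
    """For each question-token vector, pool the subsection-token vectors
    weighted by dot-product attention; returns one attended feature per
    question token."""
    dim = len(section_vecs[0])
    out = []
    for q in question_vecs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in section_vecs])
        out.append([sum(w * k[d] for w, k in zip(weights, section_vecs))
                    for d in range(dim)])
    return out
```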
206. From the ranking information of all the subsections with respect to each preset question, obtain the highest rank achieved, for each preset question, by the subsections belonging to the same speech.
In the embodiment of the present invention, each rank corresponds to a preset score; for example, first place corresponds to 10 points and second place to 8 points, with the correspondence between rank and score set according to actual needs.
207. Obtain the evaluation result for the speech according to the highest-rank information of that speech.
In the embodiment of the present invention, for each question, the most relevant answer within the speech content should be found to determine the score. For example, suppose there are two preset questions, and the subsections C11...C1n each have a ranking for both; the highest ranking is taken for each question. If C11 ranks second (the speech's best) for question 1, the score is 8; if C14 ranks third (the speech's best) for question 2, the score is 6.
It should be noted that the overall score may be computed by directly adding the per-question scores, or by weighting them according to the importance of each question. For example, as in Fig. 1, the answer scores can be added directly to give a final score of 35 + 30. Weights can also be assigned by question importance: with a weight of 1.2 for question 1 and 0.8 for question 2, the final score is 1.2 × 35 + 0.8 × 30. The specific method of computing the overall score and the weight settings are not limited.
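The weighted variant of the overall score described above can be sketched as follows (the weights 1.2 and 0.8 are the example values from the text, not prescribed by the method):

```python
def weighted_total(scores, weights):
    """Combine per-question scores with importance weights."""
    if len(scores) != len(weights):
        raise ValueError("one weight per question is required")
    return sum(s * w for s, w in zip(scores, weights))
```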
The embodiment of the present invention migrates the technology and data of knowledge question-answering systems to construct an evaluation method for speeches, which are otherwise difficult to score for lack of data. The evaluation model has high accuracy and strong interpretability: it can give not only a score but also similar cases for the corresponding score.
An embodiment of the present invention further provides a speech evaluation device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the at least one processor to perform the method described in the above embodiments.
An embodiment of the present invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method described in the above embodiments.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, causing a series of operational steps to be performed thereon to produce computer-implemented processing, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, the above embodiments are merely examples given for clarity of description and are not intended to limit the implementations. Those of ordinary skill in the art may make other changes or variations of different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here, and the obvious changes or variations derived therefrom remain within the scope of protection of the present invention.

Claims (9)

  1. A speech evaluation method, characterized by comprising:
    obtaining multiple speeches;
    dividing each of the speeches into several subsections;
    using a neural network model to recognize all the subsections together with multiple different preset questions, wherein each preset question in turn is combined with all the subsections as input data, the neural network model extracts feature data from the input data and, based on the feature data, outputs ranking information of all the subsections with respect to the preset question, which indicates the degree of acceptance of each subsection as an answer to the preset question; and
    determining the evaluation result of each speech according to the ranking information of all the subsections with respect to each preset question.
  2. The method according to claim 1, characterized in that, before using the neural network model to recognize all the subsections and multiple different preset questions, the method further comprises:
    obtaining multiple sets of training data, each of which includes multiple sample answers, one preset question, and ranking information of each sample answer with respect to the preset question; and
    training the neural network model with the multiple sets of training data, wherein the neural network outputs ranking information from the multiple sample answers and the preset question, and the model parameters are optimized according to the difference between the output ranking information and the ranking information in the training data.
  3. The method according to claim 2, characterized in that obtaining multiple sets of training data specifically comprises:
    crawling, from several specified web pages, the context related to the preset question and the corresponding answer content; and
    obtaining the ranking information according to the order of the answer contents on the web pages.
  4. The method according to any one of claims 1-3, characterized in that determining the evaluation result of each speech according to the ranking information of all the subsections with respect to each preset question specifically comprises:
    from the ranking information of all the subsections with respect to each preset question, obtaining the highest rank achieved, for each preset question, by the subsections belonging to the same speech; and
    obtaining the evaluation result for the speech according to the highest-rank information of that speech.
  5. The method according to claim 4, wherein the ranking information corresponds to preset scores, and the evaluation result is a score obtained from the respective preset scores.
  6. The method according to claim 1, characterized in that the speech manuscript is text data obtained by performing speech recognition on a recording of the speech, with the lengths of pauses recorded during speech recognition; and
    in the step of dividing each of the speeches into several subsections, the speech is divided into subsections according to its semantics and the pause lengths in the spoken delivery.
  7. The method according to claim 1 or 2, characterized in that the ranking information includes empty ranks and/or tied ranks.
  8. The method according to claim 1 or 2, characterized in that, when the neural network model extracts feature data from the input data, an attention mechanism is applied to the text data from the subsections and the text data from the preset question, and the ranking information is output based on the resulting feature data.
  9. A speech evaluation device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the at least one processor to perform the method according to any one of claims 1-8.
PCT/CN2021/133041 2021-07-06 2021-11-25 Speech manuscript evaluation method and device WO2023279631A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110759496.0 2021-07-06
CN202110759496.0A CN113255843B (en) 2021-07-06 2021-07-06 Speech manuscript evaluation method and device

Publications (1)

Publication Number Publication Date
WO2023279631A1 true WO2023279631A1 (en) 2023-01-12

Family

ID=77190758

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133041 WO2023279631A1 (en) 2021-07-06 2021-11-25 Speech manuscript evaluation method and device

Country Status (2)

Country Link
CN (1) CN113255843B (en)
WO (1) WO2023279631A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255843B (en) * 2021-07-06 2021-09-21 北京优幕科技有限责任公司 Speech manuscript evaluation method and device
CN115545042B (en) * 2022-11-25 2023-04-28 北京优幕科技有限责任公司 Lecture draft quality assessment method and lecture draft quality assessment equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304587A (en) * 2018-03-07 2018-07-20 中国科学技术大学 A kind of community's answer platform answer sort method
CN108604240A (en) * 2016-03-17 2018-09-28 谷歌有限责任公司 The problem of based on contextual information and answer interface
US20180365220A1 (en) * 2017-06-15 2018-12-20 Microsoft Technology Licensing, Llc Method and system for ranking and summarizing natural language passages
CN110210301A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Method, apparatus, equipment and storage medium based on micro- expression evaluation interviewee
CN110874716A (en) * 2019-09-23 2020-03-10 平安科技(深圳)有限公司 Interview evaluation method and device, electronic equipment and storage medium
CN113255843A (en) * 2021-07-06 2021-08-13 北京优幕科技有限责任公司 Speech manuscript evaluation method and device


Also Published As

Publication number Publication date
CN113255843B (en) 2021-09-21
CN113255843A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
Banks et al. A review of best practice recommendations for text analysis in R (and a user-friendly app)
US10719664B2 (en) Cross-media search method
WO2020253503A1 (en) Talent portrait generation method, apparatus and device, and storage medium
WO2023279631A1 (en) Speech manuscript evaluation method and device
KR20190125153A (en) An apparatus for predicting the status of user's psychology and a method thereof
US11409964B2 (en) Method, apparatus, device and storage medium for evaluating quality of answer
CN110175229B (en) Method and system for on-line training based on natural language
US11531928B2 (en) Machine learning for associating skills with content
CN117009490A (en) Training method and device for generating large language model based on knowledge base feedback
CN111046941A (en) Target comment detection method and device, electronic equipment and storage medium
US20200250212A1 (en) Methods and Systems for Searching, Reviewing and Organizing Data Using Hierarchical Agglomerative Clustering
US20200192921A1 (en) Suggesting text in an electronic document
CN110321421B (en) Expert recommendation method for website knowledge community system and computer storage medium
CN111552773A (en) Method and system for searching key sentence of question or not in reading and understanding task
CN112989033B (en) Microblog emotion classification method based on emotion category description
CN117480543A (en) System and method for automatically generating paragraph-based items for testing or evaluation
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN117076693A (en) Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus
CN112231491A (en) Similar test question identification method based on knowledge structure
CN115617960A (en) Post recommendation method and device
Pentland et al. Does accuracy matter? Methodological considerations when using automated speech-to-text for social science research
CN113515935B (en) Title generation method, device, terminal and medium
Karlgren et al. Text mining for processing interview data in computational social science
CN112989001A (en) Question and answer processing method, device, medium and electronic equipment
CN114547435A (en) Content quality identification method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21949116

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023577794

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE