CN114187544A - College English speaking multi-mode automatic scoring method - Google Patents


Info

Publication number
CN114187544A
Authority
CN
China
Prior art keywords
scoring
features
speech
model
video
Prior art date
Legal status
Pending
Application number
CN202111447603.2A
Other languages
Chinese (zh)
Inventor
黄玲毅
林和志
郭洋洋
姚舜禹
许智军
陈勇
郑超茹
黄联芬
Current Assignee
Xiamen University
Original Assignee
Xiamen University
Priority date
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202111447603.2A
Publication of CN114187544A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance


Abstract

The invention provides a multimodal automatic scoring method, medium and device for college English speeches. The method comprises the following steps: acquiring historical speech data; extracting text features, audio features and video features, and training models to obtain a language use scoring submodel, a speech expression scoring submodel and a non-language scoring submodel; generating a fourth data set from the outputs of the three submodels and the comprehensive score; training a model on that data set to obtain a multi-mode fusion learning model; acquiring a speech video to be scored, extracting its text, audio and video features, and outputting the corresponding individual scores through the three submodels; and inputting those individual scores into the multi-mode fusion learning model to output the final scoring result for the video. The method scores English speeches across multiple modalities, improving scoring accuracy and efficiency while reducing the cost of English speech scoring.

Description

College English speaking multi-mode automatic scoring method
Technical Field
The invention relates to the technical field of deep learning, and in particular to a multimodal automatic scoring method for college English speeches.
Background
A college English speech is an interactive activity characterized by multiple modalities: during the speech, the speaker must coordinate verbal and non-verbal modalities.
In the related art, English speeches are mostly rated either from a single modality or directly by hand. Single-modality scoring, for example, extracts the audio of the speech and scores it, or obtains the speech transcript and scores the text. Such methods consider too narrow a view of the speech, so the final score is inaccurate. Manual scoring, in turn, is prone to subjective influence, which makes results unstable, and it costs examiners considerable time and effort, so it is expensive.
Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art described above. A first object of the invention is therefore to propose a multimodal automatic scoring method for college English speeches that scores a speech across multiple modalities, improving scoring accuracy and efficiency while reducing the cost of English speech scoring.
A second object of the invention is to propose a computer-readable storage medium.
A third object of the invention is to propose a computer device.
In order to achieve the above purpose, an embodiment of a first aspect of the present invention provides a multimodality automatic scoring method for a college english speaking, including the following steps: acquiring historical speech data, wherein the historical speech data comprises speech videos and manual scoring results corresponding to the speech videos, and the manual scoring results comprise language use scores, speech expression scores, non-language scores and comprehensive scores; extracting text features, audio features and video features corresponding to the lecture video, generating a first data set according to the text features and the language usage scores, generating a second data set according to the audio features and the speech expression scores, and generating a third data set according to the video features and the non-language scores; training the model according to the first data set to obtain a language use scoring submodel, training the model according to the second data set to obtain a speech expression scoring submodel, and training the model according to the third data set to obtain a non-language scoring submodel; acquiring an output result of the language use scoring submodel, an output result of the speech expression scoring submodel and an output result of the non-language scoring submodel, and generating a fourth data set according to the output result of the language use scoring submodel, the output result of the speech expression scoring submodel, the output result of the non-language scoring submodel and the comprehensive score; training a model according to the fourth data set to obtain a multi-mode fusion learning model; acquiring a speech video to be scored, extracting text features, audio features and video features corresponding to the speech video to be scored, and respectively inputting the text features, the audio features and the video features corresponding to the speech video to be scored into the language usage 
scoring sub-model, the speech expression scoring sub-model and the non-language scoring sub-model so as to output corresponding single scoring through the language usage scoring sub-model, the speech expression scoring sub-model and the non-language scoring sub-model; and inputting the single scoring into the multi-mode fusion learning model so as to output a final scoring result corresponding to the lecture video to be scored through the multi-mode fusion learning model.
According to the college English speech multi-mode automatic scoring method, historical speech data are obtained, wherein the historical speech data comprise speech videos and manual scoring results corresponding to the speech videos, and the manual scoring results comprise language use scores, speech expression scores, non-language scores and comprehensive scores; secondly, extracting text features, audio features and video features corresponding to the lecture video, generating a first data set according to the text features and the language usage scores, generating a second data set according to the audio features and the speech expression scores, and generating a third data set according to the video features and the non-language scores; then, training a model according to the first data set to obtain a language use scoring submodel, training the model according to the second data set to obtain a speech expression scoring submodel, and training the model according to the third data set to obtain a non-language scoring submodel; then, obtaining an output result of the language use scoring submodel, an output result of the speech expression scoring submodel and an output result of the non-language scoring submodel, and generating a fourth data set according to the output result of the language use scoring submodel, the output result of the speech expression scoring submodel, the output result of the non-language scoring submodel and the comprehensive score; then, training a model according to the fourth data set to obtain a multi-mode fusion learning model; secondly, obtaining a speech video to be scored, extracting text features, audio features and video features corresponding to the speech video to be scored, and inputting the text features, the audio features and the video features corresponding to the speech video to be scored into the language usage scoring sub-model, the speech expression scoring sub-model and the non-language scoring sub-model respectively 
so as to output corresponding single scoring through the language usage scoring sub-model, the speech expression scoring sub-model and the non-language scoring sub-model; then, inputting the single scoring into the multi-mode fusion learning model so as to output a final scoring result corresponding to the lecture video to be scored through the multi-mode fusion learning model; therefore, multi-mode grading of the English speech is realized, and the grading accuracy and the grading efficiency are improved; meanwhile, the cost required by English speech scoring is reduced.
In addition, the multimodal automatic scoring method for college English speeches proposed by the above embodiment of the present invention may further have the following additional technical features.
Optionally, the text features, audio features and video features each comprise time-series features and statistical features.
Optionally, the first data set comprises a first training set and a first test set, and training the model according to the first data set to obtain the language use scoring submodel comprises: inputting the text features and the corresponding language use scores in the first training set into initial LR/RR, RF and SVR models; coarsely tuning the hyper-parameters with K-fold cross-validation and finely tuning them by grid search to obtain LR/RR, RF and SVR models corresponding to the text features; and inputting the first test set into these models respectively for performance evaluation, determining the final language use scoring submodel according to the evaluation results.
Optionally, inputting the first test set into the LR/RR, RF and SVR models corresponding to the text features for performance evaluation comprises: inputting the first test set into each of the models and outputting the corresponding results; and calculating the scoring accuracy and the mean absolute error of the LR/RR, RF and SVR models from those results, so as to select the optimal submodel according to the scoring accuracy and the mean absolute error.
Optionally, extracting the text features corresponding to the speech video comprises: performing speech recognition on the speech video to obtain the corresponding text; segmenting the text into words, counting the numbers of words and sentences from the segmentation result, part-of-speech tagging the words with a tagger, and obtaining the number of grammatical errors in the text with a grammar-checking algorithm; and extracting a word-vector matrix for the text with a bag-of-words method and a language model to capture context information.
Optionally, the audio features include: short-term energy, fundamental frequency, voice intensity, number of syllables, speech rate, pause duration, average pause duration, pause times, utterance duration, average speech stream length, and intonation.
Optionally, the lecture video is a depth video, wherein extracting video features corresponding to the lecture video includes: acquiring human body joint point position information corresponding to the lecture video, and generating video characteristics according to the human body joint point position information; the video features include bounding box features, curvature features, energy features, symmetry features, spatial features, temporal features, acceleration features, head orientation estimates, and gaze estimates.
In order to achieve the above object, a second aspect of the present invention provides a computer-readable storage medium, on which a college english speaking multimodal automatic scoring program is stored, which, when executed by a processor, implements the college english speaking multimodal automatic scoring method as described above.
According to the computer-readable storage medium of the embodiment of the invention, the college English speaking multi-mode automatic scoring program is stored, so that the processor can realize the college English speaking multi-mode automatic scoring method when executing the college English speaking multi-mode automatic scoring program, thereby realizing multi-mode scoring of English speaking and improving the scoring accuracy and efficiency; meanwhile, the cost required by English speech scoring is reduced.
In order to achieve the above object, a third embodiment of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the college english speaking multimodal automatic scoring method as described above.
According to the computer equipment provided by the embodiment of the invention, the college English speaking multi-mode automatic scoring program is stored through the memory, so that the processor can realize the college English speaking multi-mode automatic scoring method when executing the college English speaking multi-mode automatic scoring program, thus the English speaking multi-mode scoring is realized, and the scoring accuracy and the scoring efficiency are improved; meanwhile, the cost required by English speech scoring is reduced.
Drawings
Fig. 1 is a schematic flow chart of a multi-modal automatic scoring method for college english speaking according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
In the related art, English speeches are mostly rated either from a single modality or directly by hand, which yields low accuracy and wastes manpower and material resources. The college English speech multimodal automatic scoring method of the invention addresses this, as summarized above, by training three single-modality scoring submodels on historical speech data and fusing their individual scores through a multi-mode fusion learning model to produce the final score; this realizes multimodal scoring of English speeches, improves scoring accuracy and efficiency, and reduces the cost of English speech scoring.
In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Fig. 1 is a schematic flow chart of a multi-modal automatic scoring method for a college english speaking according to an embodiment of the present invention, and as shown in fig. 1, the multi-modal automatic scoring method for the college english speaking includes the following steps:
s101, historical speech data are obtained, wherein the historical speech data comprise speech videos and manual scoring results corresponding to the speech videos, and the manual scoring results comprise language usage scores, speech expression scores, non-language scores and comprehensive scores.
That is, the lecture process is recorded to obtain historical lecture data for subsequent training of the automatic scoring model based on the historical lecture data. The historical speech data comprises speech videos and manual scoring results corresponding to the speech videos, and the manual scoring results comprise language use scores, speech expression scores, non-language scores and comprehensive scores.
As an example, the historical speech data is obtained through a data acquisition device. The device includes a camera (e.g., a Microsoft Azure Kinect) that captures the speaker's facial expressions and records the speaker's 2D video and/or 3D whole-body video including depth; the device further includes an audio collector for capturing the speech audio.
S102, extracting text features, audio features and video features corresponding to the lecture video, generating a first data set according to the text features and the language use scores, generating a second data set according to the audio features and the language expression scores, and generating a third data set according to the video features and the non-language scores.
S103, training the model according to the first data set to obtain a language use scoring submodel, training the model according to the second data set to obtain a speech expression scoring submodel, and training the model according to the third data set to obtain a non-language scoring submodel.
That is, the manual score for a speech comprises a language use score, a speech expression score, a non-language score and a comprehensive score; three submodels can therefore be trained from each single-modality feature set and its corresponding item score: the language use scoring submodel, the speech expression scoring submodel and the non-language scoring submodel.
In some embodiments, the text features, audio features, and video features each include time series features and statistical features.
In some embodiments, the first data set comprises a first training set and a first test set, and training the model on the first data set to obtain the language use scoring submodel comprises: inputting the text features and the corresponding language use scores in the first training set into initial LR/RR, RF and SVR models; coarsely tuning the hyper-parameters with K-fold cross-validation and finely tuning them by grid search to obtain LR/RR, RF and SVR models corresponding to the text features; and inputting the first test set into these models respectively for performance evaluation, determining the final language use scoring submodel according to the evaluation results.
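The coarse-then-fine tuning described above can be sketched with scikit-learn, here for a ridge regressor on synthetic stand-in data (the feature matrix, score vector, and alpha grids are all illustrative, not from the patent; the same procedure would be repeated for the RF and SVR models):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))                        # stand-in text-feature matrix
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=80)   # stand-in language-use scores

# Coarse pass: K-fold cross-validation over a few widely spaced values.
coarse_alphas = [0.01, 1.0, 100.0]
coarse_scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean()
                 for a in coarse_alphas}
best_coarse = max(coarse_scores, key=coarse_scores.get)

# Fine pass: grid search in a narrow band around the coarse optimum.
fine_grid = {"alpha": [best_coarse * f for f in (0.25, 0.5, 1.0, 2.0, 4.0)]}
search = GridSearchCV(Ridge(), fine_grid, cv=5).fit(X, y)
best_ridge = search.best_estimator_
```

The coarse K-fold pass narrows the search to roughly the right order of magnitude; the grid search then refines within that band.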
It should be noted that the language use scoring submodel is only one of the three submodels; the same training procedure applies to the speech expression scoring submodel and the non-language scoring submodel. That is, for the speech expression scoring submodel, the second data set is divided into a second training set and a second test set: the second training set is used to train LR/RR, RF and SVR models on the audio features, and the second test set is used to evaluate them and select the optimal submodel, which becomes the final speech expression scoring submodel. The training process of the non-language scoring submodel is the same and is not repeated here.
There are various ways to select the optimal submodel.
In some embodiments, inputting the first test set into the LR/RR, RF and SVR models corresponding to the text features for performance evaluation comprises: inputting the first test set into each of the models and outputting the corresponding results; and calculating the scoring accuracy and the mean absolute error of the LR/RR, RF and SVR models from those results, so as to select the optimal submodel according to the scoring accuracy and the mean absolute error.
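A hedged sketch of this selection step: the patent does not define "scoring accuracy" precisely, so it is taken here to mean the fraction of predictions within a tolerance of the human score, and the human scores and model predictions are invented numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def scoring_accuracy(y_true, y_pred, tol=0.5):
    """Fraction of predictions within `tol` of the human score (an assumption)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= tol))

y_true = [8.0, 7.5, 9.0, 6.0]                 # human language-use scores (invented)
preds = {
    "LR/RR": [8.2, 7.4, 8.8, 6.3],
    "RF":    [8.6, 7.0, 8.1, 6.9],
    "SVR":   [8.1, 7.6, 9.1, 5.8],
}

# Rank submodels: higher accuracy first, lower MAE breaks ties.
ranked = sorted(
    preds,
    key=lambda m: (-scoring_accuracy(y_true, preds[m]),
                   mean_absolute_error(y_true, preds[m])),
)
best = ranked[0]
```

With these made-up predictions the SVR submodel wins on both criteria and would be kept as the final language use scoring submodel.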
The text, audio and video features can each be defined in various ways.
In some embodiments, extracting the text features corresponding to the speech video comprises: performing speech recognition on the speech video to obtain the corresponding text; segmenting the text into words, counting the numbers of words and sentences from the segmentation result, part-of-speech tagging the words with a tagger, and obtaining the number of grammatical errors in the text with a grammar-checking algorithm; and extracting a word-vector matrix for the text with a bag-of-words method and a language model to capture context information.
As an example, speech recognition is first performed to obtain the text corresponding to the speech video; the text is segmented into words, and the numbers of words and sentences are counted. An N-gram tagger is then used for part-of-speech tagging to obtain the part of speech of each word; the number of grammatical errors is obtained with the LanguageTool algorithm; and a word-vector matrix is extracted by combining a bag-of-words method with an N-gram model to capture context information. The information obtained by these operations serves as the text features.
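A minimal sketch of the counting and bag-of-words steps (pure Python plus scikit-learn; the POS tagger and the LanguageTool grammar check are omitted, and the transcript is invented):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

transcript = ("Good morning everyone. Today I want to talk about climate "
              "change. It affects every one of us.")

# Sentence and word counts from a simple regex segmentation.
sentences = [s for s in re.split(r"[.!?]+\s*", transcript) if s]
words = re.findall(r"[A-Za-z']+", transcript)
n_sentences, n_words = len(sentences), len(words)

# Unigram + bigram bag-of-words matrix as a simple context representation.
vectorizer = CountVectorizer(ngram_range=(1, 2))
word_matrix = vectorizer.fit_transform(sentences)
```

The row counts and the n-gram matrix would then be concatenated with the POS and grammar-error statistics to form the text-feature vector.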
In some embodiments, the audio features include: short-term energy, fundamental frequency, voice intensity, number of syllables, speech rate, pause duration, average pause duration, number of pauses, utterance duration, average speech stream length, and intonation; the average speech stream length characterizes the speaker's fluency.
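Of the listed features, short-term energy is straightforward to sketch with NumPy; the frame length and hop size below are illustrative choices, not values from the patent:

```python
import numpy as np

def short_term_energy(signal, frame_len=400, hop=200):
    """Per-frame energy: sum of squared samples over a sliding window."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sum(f ** 2) for f in frames])

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)   # a voiced segment...
silence = np.zeros(sr // 2)                # ...followed by a pause
energy = short_term_energy(np.concatenate([tone, silence]))
```

Frames of near-zero energy mark pauses, from which pause counts and durations follow directly.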
In some embodiments, the lecture video is a depth video, wherein extracting video features corresponding to the lecture video includes: acquiring human body joint point position information corresponding to a speech video, and generating video characteristics according to the human body joint point position information; the video features include bounding box features, curvature features, energy features, symmetry features, spatial features, temporal features, acceleration features, head orientation estimates, and gaze estimates.
As an example, the bounding box feature represents the normalized volume of the smallest parallelepiped enclosing the speaker's body, characterizing how far the body "spreads out".
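Assuming an axis-aligned box over the tracked joint positions and normalization by body height (the patent does not specify the normalization), the feature might be computed as:

```python
import numpy as np

def bounding_box_volume(joints, height):
    """joints: (N, 3) joint positions in meters; returns height-normalized volume."""
    extents = joints.max(axis=0) - joints.min(axis=0)
    return float(np.prod(extents)) / height ** 3

joints = np.array([[0.0, 0.0, 0.0],   # foot
                   [0.2, 1.7, 0.1],   # head
                   [0.6, 1.2, 0.0]])  # outstretched hand
vol = bounding_box_volume(joints, height=1.7)
```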
The curvature feature represents the fluency and naturalness of movement: a sudden or rigid motion indicates that the speaker is not natural and fluent enough. Preferably, the curvature feature is the curvature of the hand trajectory, representing how fluent the motion is; when the motion is smooth, the curvature of the three-dimensional trajectory is close to 0.
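The curvature of a sampled 3D hand trajectory can be sketched with the standard formula kappa = |r' x r''| / |r'|^3, using finite differences for the derivatives (the trajectory below is synthetic):

```python
import numpy as np

def trajectory_curvature(points):
    """points: (N, 3) sampled hand positions; returns per-sample curvature."""
    v = np.gradient(points, axis=0)          # first difference ~ velocity
    a = np.gradient(v, axis=0)               # second difference ~ acceleration
    cross = np.cross(v, a)
    speed = np.linalg.norm(v, axis=1)
    return np.linalg.norm(cross, axis=1) / np.maximum(speed ** 3, 1e-12)

# A straight-line motion should have curvature near zero everywhere.
t = np.linspace(0, 1, 50)[:, None]
line = t * np.array([1.0, 2.0, 0.5])
kappa = trajectory_curvature(line)
```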
The energy feature represents the speaker's activity during the speech and is calculated from the velocity and corresponding mass of each limb segment of the body. The center of mass and mass of each limb segment are obtained from the proportional relations between the segment's center-of-mass position and the positions of the joints at its two ends, and between the segment's mass and the body weight.
The symmetry feature is the spatial symmetry of the hands computed about the vertical and horizontal axes. When a person is emotional, whole-body movement shows obvious lateral asymmetry; the system therefore extracts the symmetry of the gestures to characterize whether the limbs are relaxed and how they relate to emotional expression.
The spatial features characterize whether the speaker gestures and how the gestures move; they are represented by the distance between hand and body, the distance between arm and body, and the distance between the two hands.
The temporal features are the first derivatives of the spatial features, and the acceleration features are their second derivatives. Since the joint trajectories are discrete samples, finite differences are used for the calculation.
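A sketch of this differencing, with an invented hand-body distance series standing in for a spatial feature:

```python
import numpy as np

spatial = np.array([0.10, 0.12, 0.16, 0.22, 0.30])  # e.g. hand-body distance (m)
temporal = np.gradient(spatial)        # first difference ~ velocity
acceleration = np.gradient(temporal)   # second difference ~ acceleration
```

`np.gradient` uses central differences in the interior and one-sided differences at the endpoints, which is one reasonable way to approximate derivatives of scattered joint data.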
Head orientation estimation and gaze estimation characterize how the speaker uses eye contact to communicate with the audience and engage them through appropriate expressive delivery; the head orientation and gaze direction are extracted to measure the speaker's attention and degree of contact with the audience. OpenFace is used to obtain the head orientation and gaze direction.
In addition, statistical features such as the mean, peak, skewness, variance, root mean square and kurtosis can be extracted from each time series of each sample, and the correlation between each feature and the speech score can be analyzed so that weakly correlated and uncorrelated features are removed. Extracting a large number of multimodal features easily produces redundant information, so feature selection can be performed with PCA based on truncated singular value decomposition.
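A sketch of both steps, statistical summarization and truncated-SVD reduction, on synthetic series (the six statistics follow the list above; the number of components is an illustrative choice):

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.decomposition import TruncatedSVD

def series_stats(x):
    """Mean, peak, skewness, variance, RMS, kurtosis of one time series."""
    return np.array([x.mean(), x.max(), skew(x), x.var(),
                     np.sqrt(np.mean(x ** 2)), kurtosis(x)])

rng = np.random.default_rng(1)
series = rng.normal(size=(30, 200))            # 30 samples, one series each
features = np.vstack([series_stats(s) for s in series])  # (30, 6)

# Truncated-SVD-based dimensionality reduction to strip redundancy.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(features)          # (30, 3)
```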
And S104, acquiring an output result of the language use scoring sub-model, an output result of the speech expression scoring sub-model and an output result of the non-language scoring sub-model, and generating a fourth data set according to the output result of the language use scoring sub-model, the output result of the speech expression scoring sub-model, the output result of the non-language scoring sub-model and the comprehensive score.
And S105, training the model according to the fourth data set to obtain the multi-mode fusion learning model.
That is, training data are generated from the output results of the three sub-models together with the comprehensive score from the manual scoring, and the multi-mode fusion learning model is obtained by training on these data.
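This fusion step amounts to stacked regression: the three single-aspect scores are the inputs, the manual comprehensive score is the target. A minimal sketch with simulated scores (the 0.4/0.35/0.25 weighting and the noise level are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
# hypothetical single-aspect scores from the three sub-models (0-100 scale)
sub_scores = np.column_stack([
    rng.uniform(40, 100, n),   # language use
    rng.uniform(40, 100, n),   # speech expression
    rng.uniform(40, 100, n),   # non-language
])

# comprehensive score from manual grading, simulated as a weighted blend
# of the three aspects plus rater noise
composite = sub_scores @ np.array([0.4, 0.35, 0.25]) + rng.normal(0, 2, n)

# the fusion model learns how graders weight the three aspects
fusion = LinearRegression().fit(sub_scores, composite)
final_score = fusion.predict(sub_scores[:1])
```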
And S106, acquiring the speech video to be scored, extracting text features, audio features and video features corresponding to the speech video to be scored, and respectively inputting the text features, the audio features and the video features corresponding to the speech video to be scored into the language usage scoring submodel, the speech expression scoring submodel and the non-language scoring submodel so as to output corresponding single scoring through the language usage scoring submodel, the speech expression scoring submodel and the non-language scoring submodel.
And S107, inputting the single item scores into the multi-mode fusion learning model so as to output the final scoring result corresponding to the speech video to be scored through the multi-mode fusion learning model.
In summary, the college English speaking multi-mode automatic scoring method first acquires historical speech data, where the historical speech data include speech videos and the manual scoring results corresponding to the speech videos, and the manual scoring results include language usage scores, speech expression scores, non-language scores and comprehensive scores; next, it extracts the text features, audio features and video features corresponding to each speech video, generates a first data set from the text features and the language usage scores, a second data set from the audio features and the speech expression scores, and a third data set from the video features and the non-language scores; it then trains models on the first, second and third data sets to obtain the language usage scoring sub-model, the speech expression scoring sub-model and the non-language scoring sub-model, respectively; it then obtains the output results of the three scoring sub-models and generates a fourth data set from these output results together with the comprehensive scores, and trains a model on the fourth data set to obtain the multi-mode fusion learning model; for scoring, it acquires a speech video to be scored, extracts its text features, audio features and video features, inputs them into the language usage scoring sub-model, the speech expression scoring sub-model and the non-language scoring sub-model respectively to output the corresponding single scores, and finally inputs the single scores into the multi-mode fusion learning model to output the final scoring result for the speech video to be scored. In this way, multi-modal scoring of English speeches is realized, scoring accuracy and efficiency are improved, and the cost of English speech scoring is reduced.
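The end-to-end scoring flow summarized above can be sketched as follows. All names are hypothetical stand-ins, and trivial averaging functions replace the trained sub-models and fusion model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class SpeechScorer:
    """Minimal sketch of the pipeline: three single-aspect sub-models
    followed by a fusion model over their outputs."""
    language_model: Callable[[Sequence[float]], float]
    expression_model: Callable[[Sequence[float]], float]
    nonverbal_model: Callable[[Sequence[float]], float]
    fusion_model: Callable[[Sequence[float]], float]

    def score(self, text_feats, audio_feats, video_feats):
        # each modality goes to its own sub-model for a single score
        singles = [
            self.language_model(text_feats),
            self.expression_model(audio_feats),
            self.nonverbal_model(video_feats),
        ]
        # the fusion model turns the single scores into the final score
        return self.fusion_model(singles)

# stand-in models: each averages its inputs; fusion averages the singles
avg = lambda xs: sum(xs) / len(xs)
scorer = SpeechScorer(avg, avg, avg, avg)
final = scorer.score([80, 90], [70], [60, 70, 80])
```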
In order to implement the above embodiments, a second aspect embodiment of the present invention proposes a computer readable storage medium, on which a college english speaking multimodal automatic scoring program is stored, which when executed by a processor implements the college english speaking multimodal automatic scoring method as described above.
According to the computer-readable storage medium of the embodiment of the invention, by storing the college English speaking multi-mode automatic scoring program, the processor, when executing the program, implements the college English speaking multi-mode automatic scoring method described above; multi-modal scoring of English speeches is thus realized, scoring accuracy and efficiency are improved, and the cost of English speech scoring is reduced.
In order to implement the foregoing embodiments, a third aspect of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the college english speaking multimodal automatic scoring method as described above.
According to the computer device of the embodiment of the invention, the memory stores the college English speaking multi-mode automatic scoring program, so that the processor implements the college English speaking multi-mode automatic scoring method when executing the program; multi-modal scoring of English speeches is thus realized, scoring accuracy and efficiency are improved, and the cost of English speech scoring is reduced.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above should not be understood to necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (9)

1. A multimodality automatic scoring method for college English speaking is characterized by comprising the following steps:
acquiring historical speech data, wherein the historical speech data comprises speech videos and manual scoring results corresponding to the speech videos, and the manual scoring results comprise language use scores, speech expression scores, non-language scores and comprehensive scores;
extracting text features, audio features and video features corresponding to the lecture video, generating a first data set according to the text features and the language usage scores, generating a second data set according to the audio features and the speech expression scores, and generating a third data set according to the video features and the non-language scores;
training the model according to the first data set to obtain a language use scoring submodel, training the model according to the second data set to obtain a speech expression scoring submodel, and training the model according to the third data set to obtain a non-language scoring submodel;
acquiring an output result of the language use scoring submodel, an output result of the speech expression scoring submodel and an output result of the non-language scoring submodel, and generating a fourth data set according to the output result of the language use scoring submodel, the output result of the speech expression scoring submodel, the output result of the non-language scoring submodel and the comprehensive score;
training a model according to the fourth data set to obtain a multi-mode fusion learning model;
acquiring a speech video to be scored, extracting text features, audio features and video features corresponding to the speech video to be scored, and respectively inputting the text features, the audio features and the video features corresponding to the speech video to be scored into the language usage scoring sub-model, the speech expression scoring sub-model and the non-language scoring sub-model so as to output corresponding single scoring through the language usage scoring sub-model, the speech expression scoring sub-model and the non-language scoring sub-model;
and inputting the single scoring into the multi-mode fusion learning model so as to output a final scoring result corresponding to the lecture video to be scored through the multi-mode fusion learning model.
2. The college english speaking multimodal automatic scoring method according to claim 1, wherein the text features, audio features and video features each include time series features and statistical features.
3. The college english speaking multi-modal automatic scoring method according to claim 1, wherein the first data set includes a first training set and a first testing set, and wherein training of the model based on the first data set to obtain the language usage scoring submodel comprises:
inputting the text features and the corresponding language usage scores in the first training set into initial LR/RR (linear/ridge regression), RF (random forest) and SVR (support vector regression) models;
performing coarse tuning of the hyper-parameters by K-fold cross-validation and fine tuning of the hyper-parameters by grid search, so as to obtain the LR/RR, RF and SVR models corresponding to the text features;
and respectively inputting the first test set into LR/RR, RF and SVR models corresponding to the text features for performance evaluation, and determining a final language usage scoring model according to the evaluation result.
4. The college english speaking multi-modal automatic scoring method according to claim 3, wherein the inputting the first test set into the LR/RR, RF, SVR models corresponding to the text features for performance evaluation comprises:
inputting the first test set into the LR/RR, RF and SVR models corresponding to the text features respectively, and outputting corresponding results through the LR/RR, RF and SVR models corresponding to the text features;
and respectively calculating the scoring accuracy and the mean absolute error of the LR/RR, RF and SVR models according to the corresponding results, so as to select the optimal sub-model according to the scoring accuracy and the mean absolute error.
5. The multi-modal automatic scoring method for college english speaking according to claim 1, wherein extracting the text feature corresponding to the video of the speaking comprises:
performing voice recognition on the speech video to acquire a text corresponding to the speech video;
performing word segmentation on the text, counting the numbers of words and sentences from the segmentation result, performing part-of-speech tagging on the words in the text with a part-of-speech tagger, and obtaining the number of grammatical errors in the text through a grammar-checking algorithm;
and extracting a word vector matrix corresponding to the text by using the bag-of-words method and a language model to obtain context information.
6. The college english speaking multimodal automatic scoring method according to claim 1, characterized in that the audio features include: short-term energy, fundamental frequency, voice intensity, number of syllables, speech rate, pause duration, average pause duration, pause times, utterance duration, average speech stream length, and intonation.
7. The multi-modal automatic scoring method for english speaking at university according to claim 1, wherein the video of speaking is a depth video, and wherein extracting the video features corresponding to the video of speaking comprises:
acquiring human body joint point position information corresponding to the lecture video, and generating video characteristics according to the human body joint point position information;
the video features include bounding box features, curvature features, energy features, symmetry features, spatial features, temporal features, acceleration features, head orientation estimates, and gaze estimates.
8. A computer-readable storage medium, on which a college english speaking multi-modal automatic scoring program is stored, which when executed by a processor implements the college english speaking multi-modal automatic scoring method according to any one of claims 1 to 7.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the college english speaking multimodal automatic scoring method according to any one of claims 1 to 7 when executing the program.
CN202111447603.2A 2021-11-30 2021-11-30 College English speaking multi-mode automatic scoring method Pending CN114187544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111447603.2A CN114187544A (en) 2021-11-30 2021-11-30 College English speaking multi-mode automatic scoring method


Publications (1)

Publication Number Publication Date
CN114187544A true CN114187544A (en) 2022-03-15

Family

ID=80603133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111447603.2A Pending CN114187544A (en) 2021-11-30 2021-11-30 College English speaking multi-mode automatic scoring method

Country Status (1)

Country Link
CN (1) CN114187544A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921284A (en) * 2018-06-15 2018-11-30 山东大学 Interpersonal interactive body language automatic generation method and system based on deep learning
WO2019237708A1 (en) * 2018-06-15 2019-12-19 山东大学 Interpersonal interaction body language automatic generation method and system based on deep learning
CN112329593A (en) * 2020-11-03 2021-02-05 北京中科深智科技有限公司 Gesture generation method and gesture generation system based on stylization
CN113205729A (en) * 2021-04-12 2021-08-03 华侨大学 Foreign student-oriented speech evaluation method, device and system
CN112969065A (en) * 2021-05-18 2021-06-15 浙江华创视讯科技有限公司 Method, device and computer readable medium for evaluating video conference quality

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Qihai: "The Application of English Public Speaking in Comprehensive English Classes", Journal of Civil Aviation Flight University of China, no. 01, 15 January 2016 (2016-01-15), pages 69-73 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545042A (en) * 2022-11-25 2022-12-30 北京优幕科技有限责任公司 Speech manuscript quality evaluation method and device
CN117414135A (en) * 2023-10-20 2024-01-19 郑州师范学院 Behavioral and psychological abnormality detection method, system and storage medium
CN117522643A (en) * 2023-12-04 2024-02-06 新励成教育科技股份有限公司 Talent training method, device, equipment and storage medium
CN117522643B (en) * 2023-12-04 2024-05-10 新励成教育科技股份有限公司 Talent training method, device, equipment and storage medium
CN117788239A (en) * 2024-02-23 2024-03-29 新励成教育科技股份有限公司 Multi-mode feedback method, device, equipment and storage medium for talent training
CN117788239B (en) * 2024-02-23 2024-05-31 新励成教育科技股份有限公司 Multi-mode feedback method, device, equipment and storage medium for talent training


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination