CN115905475A - Answer scoring method, model training method, device, storage medium and equipment - Google Patents


Publication number
CN115905475A
Authority
CN
China
Prior art keywords
training
text
representation
examinee
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211672686.XA
Other languages
Chinese (zh)
Inventor
朱磊
李�浩
盛志超
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211672686.XA priority Critical patent/CN115905475A/en
Publication of CN115905475A publication Critical patent/CN115905475A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses an answer scoring method, a model training method, a device, a storage medium and equipment. The method comprises the following steps: acquiring the voice of an examinee's answer in a spoken language test; determining, according to the voice, a text sequence of the examinee's full answer text and a phoneme sequence of the full answer text; determining a full-text speech level representation of the examinee's answer according to the voice and the phoneme sequence; determining a full-text language level representation of the answer according to the text sequence; acquiring a plurality of model essay key points of the spoken language test and determining a global key point representation of the answer according to the plurality of model essay key points and the text sequence; and scoring the examinee's answer according to the full-text speech level representation, the full-text language level representation and the global key point representation to obtain a score for the answer.

Description

Answer scoring method, model training method, device, storage medium and equipment
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to an answer scoring method, a model training method, an answer scoring apparatus, a computer-readable storage medium and a computer device.
Background
In recent years, with the development of artificial intelligence technology, it has been applied in more and more scenarios, such as spoken language evaluation in spoken language examinations. In scenarios such as the English oral tests of the senior high school and college entrance examinations, machine scoring has largely replaced manual scoring. The picture-description question in the English oral test examines the examinee's overall spoken English; it is evaluated from four aspects: the examinee's pronunciation level, the completeness of the answer's key points, the language level of the answer content, and pronunciation fluency.
While machine scoring is generally superior to manual scoring, it still produces some grossly wrong scores. For example, when the examinee answers off-topic, or when the content merely matches some keywords while the key points are wrong or only weakly similar (the examinee's expression shares keywords with part of the picture content but means something different), the machine score comes out too high. All of this reduces the accuracy of machine scoring for picture-description questions.
Disclosure of Invention
The embodiments of the application provide an answer scoring method, a model training method, a device, a computer-readable storage medium and a computer device, which can improve the stability and accuracy of machine scoring of picture-description questions.
The embodiment of the application provides an answer scoring method, which comprises the following steps:
acquiring the voice of an examinee's answer in a spoken language test, and determining, according to the voice, a text sequence of the examinee's full answer text and a phoneme sequence of the full answer text;
determining a full-text speech level representation of the examinee's answer according to the voice and the phoneme sequence, and determining a full-text language level representation of the answer according to the text sequence;
acquiring a plurality of model essay key points of the spoken language examination, and determining a global key point representation of the examinee's answer according to the plurality of model essay key points and the text sequence;
and scoring the examinee's answer according to the full-text speech level representation, the full-text language level representation and the global key point representation to obtain a score for the answer.
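The four steps above can be sketched, purely illustratively, as follows. Every function name, dimension and weight here is a hypothetical stand-in for a model component described in the embodiments; none of them come from the patent itself.

```python
import numpy as np

# Hypothetical stand-ins for the three representation modules; each returns a
# fixed vector so the fusion step below is easy to follow.
def speech_level_repr(speech, phoneme_seq):
    # stand-in for the speech level modeling module (encoder-decoder)
    return np.ones(4)

def language_level_repr(text_seq):
    # stand-in for the language level modeling module (pre-trained LM)
    return np.ones(4)

def global_key_point_repr(model_essay_points, text_seq):
    # stand-in for the key point processing module
    return np.ones(4)

def score_answer(speech, phoneme_seq, text_seq, model_essay_points):
    """Fuse the three representations and map them to a single score."""
    h = np.concatenate([
        speech_level_repr(speech, phoneme_seq),
        language_level_repr(text_seq),
        global_key_point_repr(model_essay_points, text_seq),
    ])
    w = np.full(h.shape, 1.0 / h.size)  # toy linear fusion weights
    return float(w @ h)

print(score_answer(np.zeros(16000), ["w", "ah", "n"], ["one"], ["key point A"]))  # 1.0
```

The point of the sketch is the fusion structure: three representations computed from different views of the same answer are concatenated before a single scoring head, rather than each being scored separately.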
The embodiment of the application also provides a spoken language scoring model training method, which comprises the following steps:
acquiring a training data set and an initial spoken language scoring model, wherein the training data set comprises a plurality of training samples from a spoken language test, and each training sample comprises the training voice of an examinee's answer, the label score of the answer, and the training text sequence and training phoneme sequence of the examinee's full answer text determined according to the training voice;
inputting the training voice and the training phoneme sequence into a speech level modeling module of the initial spoken language scoring model for encoding and decoding to determine the training full-text speech level representation of the examinee's answer, and inputting the training text sequence into a language level modeling module of the initial spoken language scoring model for language extraction to determine the training full-text language level representation of the answer;
acquiring a plurality of training model essay key points of the spoken language test, and inputting the training model essay key points and the training text sequence into a key point processing module of the initial spoken language scoring model for key point semantic processing to determine the training global key point representation of the examinee's answer;
inputting the training full-text speech level representation, the training full-text language level representation and the training global key point representation into a fusion module of the initial spoken language scoring model for scoring to obtain a training score for the examinee's answer;
and updating the initial spoken language scoring model according to the training score and the label score to obtain the spoken language scoring model.
The embodiment of the present application further provides an answer scoring device, including:
the voice recognition method comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring voice of an examinee answering in a spoken language test, and a text sequence of a full text of the examinee answering and a phoneme sequence of the full text of the examinee answering which are determined according to the voice;
the voice representing unit is used for determining full-text voice level representation of the answer of the examinee according to the voice and the phoneme sequence;
the language representation unit is used for determining full-text language level representation of the answer of the examinee according to the text sequence;
the second acquisition unit is used for acquiring a plurality of model essay key points of the spoken language test;
the gist representation unit is used for determining the global key point representation of the examinee's answer according to the plurality of model essay key points and the text sequence;
and the scoring unit is used for scoring the answer of the examinee according to the full-text speech level representation, the full-text language level representation and the global main point representation so as to obtain the score of the answer of the examinee.
The embodiment of the present application further provides a spoken language scoring model training device, including:
the system comprises a first training acquisition unit, a second training acquisition unit and a third training acquisition unit, wherein the first training acquisition unit is used for acquiring a training data set and an initial spoken language scoring model, the training data set comprises a plurality of training samples in a spoken language test, and each training sample comprises training voice answered by each examinee, label scoring answered by the examinee, a training text sequence of a full text answered by the examinee determined according to the training voice and a training phoneme sequence of the full text answered by the examinee;
the training speech representation unit is used for inputting the training voice and the training phoneme sequence into a speech level modeling module of the initial spoken language scoring model for encoding and decoding, so as to determine the training full-text speech level representation of the examinee's answer;
the training language representation unit is used for inputting the training text sequence into a language level modeling module of the initial spoken language scoring model for language extraction processing, so as to determine the training full-text language level representation of the examinee's answer;
the second training acquisition unit is used for acquiring a plurality of training paradigm key points of the spoken language test;
the training key point representation unit is used for inputting the plurality of training model essay key points and the training text sequence into a key point processing module of the initial spoken language scoring model for key point semantic processing, so as to determine the training global key point representation of the examinee's answer;
the training scoring unit is used for inputting the training full-text speech level representation, the training full-text language level representation and the training global key point representation into a fusion module of the initial spoken language scoring model for scoring, so as to obtain the training score of the examinee's answer;
and the updating unit is used for updating the initial spoken language scoring model according to the training score and the label score so as to obtain the spoken language scoring model.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, where the computer program is suitable for being loaded by a processor to perform the steps in the method according to any of the above embodiments.
An embodiment of the present application further provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and the processor executes the steps in the method according to any one of the above embodiments by calling the computer program stored in the memory.
According to the answer scoring method, the model training method, the device, the computer-readable storage medium and the computer device, the voice of an examinee's answer in a spoken language test is acquired, together with the text sequence and the phoneme sequence of the full answer text determined according to the voice; the full-text speech level representation of the answer is determined according to the voice and the phoneme sequence, and the full-text language level representation according to the text sequence; a plurality of model essay key points of the spoken language test are acquired, and the global key point representation of the answer is determined according to the plurality of model essay key points and the text sequence; the answer is then scored according to the full-text speech level representation, the full-text language level representation and the global key point representation to obtain the examinee's score. By adding the global key point representation and fusing representations from several different perspectives, the method improves accuracy while also attending to semantic errors at the key point level and to the examinee's pronunciation level, so that grossly wrong machine scores are avoided and the stability and accuracy of machine scoring are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart illustrating an answer scoring method according to an embodiment of the present disclosure.
Fig. 2 is a schematic network structure diagram of a spoken language scoring model according to an embodiment of the present disclosure.
Fig. 3 is a schematic network structure diagram of a speech ED model according to an embodiment of the present application.
Fig. 4 is a sub-flow diagram of an answer scoring method according to an embodiment of the present disclosure.
Fig. 5 is a schematic flowchart of training of a gist extraction model provided in the embodiment of the present application.
Fig. 6 is a schematic usage flow diagram of a gist extraction model provided in an embodiment of the present application.
Fig. 7 is a schematic flow chart of a spoken language scoring model training method according to an embodiment of the present application.
Fig. 8 is a flowchart of a spoken language scoring model training method according to an embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of an answer scoring device according to an embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of a spoken language scoring model training device according to an embodiment of the present application.
Fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiments of the application provide an answer scoring method, an answer scoring device, a computer-readable storage medium and a computer device. Specifically, the answer scoring method may be executed by a computer device in which the answer scoring device is integrated. The device may be integrated in one or more computer devices; for example, the process of training the spoken language scoring model may be executed on one computer device while the process of using the model is executed on another, with the corresponding functions integrated in each device respectively.
The computer device may be a terminal device or a server or the like. The terminal device can be a smart phone, a tablet Computer, a notebook Computer, a touch screen, a game machine, a Personal Computer (PC), a smart car terminal, a robot, and the like. The server can be an independent physical server, a service node in a blockchain system, a server cluster formed by a plurality of physical servers, and a cloud server providing basic cloud computing services such as cloud services, a cloud database, cloud computing, cloud storage, big data and an artificial intelligence platform.
Before the solution in the embodiments of the present application is described in detail, the current scoring schemes are further analyzed.
In the spoken language evaluation scenario of a spoken language test, take the English picture-description question as an example, where the examinee's spoken answers are scored. Current machine scoring may produce grossly wrong scores, for example in the following error scenarios.
(1) The examinee answers off-topic, or the content merely matches keywords while the key points are wrong or only weakly similar; the examinee's expression shares keywords with part of the picture content but means something different, and the machine score comes out too high. For example, suppose the picture shows (the answer is in English; the content is paraphrased here for convenience of explanation) two classmates reading aloud in a library while a manager nearby urges them to be "quiet", with a "quiet" sign hanging on the wall. If the examinee answers "two classmates are singing in the library", or merely mentions the keyword "library", or gives a low-similarity answer such as "how inspiring the pictures drawn in the library books are", the answer is scored too high. Because existing scoring schemes model the language level and content of the answer at the full-text level or the keyword level, they cannot attend to semantic errors at the key point level; they do not treat such expressions as wrong, so the score is too high.
(2) The examinee's answer is concise, but the key points are all covered, and the machine score comes out too low. For example, the picture shows (the answer is in English; the content is paraphrased here for convenience of explanation) a boy playing chess with his grandfather at a table, his grandmother watching the two of them and smiling beside them, a bird flying past the window, and lush succulent plants; the intended moral is that we should spend more time with the elderly at home. Some examinees answer only that the boy keeps his grandfather and grandmother company and plays with them on the weekend; although the detailed description is omitted, the most important key points are all answered, and such answers receive high scores in the scoring calibration set. Existing schemes often overfit to the answer length, do not attend to whether the key points are matched, and do not model the differing importance of key points, treating every sentence as equally important; as a result the score is too low or even 0.
(3) The picture description is open-ended and the examinee's answer is expressed in a novel way; the expression does not appear in the teacher's reference samples, but the content fits the meaning of the picture or belongs to a higher-level expression, and the machine score comes out too low. For example, the picture shows (the answer is in English; the content is paraphrased here for convenience of explanation) a boy feeding chickens on a farm and a girl feeding grass to a sheep, and the intended moral is that we should care for animals. Some examinees conclude instead that the story tells us to care for nature, or to develop more hobbies on the weekend; a human scorer would not deduct points given the picture's meaning, but existing scoring schemes do, because they can only learn the limited key point expressions present in the training set and have no way to score unseen expressions correctly, so divergent expressions are not covered.
Therefore, the present disclosure provides an answer scoring method for solving at least one of the above problems. The answer scoring method, apparatus, computer-readable storage medium, and computer device in the embodiments of the present application will be described in detail below.
Fig. 1 is a schematic flowchart of an answer scoring method applied to a computer device in an embodiment of the present application, and the method includes the following steps.
101, obtaining the voice of the answer of the examinee in the spoken language test, and determining the text sequence of the full answer text of the examinee and the phoneme sequence of the full answer text of the examinee according to the voice.
The spoken language test can be any of various spoken tests, such as an English oral test or a Mandarin oral test. In an English oral test, for example, the examinee takes the picture-description question by viewing its picture and describing it aloud. The following description takes the English oral test as an example.
In the spoken language examination, the voice of the examinee's answer is acquired and subjected to speech recognition to obtain the full text corresponding to the voice, i.e. the text sequence of the examinee's full answer text, such as the word sequence of the answer in an English oral test. The text sequence is then converted to obtain the phoneme sequence of the full answer text. In this way, both the text sequence and the phoneme sequence of the examinee's full answer are obtained.
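The text-to-phoneme conversion in this step can be sketched with a toy lexicon. A real system would use a trained G2P tool; the dictionary entries below are illustrative, not authoritative pronunciations.

```python
# Toy grapheme-to-phoneme lexicon; the entries are illustrative only.
TOY_LEXICON = {
    "one": ["w", "ah", "n"],
    "day": ["d", "ey"],
}

def text_to_phonemes(word_sequence):
    """Flatten the word sequence of the full answer text into a phoneme sequence."""
    phonemes = []
    for word in word_sequence:
        phonemes.extend(TOY_LEXICON.get(word, []))  # unknown words are skipped here
    return phonemes

print(text_to_phonemes(["one", "day"]))  # ['w', 'ah', 'n', 'd', 'ey']
```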
Since spoken language scoring models will be referred to below, a brief introduction is given here. The spoken language scoring model includes six modules: a speech level modeling module, a language level modeling module, a general-semantics picture-to-text generation module (not needed in some cases), a key-point-level semantic matching module, a key point importance module and a fusion module, as shown in Fig. 2.
The speech level modeling module is used for determining the full-text speech level representation, covering pronunciation accuracy and fluency; the language level modeling module is used for determining the full-text language level representation, covering language level and content; and the general-semantics picture-to-text generation module is used for generating a large number of general model essay key points from the pictures of picture-description questions in the spoken language test, so as to expand the model essay key point library. The key-point-level semantic matching module comprises a key point extraction model and a key point matching pre-training model, wherein the key point extraction model is used for extracting the corresponding key points of the examinee's answer according to the plurality of model essay key points. The key point importance module is used for determining the global key point representation of the examinee's answer. The fusion module is used for fusing the full-text speech level representation output by the speech level modeling module, the full-text language level representation output by the language level modeling module and the global key point representation output by the key point importance module, and for producing the final score.
In some embodiments, the general-semantics picture-to-text generation module (which may be omitted in some cases), the key-point-level semantic matching module and the key point importance module may collectively be referred to as the key point processing module.
102, determining a full-text speech level representation of the examinee's answer according to the voice and the phoneme sequence.
Specifically, the voice is divided into a plurality of speech segments, a corresponding plurality of phoneme segments is determined according to the speech segments, and each speech segment and its corresponding phoneme segment are encoded and decoded to obtain the hidden layer representation of that speech segment; the hidden layer representations of the plurality of speech segments are then processed at the phoneme level to obtain the full-text speech level representation of the examinee's answer.
This part relates to the speech level modeling module of the spoken language scoring model, which models the speech level of the examinee's answer, mainly covering two aspects: the pronunciation accuracy of each phoneme and the fluency of the whole speech.
First, the voice of the examinee's answer is divided into a plurality of speech segments, e.g. v segments, by a voice detection algorithm such as VAD (Voice Activity Detection); a trained Automatic Speech Recognition (ASR) model is then used to obtain the transcribed text segment corresponding to each speech segment. The v speech segments and their transcripts can be written as {(S_1, A_1), (S_2, A_2), …, (S_v, A_v)}, where S_i and A_i denote the i-th speech segment and its corresponding transcribed text segment. Each transcribed text segment is converted into a corresponding phoneme segment, e.g. using a G2P (Grapheme-to-Phoneme) tool to convert the ASR transcript A_i into the phoneme segment P_i, thereby obtaining a plurality of speech segments and a corresponding plurality of phoneme segments.
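The segmentation and pairing described above can be sketched as follows. Here `fake_vad_split` and `fake_asr` are placeholders for a real VAD algorithm and a trained ASR model (not actual tool APIs), and the phoneme lexicon is a toy one.

```python
TOY_G2P = {"one": ["w", "ah", "n"], "day": ["d", "ey"]}

def fake_vad_split(audio, n_segments):
    # placeholder for a real VAD algorithm; just cuts the waveform into equal chunks
    step = len(audio) // n_segments
    return [audio[i * step:(i + 1) * step] for i in range(n_segments)]

def fake_asr(segment):
    # placeholder for a trained ASR model; returns a canned transcript
    return ["one", "day"]

def build_pairs(audio, n_segments):
    """Return the list of (S_i, A_i, P_i) speech, text and phoneme segments."""
    pairs = []
    for seg in fake_vad_split(audio, n_segments):
        words = fake_asr(seg)
        phonemes = [p for w in words for p in TOY_G2P[w]]
        pairs.append((seg, words, phonemes))
    return pairs

pairs = build_pairs(list(range(100)), n_segments=2)
print(len(pairs), pairs[0][2])  # 2 ['w', 'ah', 'n', 'd', 'ey']
```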
Each speech segment and its corresponding phoneme segment are then encoded and decoded to obtain the hidden layer representation of the speech segment. For example, each speech segment of the examinee's answer and its corresponding phoneme segment are input into the speech encoder-decoder (Encoder-Decoder) model in the speech level modeling module for encoding and decoding to obtain the hidden layer representation of each speech segment; the speech encoder-decoder model is referred to as the speech ED model for short, as shown in Fig. 2 and Fig. 3.
Fig. 3 is a simplified schematic of the network structure of the speech ED model. The speech encoder (encoder layer) in the speech ED model may be a 6-layer convolutional Transformer structure, such as a 6-layer VGG-Transformer or Conformer, whose input is the speech segment S_i; the decoder (decoder layer) may be a 3-layer Transformer structure, whose input is the phoneme segment P_i. In Fig. 3, P_i = {w, ah, n, d, ey, ih}, which represents the phoneme segment of the transcribed text segment "one day". It should be noted that the speech ED model may also use other network structures that achieve the same function.
Each speech segment (speech X_wav) is input into the speech ED model and encoded by its speech encoder (encoder layer) to obtain a speech encoding result. The encoding result is fed into the phone attention layer of the speech ED model to obtain the attention weight of each phoneme in the phoneme segment corresponding to the speech segment. The attention weights and the phonemes of the corresponding phoneme segment are then input into the decoder (decoder layer) of the speech ED model for decoding, yielding the hidden layer output (phone hidden) of each phoneme; the hidden layer outputs of the phonemes in the segment together form the hidden layer representation of the speech segment.
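At shape level, this encode-attend-decode pass can be sketched with plain numpy. The dimensions and random inputs are illustrative, and the real encoder and decoder stacks are replaced by identity-like stand-ins; this shows only how one hidden vector per phoneme falls out of the attention step.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def speech_ed_hidden(speech_frames, phoneme_embs):
    """speech_frames: [T, d]; phoneme_embs: [P, d]; returns [P, d] phone hidden."""
    enc = speech_frames                    # stand-in for the 6-layer speech encoder
    attn = softmax(phoneme_embs @ enc.T)   # phone attention weights, shape [P, T]
    context = attn @ enc                   # attended speech context per phoneme
    return context + phoneme_embs          # stand-in for the 3-layer decoder

T, P, d = 40, 6, 512                       # 6 phonemes, e.g. P_i of "one day"
hidden = speech_ed_hidden(rng.standard_normal((T, d)),
                          rng.standard_normal((P, d)))
print(hidden.shape)  # (6, 512): one hidden vector per phoneme
```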
Hidden layer representations of the plurality of speech segments are obtained in the same manner, and are averaged at the phoneme level to obtain the full-text speech level representation h_speech, which covers pronunciation accuracy and fluency.
For example, if the ASR transcript A_1 is "one day", the corresponding phoneme segment is P_1 = {w, ah, n, d, ey, ih}, and the speech ED model outputs a hidden layer representation h_speech1 of the 6 phonemes with dimension [6, 512], where 512 is the dimension of each phoneme's hidden representation. If the ASR transcript A_2 is "woman buy something", the corresponding phoneme segment is P_2 = {w, uh, m, ax, n, b, ay, s, ah, m, th, ih, ng}, and the model outputs a hidden layer representation h_speech2 of the 13 phonemes with dimension [13, 512]. "Averaging the hidden layer representations of the plurality of speech segments at the phoneme level" means averaging the hidden representations of the same phoneme. Assuming English has 48 phonemes in total, to obtain the hidden representation of the phoneme "w" we take the representation of "w" at position 1 of h_speech1 and the representation of "w" at position 1 of h_speech2, and average the two to obtain the final hidden representation of "w". Proceeding analogously for every phoneme yields the phoneme-level hidden representation "phone hidden" with dimension [48, 512], where 48 is the size of the English phoneme inventory; this is the full-text speech level representation covering pronunciation accuracy and fluency.
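The phoneme-level averaging can be sketched as follows. The phoneme inventory and the hidden dimension d = 4 are scaled down from 48 and 512 for illustration; the pooling logic is the same.

```python
import numpy as np

PHONE_LIST = ["w", "ah", "n", "d", "ey", "ih", "uh", "m"]  # toy phoneme inventory

def pool_by_phoneme(segments, d):
    """segments: list of (phoneme_labels, hidden [num_phonemes, d]) pairs."""
    sums = np.zeros((len(PHONE_LIST), d))
    counts = np.zeros(len(PHONE_LIST))
    for labels, hidden in segments:
        for i, phone in enumerate(labels):
            j = PHONE_LIST.index(phone)
            sums[j] += hidden[i]          # accumulate every occurrence of the phone
            counts[j] += 1
    counts[counts == 0] = 1               # phones never seen keep a zero vector
    return sums / counts[:, None]

# "w" occurs once in each segment, with hidden vectors [2,2,2,2] and [4,4,4,4]
seg1 = (["w", "ah", "n"], np.array([[2.0] * 4, [1.0] * 4, [1.0] * 4]))
seg2 = (["w", "ah"], np.array([[4.0] * 4, [3.0] * 4]))
pooled = pool_by_phoneme([seg1, seg2], d=4)
print(pooled[PHONE_LIST.index("w")])  # [3. 3. 3. 3.]
```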
It should be noted that, in the present application, the speech ED model is not used to directly output a score; instead, its hidden layer features are taken as the full-text speech level representation of the examinee's answer. Equivalently, the speech ED model in the present application has no final linear connection layer/linear mapping layer, such as a fully connected layer and/or softmax layer, and the hidden layer feature output is taken directly.
And 103, determining full-text language level representation of the answer of the examinee according to the text sequence.
This part relates to the language level modeling module of the spoken language scoring model, which models the language level of the examinee's answer based on the full text, mainly covering two aspects: language level and content.
The language level modeling module uses a language pre-training model, such as a language BERT model. The language BERT model can be a stack of multiple layers, such as 12 Transformer layers, and is a language representation model pre-trained with a cloze (fill-in-the-blank) task on public text data such as Wikipedia. The language pre-training model may also be any other model that performs the same function.
Wherein, the text sequence of the examinee's answer full text, such as a word sequence, together with the start identifier of the text sequence, is input into the language BERT model for full-text semantic extraction to obtain the full-text hidden layer output. Finally, the hidden layer output at the position corresponding to the start identifier of the text sequence is taken as the full-text language level representation h_lm of the examinee's answer.
For example, if the text sequence of the examinee's answer full text is "woman buy something", the input to the language BERT model is the text tagged with the start identifier of the text sequence, e.g. "<cls> woman buy something", where <cls> is the start identifier of the text sequence. After processing by the language BERT model, the hidden layer output at the position corresponding to <cls> is taken as h_lm, i.e. the "language" in fig. 2.
It should also be noted that this module likewise does not output a final score, but directly takes the hidden layer output. Equivalently, the language pre-training model in this module has no final linear connection layer/linear mapping layer, such as a fully connected layer and/or softmax layer; only the corresponding hidden layer output is needed.
And 104, acquiring a plurality of model essay key points of the spoken language test, and determining global key point representation of the answer of the examinee according to the plurality of model essay key points and the text sequence.
A plurality of model essay key points of the spoken language test can be obtained in advance and stored in a model essay key point library. The model essay key points in the reference answers given by the teacher are only the most basic ones and cannot cover some novel expressions; therefore, additional model essay key points can be produced by intelligent generation to expand the possible key point expressions of the pictures in picture-description questions, thereby enlarging the model essay key point library.
In one aspect, the step of obtaining a plurality of model essay key points of the spoken language test includes: acquiring the test picture of the picture-description question corresponding to the spoken language test; performing picture encoding on the test picture to obtain a picture encoding result; and generating a plurality of model essay key points of the test picture according to the picture encoding result. By automatically generating model essay key points, the model essay key point library is expanded to contain more key points and thus cover more divergent expressions. This avoids the situation where, because picture descriptions are divergent, the examinee's answer content is novel, or the examinee's expression does not appear in the teacher's reference answer even though it fits the meaning of the test picture or belongs to high-level expression, and the machine score is consequently too low.
The generation of the plurality of model essay key points involves a general-semantics image-text generation module. The general-semantics image-text generation module includes an image-text generation model, such as the image-text generation CLIPCap model shown in fig. 2. The image-text generation model takes as input the picture of the picture-description question of the spoken language test and outputs a large number of model essay key points related to the picture content, i.e. it generates a model essay key point expression library.
Specifically, a picture encoding module in the image-text generation model, such as the CLIP (Contrastive Language-Image Pre-training) model in the image-text generation CLIPCap model, encodes the test picture to obtain a picture encoding result, and a picture decoding module in the image-text generation model, such as the GPT2 (Generative Pre-trained Transformer 2) model in the image-text generation CLIPCap model, decodes the picture encoding result to generate a plurality of model essay key points of the test picture. That is, the decoding module, such as the GPT2 model, performs the generation task of producing text expressions corresponding to the picture.
The CLIP model is a general pre-trained model obtained by contrastive learning on massive image-text pairs, and can achieve fine-grained encoding of pictures and deep representations containing semantics. After the picture information of the picture-description question is input into the image-text generation CLIPCap model, a model essay key point expression library containing k results {F_1, F_2, ..., F_k} can be obtained through multiple rounds of decoding, e.g. F_1 = "The boy is playing chess with the old man", F_2 = "The old man is sitting on a chair", F_3 = "The old woman is standing by them". It should be noted that the image-text generation model in the general-semantics image-text generation module, such as the image-text generation CLIPCap model, is also a model obtained by pre-training.
After the plurality of model essay key points are obtained, the global key point representation of the examinee's answer is determined according to the plurality of model essay key points and the text sequence. Determining the global key point representation of the examinee's answer involves the key point level semantic correctness module and the key point importance module.
Specifically, key point extraction is performed on each model essay key point together with the text sequence, to determine the examinee answering key point corresponding to each model essay key point; then key-point-level attention semantic extraction is performed on each model essay key point and its corresponding examinee answering key point, to obtain the global key point representation of the examinee's answer. In this embodiment, after the examinee answering key point corresponding to each model essay key point is extracted, key-point-level attention semantic extraction is performed to attend to semantic information at the key point level. Because semantic information at the key point level is attended to, this resolves the situation where the machine score would otherwise be too high when the examinee's answer matches the question or content keywords but the key points are wrong / the key point similarity is low.
As shown in fig. 4, the step of determining a global gist representation of the answer of the test taker according to the plurality of normative gist and the text sequence includes the following steps 201 to 203.
And 201, performing key point extraction on each model essay key point of the plurality of model essay key points together with the text sequence, to determine the examinee answering key point corresponding to each model essay key point.
Specifically, a plurality of spliced text sequences are obtained, and each spliced text sequence is input into the key point extraction model in the key point level semantic correctness module, to obtain the examinee answering key point corresponding to each model essay key point and a first similarity (also called key point similarity) between each model essay key point and the examinee answering key point. The first similarity plays a role in training the spoken language scoring model, as described below.
FIG. 5 is a schematic diagram of the training process of the key point extraction model. The key point extraction model may include a multi-layer Transformer structure, such as a 6-layer Transformer structure, or may be another structure that achieves the same function.
Each spliced text sequence includes one model essay key point and the text sequence of the examinee's answer full text. For example, the picture-description question of the spoken language test corresponds to 3 model essay key points, plus the text sequence of the examinee's answer full text. One of the spliced text sequences may be "<cls> The old woman is standing by them <sep> Tom is playing chess with his grandpa. The grandpa is sitting on a chair. The grandma is standing nearby.", where <cls> is the start identifier of the model essay key point and <sep> is the start identifier of the text sequence of the examinee's answer full text.
Each spliced text sequence is input into the key point extraction model, which performs key point extraction on the spliced text sequence to mark the start position Pos_start, such as a start word position, and the end position Pos_end, such as an end word position, of each model essay key point within the examinee's answer text sequence, and to give the key point similarity keypoint_simi between the model essay key point and the extracted examinee answering key point, keypoint_simi ∈ [0, 1]. The key point extraction model takes as input each model essay key point together with the text sequence of the examinee's answer full text, and produces three outputs: the key point similarity Keypoint_Sim, and the start position and end position of the extracted examinee answering key point.
As shown in fig. 5, since there are three spliced text sequences, the output of the key point extraction model is, for each of the three spliced text sequences, the key point similarity and the start and end positions of the extracted examinee answering key point, for example: Keypoint_Sim = 1, Pos_start = 0, Pos_end = 6; Keypoint_Sim = 1, Pos_start = 8, Pos_end = 14; Keypoint_Sim = 0.5, Pos_start = 16, Pos_end = 20.
The text between the start position and the end position of each model essay key point in the examinee's answer text sequence is taken as the examinee answering key point corresponding to that model essay key point. For example, for model essay key point 1, Pos_start = 0 and Pos_end = 6, and the corresponding examinee answering key point text is "Tom is playing chess with his grandpa".
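Turning the extraction model's outputs (Pos_start, Pos_end, key point similarity) into examinee answering key points might look like the following sketch. Treating the positions as inclusive word indices, and using a similarity threshold to return an empty key point, are our assumptions for illustration.

```python
# Hypothetical post-processing of the key point extraction model's outputs.

def extract_answer_gist(answer_words, pos_start, pos_end, similarity, threshold=0.2):
    """Return the examinee answering key point for one model essay key point,
    or "" when the predicted key point similarity is below a (hypothetical)
    threshold, as for P_3 in the text's example."""
    if similarity < threshold:
        return ""
    return " ".join(answer_words[pos_start:pos_end + 1])

words = "Tom is playing chess with his grandpa . The grandpa is sitting on a chair .".split()
p1 = extract_answer_gist(words, 0, 6, 1.0)    # words 0..6 of the answer
```

Here `p1` recovers the span "Tom is playing chess with his grandpa", matching the Pos_start = 0, Pos_end = 6 example above.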
The examinee answering key points corresponding to the model essay key points are denoted {P_1, P_2, ..., P_k}. It should be noted that some model essay key points have no corresponding examinee answering key point.
Fig. 6 is a schematic diagram of the usage flow of the key point extraction model provided in the embodiment of the present application. The model essay key points and the examinee's answer text sequence are input into the key point extraction model, which outputs, for each model essay key point F_i, the key point similarity between F_i and the examinee answering key point, together with the examinee answering key point P_i matched to F_i, i ∈ [1, k]. For example, the text sequence of the examinee's answer full text is "He is playing chess. His grandpa is happy." and corresponds to 3 model essay key points, so k = 3, and 3 examinee answering key points and the key point similarity between each model essay key point and its examinee answering key point can be obtained: P_1 = "He is playing chess", with key point similarity 0.8; P_2 = "His grandpa is happy", with key point similarity 0.4; P_3 = "", i.e. the similarity corresponding to P_3 is 0.1.
The key point extraction model is obtained by pre-training in advance, as shown in fig. 5. For each picture-description question, model essay key point data corresponding to the question and the answer full-text text sequences of a plurality of different examinees are prepared in advance, and the key point extraction model is then trained. From a large amount of historical picture-description question data, a universal key point extraction model can be obtained by pre-training.
Therefore, although the spoken language scoring model includes the image-text generation model of the general-semantics image-text generation module and the key point extraction model, both are obtained by pre-training; the two models do not participate in the updating of network parameters during the training of the spoken language scoring model and are used directly in the training process.
And 202, performing semantic extraction processing at the key point level on each model essay key point and the corresponding test taker response key point to obtain full-text key point semantic representation of the test taker response and hidden-layer semantic representation of the context similarity between each model essay key point and the corresponding test taker response key point.
This step involves the key point matching part of the key point level semantic correctness module. The key point matching part involves a key point matching pre-trained model, such as the key point matching BERT model shown in FIG. 2.
Specifically, text splicing may be performed on each model essay key point and its corresponding examinee answering key point to obtain a key point pair text for each pair, where each key point pair text includes a start identifier. The global key point representation identifier and the plurality of key point pair texts are input into the key point matching pre-training model for key-point-level semantic extraction, obtaining the full-text key point semantic representation of the examinee's answer and the hidden layer semantic representations corresponding to the plurality of key point pair texts. The hidden layer semantic representation at the position corresponding to the model essay start identifier of each key point pair text is taken as the hidden layer semantic representation of the contextual similarity of that key point pair text.
Text splicing is performed on each model essay key point and its corresponding examinee answering key point; if there are 3 model essay key points, 3 key point pair texts {(F_1, P_1), (F_2, P_2), (F_3, P_3)} are obtained, where each key point pair text includes one model essay key point and its corresponding examinee answering key point. At least a start identifier is added to each key point pair text. For example, a model essay start identifier is added before the model essay key point of each key point pair text, and an examinee answering start identifier is added before the examinee answering key point of each key point pair text, to distinguish the model essay key point from the examinee answering key point.
Wherein a global key point representation identifier may be added before all the key point pair texts. For example, the global key point representation identifier and the 3 key point pairs may be represented as "[MASK] <cls> The boy is playing chess with the old man <sep> He is playing chess <cls> The old man is sitting on a chair <sep> His grandpa is happy <cls> The old woman is standing by them <sep>", where [MASK] is the global key point representation identifier, <cls> is the start identifier of the model essay key point in each key point pair text, and <sep> is the start identifier of the examinee answering key point in each key point pair text.
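Splicing the global key point representation identifier with the key point pair texts, as in the example above, can be sketched as follows. The marker spellings ([MASK], <cls>, <sep>) follow the text; the function itself is a hypothetical illustration.

```python
# Hypothetical builder for the key point matching model's input string:
# "[MASK] <cls> F_1 <sep> P_1 <cls> F_2 <sep> P_2 ..."

def build_gist_matching_input(pairs):
    """pairs: list of (model_essay_gist, examinee_answer_gist) strings."""
    parts = ["[MASK]"]
    for essay_gist, answer_gist in pairs:
        parts.append("<cls> " + essay_gist)   # model essay start identifier
        parts.append("<sep> " + answer_gist)  # examinee answering start identifier
    return " ".join(parts)

inp = build_gist_matching_input([
    ("The boy is playing chess with the old man", "He is playing chess"),
    ("The old man is sitting on a chair", "His grandpa is happy"),
    ("The old woman is standing by them", ""),  # no matching answer key point
])
```

Note that a model essay key point with no matched examinee answering key point still contributes a <cls>/<sep> pair, so the number of <cls> markers always equals k.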
The global key point representation identifier and the plurality of key point pair texts (with the model essay start identifiers and examinee answering start identifiers added) are input into the key point matching BERT model, which performs key-point-level semantic extraction to obtain a hidden layer representation of each input token, including the full-text key point semantic representation of the examinee's answer and the hidden layer semantic representations corresponding to the plurality of key point pair texts.
For example, the input "[MASK] <cls> The boy is playing chess with the old man <sep> He is playing chess <cls> The old man is sitting on a chair <sep> His grandpa is happy <cls> The old woman is standing by them <sep>" of the key point matching BERT model includes 37 input tokens, including words and symbols. After key-point-level semantic extraction by the key point matching BERT model, hidden layer semantic representations of the 37 input tokens are obtained, i.e. each input token corresponds to one hidden layer semantic representation; if each hidden layer semantic representation is 512-dimensional, the key point matching BERT model outputs hidden layer semantic representations of dimension 37 × 512.
The hidden layer semantic representation at the position corresponding to the start identifier of each key point pair text, such as the model essay start identifier <cls>, is taken as the hidden layer semantic representation of the contextual similarity of that key point pair text. That is, at the output of the key point matching BERT model, only the hidden layer semantic representation h_mask at the position corresponding to the global key point representation identifier, and the hidden layer semantic representations h_cls_1, h_cls_2, ..., h_cls_k at the positions corresponding to the model essay start identifier <cls> of each key point pair text, need to be extracted.
The hidden layer semantic representations at the 4 positions [MASK] <cls> <cls> <cls> correspond to dimension 4 × 512. From front to back, [MASK] <cls> <cls> <cls> respectively denote the global key point representation identifier and the start identifiers of the first, second and third model essay key points; in fig. 2, numbers are appended after cls to distinguish them, for ease of understanding.
The hidden layer semantic representations at the 4 positions [MASK] <cls> <cls> <cls> are, respectively, the full-text key point semantic representation of the examinee's answer, and the hidden layer semantic representations of the contextual similarity of the first, second and third key point pair texts. The extracted hidden layer semantic representations include context information, such as similarity/matching information of contextual key point information.
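Keeping only the hidden vectors at the [MASK] and <cls> positions out of the full per-token output can be sketched as follows; the token spellings and the toy 1-dimensional vectors are illustrative assumptions.

```python
# Hypothetical selection of h_mask and h_cls_1..h_cls_k from the per-token
# hidden states of the key point matching model.

def pick_marker_states(tokens, hidden_states):
    """tokens: list of input tokens; hidden_states: one vector per token.
    Returns (h_mask, [h_cls_1, ..., h_cls_k])."""
    h_mask, h_cls = None, []
    for tok, vec in zip(tokens, hidden_states):
        if tok == "[MASK]":
            h_mask = vec
        elif tok == "<cls>":
            h_cls.append(vec)
    return h_mask, h_cls

toks = ["[MASK]", "<cls>", "boy", "<sep>", "he", "<cls>", "man", "<sep>"]
states = [[float(i)] for i in range(len(toks))]  # toy 1-dim vectors
h_mask, h_cls_list = pick_marker_states(toks, states)
# h_mask is the vector at position 0; h_cls_list holds positions 1 and 5
```

With 512-dimensional states and k = 3 pairs, this yields exactly the 4 × 512 representations ([MASK] plus three <cls> positions) described above.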
Examining the key point matching BERT model from its input to the finally extracted output, the following can be seen:
a. The input of the key point matching BERT model includes the global key point representation identifier and a plurality of key point pair texts, such as word sequences. Note that the input includes multiple key point pair texts rather than a single one, in preparation for extracting context information such as contextual key point information; with a single key point pair text there would be no context information.
b. The model is specified to be a key point matching pre-training model, such as a key point matching BERT model, so that context information in the input can be extracted. The contextual similarity thus includes not only the similarity/matching-degree information between the model essay text and the examinee's answer text within each key point pair text, but also contextual key point information to assist in determining the similarity/matching-degree information. Because the context texts are correlated, using contextual key point information to assist in determining similarity/matching-degree information can improve accuracy.
c. The input of the key point matching BERT model contains a lot of information, e.g. the 37 tokens of the key point pair texts such as word sequences, while after extraction only 4 corresponding hidden layer semantic representations are kept. The other information is merged into these 4 hidden layer semantic representations, which achieves a degree of dimension reduction and provides a basis for subsequent scoring; retaining all the information would be inconvenient for the final scoring.
d. The hidden layer semantic representation at the position corresponding to the start identifier of each key point pair text, such as the model essay start identifier, can serve as the final hidden layer semantic representation of the contextual similarity of that key point pair text because, during training, the hidden layer semantic representation at that position is processed by a linear connection layer/linear mapping layer and the loss value is calculated from the processing result and the first similarity from the key point extraction model, as described in detail in the training section below.
And 203, performing attention mechanism processing on the full-text main point semantic representation and the hidden layer semantic representation to obtain a global main point representation after the attention weight weighting of the hidden layer semantic representation.
The outputs extracted from the key point matching BERT model, i.e. the full-text key point semantic representation h_mask and the hidden layer semantic representations h_cls_1, h_cls_2, ..., h_cls_k of the contextual similarity of each key point pair text, are processed by an attention mechanism to model the importance information of different key points, i.e. the importance of different key points, thereby resolving the situation where the machine score would otherwise be too low when the examinee's answer is brief but describes the key points completely.
This part involves the key point importance module of the spoken language scoring model, which includes a key point importance model. The full-text key point semantic representation and the hidden layer semantic representations are spliced to obtain a spliced semantic representation; the spliced semantic representation is input into the key point importance model for attention weight processing, to obtain attention weights corresponding to the full-text key point semantic representation and the plurality of hidden layer semantic representations; weighting is then performed according to the attention weights and the corresponding hidden layer semantic representations, to determine the global key point representation weighted by the attention weights of the hidden layer semantic representations.
As in FIG. 2, the outputs h_mask, h_cls_1, h_cls_2, ..., h_cls_k extracted from the key point matching BERT model together serve as the input of the key point importance model, so the key point importance model currently has an input of dimension 4 × 512.
The key point importance model is a self-attention module; the attention weight formula of the self-attention module is shown in formula (1).

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q = FC_1(H), K = FC_2(H), V = FC_3(H); d_k is the dimension of the key point representation, i.e. the dimension of the hidden layer semantic representation, e.g. d_k = 512; each FC_i is a linear mapping layer/linear connection layer; and H = {h_mask, h_cls_1, h_cls_2, ..., h_cls_k}. Dividing by √d_k prevents the inner product from becoming too large, avoiding a peaked softmax result in which one or a few values are far larger than the others.
The importance of different key points is represented by the attention weights calculated by the softmax() function in formula (1). The output of the softmax() function is k + 1 scores: the 1st score represents the importance of h_mask itself, and the following k scores correspond to the hidden layer semantic representations h_cls_i of the k key points; these are the attention weights corresponding to the full-text key point semantic representation and the plurality of hidden layer semantic representations, respectively. The importance of a specific key point is unrelated to the key point similarity or the order of the key points; it is learned from the current training data set.
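A toy numeric sketch of the attention weights in formula (1) follows. For clarity the linear maps FC_i are taken as the identity, which is an assumption (in the model they are learned), so the weights for one query reduce to softmax(q · k_j / √d_k).

```python
# Toy scaled-dot-product attention weights for one query vector, as in
# formula (1), with identity projections assumed for illustration.
import math

def attention_weights(query, keys):
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# h_mask, h_cls_1, h_cls_2 with toy dimension d_k = 2 (instead of 512).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(H[0], H)     # k + 1 = 3 importance scores
# weights sum to 1; keys more similar to the query get larger weight
```

In the model, the row of weights computed from the h_mask query is what weights the h_cls_i representations into the global key point representation.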
After the attention weights corresponding to the full-text key point semantic representation and the hidden layer semantic representations are obtained, weighting is performed according to the attention weights and the corresponding hidden layer semantic representations, determining the global key point representation weighted by the attention weights of the hidden layer semantic representations, as well as the plurality of weighted hidden layer semantic representations.
Wherein, the input of the key point importance model is {h_mask, h_cls_1, h_cls_2, ..., h_cls_k} and the output is {h'_mask, h'_cls_1, ..., h'_cls_k}; the output dimensions and sizes coincide with the input, e.g. both 4 × 512. The hidden layer semantic representations h_cls_i of the k key point pairs are weighted by their corresponding attention weights to obtain the weighted hidden layer semantic representations h'_cls_i, and the global key point representation is determined from the k weighted hidden layer semantic representations, for example by adding them; here the global key point representation is h_KPcls = h'_mask.
The output of the key point importance module is the global key point representation h'_mask as the final output, namely 1 × 512; the k weighted hidden layer semantic representations h'_cls_i are discarded.
From the input to the final output in the point importance module, the following two points can be seen:
a. The input of the key point importance module includes the full-text key point semantic representation and a plurality of hidden layer semantic representations, i.e. multidimensional information such as 4 × 512, while only one output is taken, namely the 1 × 512 global key point representation; this is equivalent to using the key point importance module for dimension reduction.
b. The final output of the key point importance module is the global key point representation, obtained by weighting the hidden layer semantic representation of the contextual similarity between each model essay key point and examinee answering key point with the corresponding attention weight. The global key point representation fuses the importance information of the different key points in the examinee's answer text, resolving the situation where the machine score would otherwise be too low when the examinee's answer is brief but describes the key points completely.
To this end, a full-text speech level representation representing pronunciation accuracy and fluency, a full-text language level representation representing language level and content, and a global gist representation representing importance of the gist have been obtained, i.e., representations at a plurality of different perspectives of a full-text answer by test takers have been obtained.
And 105, scoring the answer of the examinee according to the full-text speech level representation, the full-text language level representation and the global main point representation to obtain a score of the answer of the examinee.
The step involves a fusion module in the spoken language scoring model, wherein the fusion module comprises a fusion model.
The full-text speech level representation, the full-text language level representation and the global key point representation can be fused to obtain a fused hidden layer representation incorporating the key point representation; the fused hidden layer representation is input into the fusion model of the fusion module for nonlinear scoring, to obtain the score of the examinee's answer.
Wherein, the full-text speech level representation h_speech, the full-text language level representation h_lm and the global key point representation h_KPcls are spliced into the fused hidden layer representation h_concat = (h_speech, h_lm, h_KPcls), which is input into the fusion model for nonlinear scoring, so as to synthesize the examinee's performance in the current test from multiple different perspectives and obtain the score of the examinee's answer.
The formula of the fusion model is shown in the following formula (2).
Pred_fuse = sigmoid(W_1 · h_concat + b_1)    (2)
Wherein Pred_fuse, the final score of the examinee's answer, is a value in the interval [0, 1], and sigmoid denotes the nonlinear function used to perform the nonlinear scoring.
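A minimal sketch of formula (2) follows: the three representations are concatenated and mapped to a [0, 1] score with a sigmoid. W_1, b_1 and the tiny dimensions are illustrative stand-ins for the learned fusion-layer parameters.

```python
# Hypothetical fusion scoring per formula (2):
# Pred_fuse = sigmoid(W_1 * h_concat + b_1)
import math

def fuse_and_score(h_speech, h_lm, h_kpcls, w1, b1):
    h_concat = h_speech + h_lm + h_kpcls           # list concatenation
    z = sum(w * h for w, h in zip(w1, h_concat)) + b1
    return 1.0 / (1.0 + math.exp(-z))              # sigmoid -> (0, 1)

# Toy low-dimensional representations instead of the real 512-dim ones.
score = fuse_and_score([0.2, 0.1], [0.4], [0.3],
                       w1=[1.0, 1.0, 1.0, 1.0], b1=0.0)
# here z = 1.0, so score = sigmoid(1.0) ≈ 0.731
```

In training, W_1 and b_1 would be learned so that the sigmoid output matches the (normalized) label score of the examinee's answer.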
According to this embodiment, on the basis of the full-text speech level representation and the full-text language level representation, the global key point representation is added, so the scoring method fuses representations from multiple different perspectives, improving accuracy; the scoring can attend both to key-point-level semantic correctness and to the importance of different key points, improving the stability and accuracy of machine scoring in spoken language tests.
Fig. 7 is a schematic flowchart of a spoken language scoring model training method according to an embodiment of the present application, the method is mainly used for training a spoken language scoring model, and the spoken language scoring model can be applied to an answer scoring method according to any of the embodiments described above, and the method includes the following steps.
301, a training data set and an initial spoken language scoring model are obtained, wherein the training data set comprises a plurality of training samples in a spoken language examination, and each training sample comprises training speech answered by each examinee, label scoring answered by the examinee, a training text sequence of a full text answered by the examinee and a training phoneme sequence of the full text answered by the examinee, which are determined according to the training speech.
The label score may be a relatively accurate score of the predetermined answer of the examinee, for example, the answer of the examinee may be manually scored to obtain the label score.
The training speech, training text sequence and training phoneme sequence differ from the speech, text sequence and phoneme sequence described above only in that the word "training" is prefixed; their essential meanings are the same. The same applies to a number of terms below, and the description will not be repeated.
The initial spoken language scoring model is the model whose network parameters need to be updated; the modules it includes are consistent with the modules included in the trained spoken language scoring model.
302, the training speech and the training phoneme sequence are input into the speech level modeling module of the initial spoken language scoring model for encoding and decoding to determine the training full-text speech level representation of the examinee's answer, and the training text sequence is input into the language level modeling module of the initial spoken language scoring model for language extraction to determine the training full-text language level representation of the examinee's answer.
Specifically, the training speech is divided into a plurality of training speech segments, and the training phoneme segments corresponding to these segments are determined. Each training speech segment and its corresponding training phoneme segment are input into the speech ED model in the speech level modeling module for encoding and decoding to obtain the training hidden layer representation of that segment. The training hidden layer representations of all training speech segments are then processed at the phoneme level to obtain the training full-text speech level representation of the examinee's answer, which includes the pronunciation accuracy and fluency information of the answer.
The training text sequence can be input into the language BERT model in the language level modeling module for language extraction processing to obtain the training full-text language level representation of the examinee's answer, which includes the language level and content information of the answer.
303, a plurality of training model essay key points of the spoken language examination are acquired, and the plurality of training model essay key points and the training text sequence are input into the key point processing module of the initial spoken language scoring model for key point semantic processing to determine the training global key point representation of the examinee's answer.
Here, the step of acquiring a plurality of training model essay key points of the spoken language examination includes: acquiring a training examination picture of the picture-description question corresponding to the spoken language examination; performing picture encoding processing on the training examination picture to obtain a training picture encoding result, for example by inputting the training examination picture into the image-text generation model in the general semantic image-text generation module and encoding it with the picture encoding module of that model; and generating a plurality of training model essay key points of the training examination picture from the training picture encoding result, for example by decoding the result with the picture decoding module of the image-text generation model.
The step of inputting the plurality of training model essay key points and the training text sequence into the key point processing module of the initial spoken language scoring model for key point semantic processing to determine the training global key point representation of the examinee's answer includes: using the key point extraction model in the key point processing module to perform key point extraction on each training model essay key point and the training text sequence, so as to determine the examinee answer training key point corresponding to each training model essay key point and the first similarity between them; and using the key point matching pre-training model and the key point importance model in the key point processing module to perform key point level attention semantic extraction on each training model essay key point and its corresponding examinee answer training key point, so as to obtain the training global key point representation of the examinee's answer.
Specifically, the plurality of training model essay key points and the corresponding examinee answer training key points can be input into the key point matching pre-training model of the key point level semantic error module for key point level semantic extraction, yielding the training full-text key point semantic representation of the examinee's answer and, for each training model essay key point and its corresponding examinee answer training key point, a training hidden layer semantic representation of their contextual similarity. The training full-text key point semantic representation and the training hidden layer semantic representations are then input into the key point importance model of the key point importance module for attention mechanism processing, yielding the training global key point representation weighted by the attention weights of the training hidden layer semantic representations.
The above step of using the key point extraction model in the key point processing module to determine the examinee answer training key point corresponding to each training model essay key point and the first similarity between them includes: splicing each of the training model essay key points with the training text sequence to obtain a plurality of training spliced text sequences; and inputting each training spliced text sequence into the key point extraction model for key point extraction, so as to obtain the examinee answer training key point corresponding to each training model essay key point and the first similarity between the examinee answer training key point and the training model essay key point.
The above step of inputting each training model essay key point and its corresponding examinee answer training key point into the key point matching pre-training model of the key point level semantic error module for key point level semantic extraction includes: performing text splicing on each training model essay key point and its corresponding examinee answer training key point to obtain a training key point pair text, each of which contains a start identifier; inputting the global key point representation identifier and the plurality of training key point pair texts into the key point matching pre-training model for key point level semantic extraction, so as to obtain the training full-text key point semantic representation of the examinee's answer and the training hidden layer semantic representation corresponding to each training key point pair text; and extracting the training hidden layer semantic representation at the position of each training key point pair text's start identifier as the training hidden layer semantic representation of the contextual similarity of that training key point pair.
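The splicing described above can be sketched as follows. The identifier tokens ("<g>" for the global key point representation and "<cls>" as the start identifier of each pair text) are illustrative assumptions, not necessarily the exact tokens used by the model described here.

```python
def build_pair_texts(model_points, answer_points, global_id="<g>", cls="<cls>"):
    """Splice each training model essay key point with its corresponding
    examinee answer training key point; each pair text begins with a start
    identifier, and the global identifier precedes all pair texts."""
    pairs = [f"{cls} {m} | {a}" for m, a in zip(model_points, answer_points)]
    return [global_id] + pairs

texts = build_pair_texts(["key point 1", "key point 2"],
                         ["answer point 1", "answer point 2"])
```

Reading the encoder's hidden state at each "<cls>" position then gives one similarity representation per key point pair, as the step above describes.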
The above step of inputting the training full-text key point semantic representation and the training hidden layer semantic representations into the key point importance model of the key point importance module for attention mechanism processing includes: splicing the training full-text key point semantic representation with the training hidden layer semantic representations to obtain a training spliced semantic representation; inputting the training spliced semantic representation into the key point importance model for attention weight processing, so as to obtain the attention weights corresponding to the training full-text key point semantic representation and the plurality of training hidden layer semantic representations; and weighting each training hidden layer semantic representation by its attention weight to determine the training global key point representation weighted by the attention weights of the training hidden layer semantic representations.
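A minimal sketch of the attention weighting step above. The scoring vector w is a hypothetical learned parameter, and dot-product scoring with a softmax is one common choice of attention function, assumed here for illustration rather than taken from the patent.

```python
import math

def attention_pool(h_fulltext, h_pairs, w):
    """Score the spliced representations with a learned vector, softmax the
    scores into attention weights, and return the weighted sum of the
    representations as the global key point vector."""
    reps = [h_fulltext] + h_pairs
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in reps]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # attention weights, sum to 1
    dim = len(h_fulltext)
    return [sum(weights[i] * reps[i][d] for i in range(len(reps)))
            for d in range(dim)]

pooled = attention_pool([1.0, 0.0], [[0.0, 1.0]], w=[1.0, 1.0])
```

With equal scores the two toy vectors receive equal weight, so the pooled result is their average.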
304, the training full-text speech level representation, the training full-text language level representation and the training global key point representation are input into the fusion module of the initial spoken language scoring model for scoring to obtain the training score of the examinee's answer.
Specifically, the training full-text speech level representation, the training full-text language level representation and the training global key point representation are fused to obtain a training fused hidden layer representation incorporating the key point representation, and the training fused hidden layer representation is input into the fusion model in the fusion module for nonlinear scoring to obtain the training score of the examinee's answer.
305, the initial spoken language scoring model is updated according to the training score and the label score to obtain the spoken language scoring model.
Specifically, a loss value is calculated from the training score and the label score, and the network parameters of the initial spoken language scoring model are updated with the loss value until a training stop condition is met, for example when the loss value converges or the number of training rounds reaches a preset number; training then stops and the spoken language scoring model is obtained.
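The stopping rule above can be sketched as follows, where `run_epoch` is a hypothetical callable that performs one round of parameter updates and returns that round's loss; both the convergence tolerance and the round cap are illustrative values.

```python
def train_until_stop(run_epoch, max_rounds=200, tol=1e-4):
    """Update until the loss converges (change below tol) or the number of
    training rounds reaches the preset number, whichever comes first."""
    prev_loss = float("inf")
    for rnd in range(1, max_rounds + 1):
        loss = run_epoch(rnd)
        if abs(prev_loss - loss) < tol:  # loss value has converged
            return rnd, loss
        prev_loss = loss
    return max_rounds, loss              # preset round number reached

# A fake training step with loss 1/round, to exercise the convergence branch:
rounds, final_loss = train_until_stop(lambda r: 1.0 / r)
```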
The training of the spoken language scoring model in the embodiment of the present application is end-to-end. During training, the network parameters that can be optimized include those of the speech level modeling module, the language level modeling module, the key point matching pre-training model in the key point level semantic error module, the key point importance module and the fusion module, that is, the parts shown with solid lines in fig. 2. When network parameters are updated, only the parameters of the solid-line parts are updated; the models in the dashed-line parts are pre-trained and their network parameters need not be updated.
When training the spoken language scoring model, the training full-text speech level representation, which covers pronunciation accuracy and fluency, and the training full-text language level representation, which covers language level and content, are both considered, and the training global key point representation is further included. The scoring method of the spoken language scoring model thus fuses representations from a plurality of different viewing angles, improving accuracy; it can also attend to semantic errors at the key point level and to the importance of different key points, improving the stability and accuracy of machine scoring in spoken language examinations.
In an embodiment, as shown in fig. 8, another flowchart of the spoken language scoring model training method is provided. The method is mainly used for training to obtain the spoken language scoring model, which can be applied to the answer scoring method of any of the above embodiments, and includes the following steps.
401, a training data set and an initial spoken language scoring model are obtained, where the training data set includes a plurality of training samples from a spoken language examination, and each training sample includes the training speech of one examinee's answer, a label score of the examinee's answer, and a training text sequence and a training phoneme sequence of the full text of the examinee's answer, both determined from the training speech.
402, the training speech and the training phoneme sequence are input into the speech level modeling module of the initial spoken language scoring model for encoding and decoding to determine the training full-text speech level representation of the examinee's answer, and the training text sequence is input into the language level modeling module of the initial spoken language scoring model for language extraction to determine the training full-text language level representation of the examinee's answer.
403, a plurality of training model essay key points of the spoken language examination are acquired, and the key point extraction model of the initial spoken language scoring model is used to perform key point extraction on each training model essay key point and the training text sequence, so as to determine the examinee answer training key point corresponding to each training model essay key point and the first similarity between the examinee answer training key point and the training model essay key point.
404, the plurality of training model essay key points and the corresponding examinee answer training key points are input into the key point matching pre-training model of the initial spoken language scoring model for key point level semantic extraction, yielding the training full-text key point semantic representation of the examinee's answer and the training hidden layer semantic representation of the contextual similarity between each training model essay key point and its corresponding examinee answer training key point; a linear connection layer is then used to perform linear mapping on the training hidden layer semantic representations to obtain the second similarity corresponding to each training hidden layer semantic representation.
The following description continues the example above. The output of the key point matching pre-training model has dimension 4 × 512, comprising the training full-text key point semantic representation of the examinee's answer (1 × 512) and the training hidden layer semantic representations of the contextual similarity between each of the three training model essay key points and its corresponding examinee answer training key point (3 × (1 × 512)).
During training, the key point matching pre-training model additionally includes a newly added linear connection layer/linear mapping layer, such as a fully connected layer or a softmax layer. The training full-text key point semantic representation and the training hidden layer semantic representations output above are input into this layer, and the linear mapping yields a 4 × 1-dimensional output result; the output at each training hidden layer semantic representation's position serves as the second similarity corresponding to that representation. The second similarity thus comprises three values, one for each training hidden layer semantic representation. During training, each second similarity is trained with the corresponding first similarity as its target.
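The linear mapping can be sketched as follows, with hypothetical small dimensions (four hidden vectors of dimension 2 instead of 4 × 512); the shared weight vector and bias stand in for the newly added linear connection layer's parameters.

```python
def second_similarities(hidden_reps, w, b):
    """Map each hidden representation to a scalar with one shared linear
    layer (the 4 x 1 output); the values at the three pair positions are
    the second similarities, the first position being the full-text
    key point semantic representation."""
    outputs = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in hidden_reps]
    return outputs[1:]  # drop the full-text position, keep the pair positions

sims = second_similarities(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],  # 4 x 2 toy input
    w=[0.2, 0.4], b=0.0)
```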
405, a second loss value is determined based on the first similarity and the second similarity.
In the training process, an optimization loss function is set for the key point matching pre-training model, as shown in formula (3).

L_keypoint = (1/k) Σ_{i=1}^{k} (FC_kp(h_i^cls) − sim_i)² (3)

where FC_kp is the newly added linear connection layer/linear mapping layer, k is the number of training model essay key points of the question, sim_i denotes the first similarity between the ith training model essay key point and the extracted corresponding examinee answer training key point, h_i^cls denotes the training hidden layer semantic representation corresponding to the ith training model essay key point, and L_keypoint denotes the determined second loss value.
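Under the assumption that formula (3) is a mean-squared error between the mapped second similarities and their first-similarity targets (consistent with the statement that each second similarity is trained toward the corresponding first similarity), it can be sketched as:

```python
def keypoint_loss(second_sims, first_sims):
    """L_keypoint: squared error between each linearly mapped second
    similarity and its first-similarity target, averaged over the k
    training model essay key points of the question."""
    k = len(first_sims)
    return sum((s2 - s1) ** 2 for s2, s1 in zip(second_sims, first_sims)) / k

loss = keypoint_loss([0.8, 0.2, 0.5], [1.0, 0.0, 0.5])
```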
It should be noted that this loss function of the key point matching pre-training model is not straightforward to understand. The question is how to ensure that the hidden layer output of the key point matching pre-training model at the position corresponding to <cls> contains the similarity information (a hidden layer representation of the similarity) of each training key point pair (a training model essay key point and its corresponding examinee answer training key point), rather than other information such as keyword information. This is achieved by means of the loss function, which drives the hidden layer output at the <cls> position to learn the similarity information.
In addition, the key point matching pre-training model is used instead of directly calculating the similarity between each training key point pair for two reasons. On the one hand, the network parameters of the speech level modeling module, the language level modeling module, the key point matching pre-training model in the key point level semantic error module, the key point importance module and the fusion module in the spoken language scoring model are optimized end to end; if the similarity between each training key point pair were calculated directly, it would only be a specific numerical value with nothing to optimize, and no gradient would flow back during training. On the other hand, with the key point matching pre-training model, the result draws on context information in addition to the similarity information of each training key point pair, using the contextual key point information to assist in evaluating the similarity, so that the similarity also incorporates context such as the contextual key point information.
Because context information is utilized, the input of the key point matching pre-training model comprises a plurality of different training key point pair texts, for example training model essay key point 1 with examinee answer training key point 1, training model essay key point 2 with examinee answer training key point 2, and training model essay key point 3 with examinee answer training key point 3. Otherwise, the training hidden layer semantic representation of a training key point pair could be obtained directly from training model essay key point 1 and examinee answer training key point 1 alone, without inputting the several different training key point pair texts simultaneously.
This part should be understood in conjunction with the corresponding description of the part that uses the spoken language scoring model, and is not repeated here.
406, the training full-text key point semantic representation and the training hidden layer semantic representations are input into the key point importance model of the key point importance module for attention mechanism processing to obtain the training global key point representation weighted by the attention weights of the training hidden layer semantic representations.
407, the training full-text speech level representation, the training full-text language level representation and the training global key point representation are input into the fusion module of the initial spoken language scoring model for scoring to obtain the training score of the examinee's answer.
408, a first loss value is determined according to the training score and the label score, and the initial spoken language scoring model is updated according to the first loss value and the second loss value to obtain the spoken language scoring model.
The optimization loss function of the fusion module can be shown as formula (4).
L_scoring = (1/n) Σ_{j=1}^{n} (Y_j − Pred_fuse,j)² (4)

where n is the number of examinee answers in all training samples, Y_j is the label score of each training sample, Pred_fuse,j denotes the training score obtained for each training sample, and L_scoring denotes the first loss value.
After the first loss value and the second loss value are obtained, the initial spoken language scoring model is updated according to both. Specifically, a first coefficient and a second coefficient are determined, and the first loss value and the second loss value are weighted and summed based on the first coefficient and the second coefficient respectively to obtain an overall loss value; the initial spoken language scoring model is then updated according to the overall loss value. The sum of the first coefficient and the second coefficient is 1, and the first coefficient is greater than the second coefficient.
The training loss function of the whole spoken language scoring model is shown in formula (5).
L_tot = (1 − α)L_scoring + αL_keypoint (5)
Here 1 − α is the first coefficient and α is the second coefficient, with 1 − α > α. In practice, α may be 0.1.
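The weighting of formula (5) can be sketched directly, with the illustrative α = 0.1 mentioned above:

```python
def total_loss(l_scoring, l_keypoint, alpha=0.1):
    """L_tot = (1 - alpha) * L_scoring + alpha * L_keypoint, with the first
    coefficient (1 - alpha) larger than the second (alpha)."""
    assert 0.0 < alpha < 0.5  # guarantees 1 - alpha > alpha
    return (1.0 - alpha) * l_scoring + alpha * l_keypoint

l_tot = total_loss(0.2, 0.5)  # toy loss values for illustration
```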
The purpose of requiring the first coefficient to be greater than the second coefficient is to retain the property of the key point matching pre-training model, namely using context information in the similarity information (hidden layer representation) of each key point pair, while keeping the final spoken language score as the most important objective; hence the first coefficient is larger than the second coefficient.
According to the method for training the spoken language scoring model in the embodiment, the obtained spoken language scoring model can improve the stability and accuracy of machine scoring in the spoken language test.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
In order to better implement the answer scoring method of the embodiment of the present application, an answer scoring device is further provided in the embodiment of the present application. Please refer to fig. 9, fig. 9 is a schematic structural diagram of an answer scoring device according to an embodiment of the present disclosure. The answer scoring device 500 may include a first acquisition unit 501, a voice presentation unit 502, a language presentation unit 503, a second acquisition unit 504, a main point presentation unit 505, and a scoring unit 506.
The first obtaining unit 501 is configured to obtain a voice of an examinee answering in a spoken language test, and a text sequence of a full text of the examinee answering and a phoneme sequence of the full text of the examinee answering, which are determined according to the voice.
A speech representation unit 502, configured to determine a full-text speech level representation of the answer of the examinee according to the speech and the phoneme sequence.
A language representation unit 503, configured to determine a full-text language level representation of the answer of the examinee according to the text sequence.
A second obtaining unit 504, configured to obtain a plurality of paradigm principal points of the spoken language test.
A gist representation unit 505 for determining a global gist representation of the answer of the test taker according to the plurality of paradigm gist and the text sequence.
And the scoring unit 506 is used for scoring the examinee response according to the full-text speech level representation, the full-text language level representation and the global main point representation so as to obtain a score of the examinee response.
In order to better implement the spoken language scoring model training method of the embodiment of the application, the embodiment of the application also provides a spoken language scoring model training device. Please refer to fig. 10, which is a schematic structural diagram of a spoken language scoring model training device according to an embodiment of the present disclosure. The spoken language scoring model training device 600 may include a first training acquisition unit 601, a training speech representation unit 602, a training language representation unit 603, a second training acquisition unit 604, a training key point representation unit 605, a training scoring unit 606, and an updating unit 607.
The first training obtaining unit 601 is configured to obtain a training data set and an initial spoken language scoring model, where the training data set includes a plurality of training samples in a spoken language examination, and each training sample includes training speech answered by each examinee, label scoring answered by the examinee, a training text sequence of a full text answered by the examinee determined according to the training speech, and a training phoneme sequence of the full text answered by the examinee.
A training speech representation unit 602, configured to input the training speech and the training phoneme sequence into the speech level modeling module of the initial spoken language scoring model for encoding and decoding, so as to determine the training full-text speech level representation of the examinee's answer.
And a training language representation unit 603, configured to input the training text sequence into a language level modeling module of the initial spoken language scoring model to perform language extraction processing, so as to determine a training full-text language level representation of the answer of the test taker.
A second training obtaining unit 604, configured to obtain a plurality of training paradigm principal points of the spoken language examination.
A training key point representation unit 605, configured to input the plurality of training model essay key points and the training text sequence into the key point processing module of the initial spoken language scoring model for key point semantic processing, so as to determine the training global key point representation of the examinee's answer.
A training scoring unit 606, configured to input the training full-text speech level representation, the training full-text language level representation and the training global key point representation into the fusion module of the initial spoken language scoring model for scoring, so as to obtain the training score of the examinee's answer.
An updating unit 607, configured to update the initial spoken language scoring model according to the training score and the tag score to obtain a spoken language scoring model.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and the beneficial effects that can be achieved are referred to in the foregoing, which are not described in detail herein.
Correspondingly, the embodiment of the application also provides a computer device, and the computer device can be a terminal or a server. As shown in fig. 11, fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer apparatus 700 includes a processor 701 having one or more processing cores, a memory 702 having one or more computer-readable storage media, and a computer program stored on the memory 702 and executable on the processor. The processor 701 is electrically connected to the memory 702.
The processor 701 is a control center of the computer apparatus 700, connects various parts of the entire computer apparatus 700 with various interfaces and lines, performs various functions of the computer apparatus 700 and processes data by running or loading software programs (computer programs) and/or modules stored in the memory 702, and calling data stored in the memory 702, thereby monitoring the computer apparatus 700 as a whole.
In this embodiment, the processor 701 in the computer device 700 loads instructions corresponding to one or more processes of the application program into the memory 702, and the processor 701 executes the application program stored in the memory 702, so as to implement the functions of any of the above-described method embodiments, including the steps in any of the above-described answer scoring methods and/or the steps in any of the above-described spoken language scoring model training methods, for example. Please refer to the above description.
For specific implementation and beneficial effects of each operation/step that can be performed by the processor, reference may be made to the foregoing method embodiments, and details are not described herein.
Optionally, as shown in fig. 11, the computer device 700 further includes: a touch display screen 703, a radio frequency circuit 704, an audio circuit 705, an input unit 706, and a power supply 707. The processor 701 is electrically connected to the touch display screen 703, the radio frequency circuit 704, the audio circuit 705, the input unit 706, and the power source 707. Those skilled in the art will appreciate that the computer device architecture illustrated in FIG. 11 is not intended to be limiting of computer devices and may include more or less components than those illustrated, or combinations of certain components, or different arrangements of components.
The touch display screen 703 may be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The touch display screen 703 may include a display panel and a touch panel. Among other things, the display panel may be used to display information input by or provided to a user as well as various graphical user interfaces of the computer device, which may be made up of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. The touch panel may be used to collect touch operations of a user on or near the touch panel (for example, operations of the user on or near the touch panel using any suitable object or accessory such as a finger or a stylus pen) and generate corresponding operation instructions that trigger the corresponding programs. The touch panel may cover the display panel; when the touch panel detects a touch operation on or near it, the touch panel transmits the operation to the processor 701 to determine the type of the touch event, and the processor 701 then provides a corresponding visual output on the display panel according to the type of the touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 703 to realize input and output functions. However, in some embodiments, the touch panel and the display panel can be implemented as two separate components to perform the input and output functions. That is, the touch display screen 703 can also be used as a part of the input unit 706 to implement an input function.
In the embodiment of the present application, the touch display screen 703 is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface.
The radio frequency circuit 704 may be configured to transmit and receive radio frequency signals so as to establish wireless communication with a network device or another computer device, and to exchange signals with that network device or computer device.
The audio circuit 705 may be used to provide an audio interface between the user and the computer device through a speaker and a microphone. On one hand, the audio circuit 705 may transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which the audio circuit 705 receives and converts into audio data. The audio data is then output to the processor 701 for processing, after which it may be transmitted to, for example, another computer device via the radio frequency circuit 704, or output to the memory 702 for further processing. The audio circuit 705 may also include an earphone jack to allow a peripheral headset to communicate with the computer device.
The input unit 706 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 707 is used to supply power to the various components of the computer device 700. Optionally, the power supply 707 may be logically connected to the processor 701 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The power supply 707 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
Although not shown in fig. 11, the computer device 700 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be completed by instructions, or by instructions controlling the associated hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the embodiments of the present application provide a computer-readable storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute the steps in any answer scoring method provided in the embodiments of the present application. For example, the computer program may perform the steps of any of the embodiments of answer scoring methods described above, and/or the steps of any of the embodiments of spoken language scoring model training methods described above.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.
Since the computer program stored in the storage medium can execute the steps in any answer scoring method provided in the embodiments of the present application, the beneficial effects that can be achieved by any answer scoring method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The answer scoring method, the answer scoring device, the storage medium, and the computer device provided by the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (17)

1. An answer scoring method, comprising:
acquiring the voice of an examinee's answer in a spoken language test, and determining, according to the voice, a text sequence of the examinee's full answer text and a phoneme sequence of the examinee's full answer text;
determining a full-text speech level representation of the examinee's answer according to the voice and the phoneme sequence, and determining a full-text language level representation of the examinee's answer according to the text sequence;
acquiring a plurality of model essay key points of the spoken language test, and determining a global key point representation of the examinee's answer according to the plurality of model essay key points and the text sequence;
and scoring the examinee's answer according to the full-text speech level representation, the full-text language level representation, and the global key point representation, to obtain the score of the examinee's answer.
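For illustration, the fusion-and-scoring step of claim 1 may be sketched as follows. The concatenation order, the linear weights `w` and bias `b`, and the sigmoid scorer are assumptions made for this example only; the claim does not fix the scorer's exact form.

```python
import numpy as np

def score_answer(speech_rep, language_rep, gist_rep, w, b=0.0):
    # Splice the full-text speech level, full-text language level, and
    # global key point representations into one fused vector, then map
    # the fused vector to a scalar score.
    fused = np.concatenate([speech_rep, language_rep, gist_rep])
    logit = float(w @ fused + b)
    return 1.0 / (1.0 + np.exp(-logit))   # squash to a (0, 1) score

rng = np.random.default_rng(0)
d = 4
speech_rep, language_rep, gist_rep = (rng.standard_normal(d) for _ in range(3))
w = rng.standard_normal(3 * d)
score = score_answer(speech_rep, language_rep, gist_rep, w)
```

In practice the score would be rescaled to the rubric of the spoken language test; the (0, 1) range here is only for the sketch.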
2. The method according to claim 1, wherein the step of determining a global key point representation of the examinee's answer according to the plurality of model essay key points and the text sequence comprises:
performing key point extraction processing on each of the model essay key points and the text sequence, to determine the examinee answering key point corresponding to each model essay key point;
and performing key-point-level attention semantic extraction processing on each model essay key point and the corresponding examinee answering key point, to obtain the global key point representation of the examinee's answer.
3. The method according to claim 2, wherein the step of performing key-point-level attention semantic extraction processing on each model essay key point and the corresponding examinee answering key point to obtain the global key point representation of the examinee's answer comprises:
performing key-point-level semantic extraction processing on each model essay key point and the corresponding examinee answering key point, to obtain a full-text key point semantic representation of the examinee's answer and hidden-layer semantic representations of the context similarity between each model essay key point and the corresponding examinee answering key point;
and performing attention mechanism processing on the full-text key point semantic representation and the hidden-layer semantic representations, to obtain a global key point representation weighted by the attention weights of the hidden-layer semantic representations.
4. The method according to claim 3, further comprising a spoken language scoring model, wherein the spoken language scoring model comprises a key-point-level semantic matching module, and the step of performing key-point-level semantic extraction processing on each model essay key point and the corresponding examinee answering key point to obtain the full-text key point semantic representation of the examinee's answer and the hidden-layer semantic representations comprises:
performing text splicing on each model essay key point and the corresponding examinee answering key point to obtain the text of each key point pair, wherein the text of each key point pair comprises a start identifier;
inputting a global key point representation identifier and the texts of the plurality of key point pairs into a key point matching pre-training model in the key-point-level semantic matching module for key-point-level semantic extraction processing, to obtain the full-text key point semantic representation of the examinee's answer and the hidden-layer semantic representation corresponding to the text of each key point pair;
and extracting, as the hidden-layer semantic representation of the context similarity of each key point pair, the hidden-layer semantic representation at the position corresponding to the start identifier of the text of that key point pair.
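As an illustration of claim 4, the splicing of key point pairs with a start identifier and the extraction of the hidden state at that identifier can be sketched as follows. The `[CLS]`/`[SEP]` markers and the toy token encoder are assumptions of the example that stand in for the key point matching pre-training model, which in practice would be a learned transformer encoder.

```python
import numpy as np

CLS, SEP = "[CLS]", "[SEP]"

def build_pair_texts(model_points, answer_points):
    # Splice each model essay key point with the matching examinee
    # answering key point; each pair text begins with a start identifier.
    return [f"{CLS} {mp} {SEP} {ap}" for mp, ap in zip(model_points, answer_points)]

def toy_encode(text, dim=8):
    # Stand-in encoder: one vector per token, seeded from the token's
    # bytes so the sketch stays deterministic and self-contained.
    return np.stack([np.random.default_rng(sum(tok.encode())).standard_normal(dim)
                     for tok in text.split()])

def cls_hidden_states(pair_texts):
    # Extract the hidden vector at the start identifier's position of each
    # pair text, used as that pair's context-similarity representation.
    return [toy_encode(t)[0] for t in pair_texts]

pairs = build_pair_texts(["greet the guest politely", "explain why you are late"],
                         ["good morning sir", "the bus broke down"])
states = cls_hidden_states(pairs)
```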
5. The method according to claim 3, further comprising a spoken language scoring model, wherein the spoken language scoring model comprises a key point importance module, and the step of performing attention mechanism processing on the full-text key point semantic representation and the hidden-layer semantic representations to obtain the global key point representation weighted by the attention weights of the hidden-layer semantic representations comprises:
splicing the full-text key point semantic representation and the hidden-layer semantic representations to obtain spliced semantic representations;
inputting the spliced semantic representations into a key point importance model of the key point importance module for attention weight processing, to obtain the attention weights corresponding to the full-text key point semantic representation and the hidden-layer semantic representations;
and performing weighting processing according to the attention weights corresponding to the hidden-layer semantic representations and the corresponding hidden-layer semantic representations, to determine the global key point representation weighted by the attention weights of the hidden-layer semantic representations.
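The attention weight processing of claim 5 may be sketched as follows. The single linear scoring layer `w` is an assumed stand-in for the key point importance model; any learnable scorer producing one score per spliced representation would fit the same flow.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(full_text_rep, hidden_reps, w):
    # Splice the full-text key point semantic representation onto each
    # hidden-layer representation, score each splice with a linear layer,
    # softmax the scores into attention weights, and return the weighted
    # sum of the hidden-layer representations.
    spliced = np.stack([np.concatenate([full_text_rep, h]) for h in hidden_reps])
    weights = softmax(spliced @ w)               # one weight per key point
    pooled = (weights[:, None] * np.stack(hidden_reps)).sum(axis=0)
    return weights, pooled

rng = np.random.default_rng(1)
d = 4
full_rep = rng.standard_normal(d)
hidden_reps = [rng.standard_normal(d) for _ in range(3)]
w = rng.standard_normal(2 * d)
weights, global_rep = attention_pool(full_rep, hidden_reps, w)
```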
6. The method according to claim 2, further comprising a spoken language scoring model, wherein the spoken language scoring model comprises a key-point-level semantic matching module, and the step of performing key point extraction processing on each of the plurality of model essay key points and the text sequence to determine the examinee answering key point corresponding to each model essay key point comprises:
splicing each of the plurality of model essay key points with the text sequence to obtain a plurality of spliced text sequences;
and inputting each of the plurality of spliced text sequences into a key point extraction model in the key-point-level semantic matching module for key point extraction processing, to obtain the examinee answering key point corresponding to each model essay key point.
7. The method of claim 1, wherein the step of obtaining a plurality of model essay key points of the spoken language test comprises:
acquiring an examination picture of a picture-viewing and speaking question corresponding to a spoken language examination;
carrying out picture coding processing on the test picture to obtain a picture coding result;
and generating a plurality of model essay key points of the test picture according to the picture coding result.
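The encode-then-generate flow of claim 7 can be illustrated with a toy sketch. The mean-pooling "encoder" and the retrieval from a hand-written key point bank are assumptions of this example; a real system would use a learned image encoder and a text generation model.

```python
import numpy as np

def encode_picture(picture):
    # Toy picture coding: mean-pool the pixel grid into a feature vector.
    # This placeholder only illustrates the encode-then-generate flow.
    return picture.mean(axis=(0, 1))

def generate_key_points(code, key_point_bank, k=2):
    # Retrieve the bank entries whose feature vectors lie closest to the
    # picture coding result, as the generated model essay key points.
    ranked = sorted(key_point_bank,
                    key=lambda kv: float(np.linalg.norm(code - kv[0])))
    return [text for _, text in ranked[:k]]

picture = np.full((4, 4, 3), 0.5)            # a 4x4 RGB test picture, mid-grey
bank = [(np.array([0.5, 0.5, 0.5]), "people are walking in a park"),
        (np.array([0.9, 0.1, 0.1]), "a red car is parked outside"),
        (np.array([0.4, 0.6, 0.5]), "children are playing a game")]
points = generate_key_points(encode_picture(picture), bank)
```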
8. The method of claim 1, further comprising a spoken language scoring model, the spoken language scoring model comprising a fusion module, wherein the step of scoring the examinee's answer according to the full-text speech level representation, the full-text language level representation, and the global key point representation to obtain the score of the examinee's answer comprises:
performing fusion processing on the full-text speech level representation, the full-text language level representation, and the global key point representation, to obtain a fused hidden-layer representation incorporating the key point representation;
and inputting the fused hidden-layer representation into a fusion model in the fusion module for nonlinear scoring processing, to obtain the score of the examinee's answer.
9. The method of claim 1, wherein the step of determining the full-text speech level representation of the examinee's answer according to the voice and the phoneme sequence comprises:
dividing the voice into a plurality of voice segments, and determining a plurality of phoneme segments corresponding to the plurality of voice segments;
coding and decoding each voice segment in the plurality of voice segments and each phoneme segment corresponding to each voice segment in the plurality of phoneme segments to obtain a hidden layer representation of each voice segment;
and processing the obtained hidden layer representations of the plurality of voice fragments according to the phoneme level to obtain a full-text voice level representation of the answer of the examinee.
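The segment-encode-pool flow of claim 9 can be sketched as follows. The deterministic per-segment "encoder" and the mean pooling are illustrative assumptions standing in for the claimed encoding-decoding and phoneme-level processing.

```python
import numpy as np

def segment(seq, n_segments):
    # Split a sequence into consecutive, roughly equal segments.
    bounds = np.linspace(0, len(seq), n_segments + 1).astype(int)
    return [seq[bounds[i]:bounds[i + 1]] for i in range(n_segments)]

def encode_segment(speech_seg, phoneme_seg, dim=6):
    # Stand-in for the per-segment encoding-decoding step: a deterministic
    # pseudo-random hidden vector keyed on the segment lengths.
    rng = np.random.default_rng(len(speech_seg) * 31 + len(phoneme_seg))
    return rng.standard_normal(dim)

def full_text_speech_rep(speech, phonemes, n_segments=3):
    # Segment the speech and its phoneme sequence in parallel, encode each
    # (speech segment, phoneme segment) pair into a hidden representation,
    # then pool the hidden vectors at phoneme level (mean pooling here,
    # as an illustrative choice).
    hidden = [encode_segment(s, p) for s, p in
              zip(segment(speech, n_segments), segment(phonemes, n_segments))]
    return np.stack(hidden).mean(axis=0)

speech = np.random.default_rng(2).standard_normal(30)   # 30 audio frames
phonemes = list("helloworld")                            # 10 phoneme labels
rep = full_text_speech_rep(speech, phonemes)
```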
10. A spoken language scoring model training method is characterized by comprising the following steps:
acquiring a training data set and an initial spoken language scoring model, wherein the training data set comprises a plurality of training samples in a spoken language test, and each training sample comprises the training voice of an examinee's answer, the label score of the examinee's answer, and a training text sequence of the examinee's full answer text and a training phoneme sequence of the examinee's full answer text determined according to the training voice;
inputting the training speech and the training phoneme sequence into a speech level modeling module of the initial spoken language scoring model for coding and decoding so as to determine a training full-text speech level representation answered by the examinee, and inputting the training text sequence into a language level modeling module of the initial spoken language scoring model for language extraction so as to determine a training full-text language level representation answered by the examinee;
acquiring a plurality of training model essay key points of the spoken language test, and inputting the plurality of training model essay key points and the training text sequence into a key point processing module of the initial spoken language scoring model for key point semantic processing, to determine the training global key point representation of the examinee's answer;
inputting the training full-text speech level representation, the training full-text language level representation, and the training global key point representation into a fusion module of the initial spoken language scoring model for scoring processing, to obtain the training score of the examinee's answer;
and updating the initial spoken language scoring model according to the training score and the label score to obtain a spoken language scoring model.
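The final updating step of claim 10 may be illustrated with a single-parameter-vector sketch. The linear scorer, squared-error loss, and learning rate are assumptions of the example, not the claimed model; the point is only that the gap between the training score and the label score drives the update.

```python
import numpy as np

def train_step(w, fused_rep, label_score, lr=0.05):
    # Score with the current weights, measure the squared error between
    # the training score and the label score, and take one gradient step.
    pred = float(w @ fused_rep)
    grad = 2.0 * (pred - label_score) * fused_rep
    return w - lr * grad, (pred - label_score) ** 2

rng = np.random.default_rng(3)
w = rng.standard_normal(5)                  # stand-in model parameters
x = rng.standard_normal(5)
x /= np.linalg.norm(x)                      # unit-norm fused representation
losses = []
for _ in range(50):
    w, loss = train_step(w, x, 0.8)         # 0.8 is the label score
    losses.append(loss)
```

With a unit-norm input, each step contracts the scoring error by a fixed factor, so the loss falls steadily toward zero.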
11. The method of claim 10, wherein the key point processing module comprises a key-point-level semantic matching module and a key point importance module, and the step of inputting the plurality of training model essay key points and the training text sequence into the key point processing module of the initial spoken language scoring model for key point semantic processing to determine the training global key point representation of the examinee's answer comprises:
inputting each of the plurality of training model essay key points and the training text sequence into a key point extraction model of the key-point-level semantic matching module for key point extraction processing, to determine the examinee answering training key point corresponding to each training model essay key point;
inputting each training model essay key point and the corresponding examinee answering training key point into a key point matching pre-training model of the key-point-level semantic matching module for key-point-level semantic extraction processing, to obtain the training full-text key point semantic representation of the examinee's answer and the training hidden-layer semantic representations of the context similarity between each training model essay key point and the corresponding examinee answering training key point;
and inputting the training full-text key point semantic representation and the training hidden-layer semantic representations into a key point importance model of the key point importance module for attention mechanism processing, to obtain the training global key point representation weighted by the attention weights of the training hidden-layer semantic representations.
12. The method of claim 11, further comprising: when the key point extraction processing is performed, obtaining a first similarity between each training model essay key point and the corresponding examinee answering training key point;
after obtaining the training hidden-layer semantic representations, the method further comprises:
inputting the training full-text key point semantic representation and the training hidden-layer semantic representations into a newly added linear mapping layer for linear mapping processing, to obtain a second similarity corresponding to each training hidden-layer semantic representation;
determining a second loss value according to the first similarity and the second similarity;
wherein the step of updating the initial spoken language scoring model according to the training score and the label score to obtain the spoken language scoring model comprises:
determining a first loss value according to the training score and the label score;
and updating the initial spoken language scoring model according to the first loss value and the second loss value, to obtain the spoken language scoring model.
13. The method of claim 12, wherein said step of updating said initial spoken language scoring model based on said first loss value and said second loss value to obtain a spoken language scoring model comprises:
determining a first coefficient and a second coefficient, and respectively performing weighted summation processing on the first loss value and the second loss value based on the first coefficient and the second coefficient to obtain an overall loss value;
updating the initial spoken language scoring model according to the overall loss value, wherein the sum of the first coefficient and the second coefficient is 1, and the first coefficient is greater than the second coefficient.
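The weighted summation of claim 13 can be written out directly. The 0.7/0.3 split is an assumed example satisfying the claimed constraints that the two coefficients sum to 1 and the first coefficient is greater than the second.

```python
def overall_loss(first_loss, second_loss, first_coeff=0.7):
    # Weighted sum of the scoring loss (first) and the similarity loss
    # (second); the first coefficient must dominate and the coefficients
    # must sum to 1, as required by claim 13.
    second_coeff = 1.0 - first_coeff
    assert first_coeff > second_coeff, "first coefficient must dominate"
    return first_coeff * first_loss + second_coeff * second_loss

total = overall_loss(0.5, 0.2)   # 0.7 * 0.5 + 0.3 * 0.2
```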
14. An answer scoring device, comprising:
the voice recognition method comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring voice of an examinee answering in a spoken language test, and a text sequence of a full text of the examinee answering and a phoneme sequence of the full text of the examinee answering which are determined according to the voice;
a speech representation unit for determining a full-text speech level representation of the answer of the examinee according to the speech and the phoneme sequence;
the language representation unit is used for determining full-text language level representation of the answer of the examinee according to the text sequence;
the second acquisition unit is used for acquiring a plurality of model essay key points of the spoken language test;
a key point representation unit for determining a global key point representation of the examinee's answer according to the plurality of model essay key points and the text sequence;
and a scoring unit for scoring the examinee's answer according to the full-text speech level representation, the full-text language level representation, and the global key point representation, to obtain the score of the examinee's answer.
15. A spoken language scoring model training device is characterized by comprising:
the system comprises a first training acquisition unit, a second training acquisition unit and a third training acquisition unit, wherein the first training acquisition unit is used for acquiring a training data set and an initial spoken language scoring model, the training data set comprises a plurality of training samples in a spoken language test, and each training sample comprises training voice answered by each examinee, label scoring answered by the examinee, a training text sequence of a full text answered by the examinee determined according to the training voice and a training phoneme sequence of the full text answered by the examinee;
the training speech representation unit is used for inputting the speech and the phoneme sequence into a speech level modeling module of the initial spoken language scoring model for coding and decoding processing so as to determine a training full-text speech level representation of the answer of the examinee;
the training language representation unit is used for inputting the training text sequence into a language level modeling module of the initial spoken language scoring model for language extraction processing so as to determine the full-text training language level representation answered by the examinee;
the second training acquisition unit is used for acquiring a plurality of training paradigm key points of the spoken language test;
the training key point expression unit is used for inputting a plurality of training model essay key points and the training text sequence into a key point processing module of the initial spoken language scoring module to perform key point semantic processing so as to determine the training global key point expression of the answer of the examinee;
the training scoring unit is used for inputting the full-text training speech level representation, the full-text training language level representation and the global training key representation into a fusion module of the initial spoken language scoring module for scoring processing so as to obtain a training score of the answer of the examinee;
and the updating unit is used for updating the initial spoken language scoring model according to the training score and the label score so as to obtain a spoken language scoring model.
16. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor for performing the steps of the method according to any one of claims 1-13.
17. A computer device, characterized in that it comprises a memory in which a computer program is stored and a processor which performs the steps in the method according to any one of claims 1-13 by calling the computer program stored in the memory.
CN202211672686.XA 2022-12-26 2022-12-26 Answer scoring method, model training method, device, storage medium and equipment Pending CN115905475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211672686.XA CN115905475A (en) 2022-12-26 2022-12-26 Answer scoring method, model training method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN115905475A (en) 2023-04-04

Family

ID=86479628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211672686.XA Pending CN115905475A (en) 2022-12-26 2022-12-26 Answer scoring method, model training method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN115905475A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination