CN117636368A - Correction method, device, equipment and medium - Google Patents


Info

Publication number: CN117636368A
Authority: CN (China)
Prior art keywords: image, text, answer, target, user
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202311665789.8A
Other languages: Chinese (zh)
Inventor: 姜建荣 (Jiang Jianrong)
Current Assignee: Shenzhen Xingtong Technology Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Shenzhen Xingtong Technology Co., Ltd.
Priority date: (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by Shenzhen Xingtong Technology Co., Ltd.
Priority to CN202311665789.8A
Publication of CN117636368A


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides a correction method, apparatus, device, and medium. The method includes: acquiring a user answer image corresponding to a target question and the reference answer text corresponding to the target question; performing text recognition on the user answer image to obtain an initial correction result based on the text recognition result and the reference answer text; performing multimodal matching on the user answer image and the reference answer text with a pre-trained multimodal matching model to obtain an image-text matching result; and obtaining a target correction result according to the initial correction result and the image-text matching result. The correction approach provided by the method is more reliable and effectively improves the accuracy of the final correction result.

Description

Correction method, device, equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a correction method, apparatus, device, and medium.
Background
With the development of artificial intelligence, technologies have emerged that can automatically correct students' answers on examination papers, homework, and other exercises, which not only saves teachers' labor costs but also shortens the time students wait for correction results. However, because users' handwritten answers are often non-standard and formulas are difficult to recognize, the reliability of existing automatic correction technology is poor: correction errors frequently occur and the accuracy of correction results is low.
Disclosure of Invention
To solve the above technical problems, or at least partially solve them, the present disclosure provides a correction method, apparatus, device, and medium.
According to an aspect of the present disclosure, there is provided a correction method, including: acquiring a user answer image corresponding to a target question and the reference answer text corresponding to the target question; performing text recognition on the user answer image to obtain an initial correction result based on the text recognition result and the reference answer text; performing multimodal matching on the user answer image and the reference answer text using a pre-trained multimodal matching model to obtain an image-text matching result; and obtaining a target correction result according to the initial correction result and the image-text matching result.
According to another aspect of the present disclosure, there is provided a correction apparatus, comprising: an acquisition module configured to acquire a user answer image corresponding to a target question and the reference answer text corresponding to the target question; an initial correction module configured to perform text recognition on the user answer image to obtain an initial correction result based on the text recognition result and the reference answer text; an image-text matching module configured to perform multimodal matching on the user answer image and the reference answer text using a pre-trained multimodal matching model to obtain an image-text matching result; and a target correction module configured to obtain a target correction result according to the initial correction result and the image-text matching result.
According to another aspect of the present disclosure, there is provided an electronic device, including: a processor; and a memory storing a program, wherein the program comprises instructions that, when executed by the processor, cause the processor to perform the correction method provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium, wherein the storage medium stores a computer program for executing the correction method provided by the present disclosure.
According to the technical solution provided by the embodiments of the present disclosure, text recognition can be performed on the acquired user answer image corresponding to the target question, so that an initial correction result is obtained based on the text recognition result and the reference answer text of the target question. On this basis, the pre-trained multimodal matching model is further used to perform multimodal matching on the user answer image and the reference answer text to obtain an image-text matching result, so that a more accurate and reliable target correction result is obtained according to the initial correction result and the image-text matching result. The correction approach provided by the embodiments of the present disclosure is more reliable and effectively improves the accuracy of the final correction result.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a correction method according to an embodiment of the disclosure;
fig. 2 is a schematic diagram of a correction flow provided in an embodiment of the disclosure;
fig. 3 is a schematic structural diagram of a correction device according to an embodiment of the disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "comprising" and variations thereof as used in this disclosure are open ended terms that include, but are not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an", and "one" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
Fig. 1 is a schematic flow chart of a correction method according to an embodiment of the present disclosure. The correction method may be performed by a correction apparatus, which may be implemented in software and/or hardware and may generally be integrated in an electronic device. As shown in fig. 1, the method mainly includes the following steps S102 to S108:
step S102, obtaining a user answer image corresponding to the target question and a reference answer text corresponding to the target question.
In practical applications, a user answers a target question; the user may photograph and upload the answer, or, in specific settings, a dedicated device may directly photograph the user's answer sheet, thereby obtaining the user answer image corresponding to the target question. The reference answer text corresponding to the target question may also be called the standard answer text or correct answer text; the user's answer to the target question is corrected against this reference answer text.
Step S104, performing text recognition on the user answer image to obtain an initial correction result based on the text recognition result and the reference answer text.
The embodiments of the disclosure do not limit the text recognition technology; for example, the PP-OCR algorithm (PaddlePaddle Optical Character Recognition, a text recognition technology based on the PaddlePaddle framework) may be used. By performing text recognition on the user answer image, a text recognition result, that is, the user answer text, is obtained, so that the initial correction result corresponding to the user answer image can be derived from the similarity between the text recognition result and the reference answer text, where the initial correction result indicates either that the user's answer is correct or that it is wrong.
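A minimal sketch of this step, assuming the open-source paddleocr package and a simple character-level similarity threshold (neither the engine, the similarity measure, nor the threshold value is prescribed by the disclosure; names are illustrative):

```python
from difflib import SequenceMatcher

from paddleocr import PaddleOCR  # assumed OCR engine; any OCR stack works

ocr = PaddleOCR(use_angle_cls=True, lang="ch")

def initial_correction(answer_image_path: str, reference_text: str,
                       threshold: float = 0.9) -> bool:
    """Return True (answer correct) when the recognized user answer text
    is sufficiently similar to the reference answer text."""
    pages = ocr.ocr(answer_image_path)
    # Concatenate recognized lines into the user answer text.
    recognized = "".join(line[1][0] for page in pages for line in page)
    similarity = SequenceMatcher(None, recognized, reference_text).ratio()
    return similarity >= threshold
```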
Step S106, performing multimodal matching on the user answer image and the reference answer text using the pre-trained multimodal matching model to obtain an image-text matching result.
The disclosed embodiments do not limit the structure of the multimodal matching model; any model that can process information of different modalities (such as images and text) may be used. Illustratively, the multimodal matching model comprises a multimodal processing network followed by a linear classification layer; the multimodal processing network performs multimodal processing on the user answer image and the reference answer text, and the linear classification layer performs linear classification on the processing result output by the multimodal processing network to obtain the image-text matching result. In some specific embodiments, the multimodal processing network may be a multimodal language understanding model that achieves unified multimodal modeling through Multiway Transformers, so as to process data of different modalities such as images, text, and audio at the same time. Specifically, the multimodal processing network can fuse data of different modalities through the Multiway Transformers while still accounting for the characteristics of each modality, which effectively improves the robustness and generalization of the model; in addition, the multimodal processing network can use a cross-modal association mechanism to link information of different modalities such as text and images, effectively improving the accuracy and interpretability of the model. Moreover, the multimodal processing network provided by the embodiments of the disclosure can adopt a lightweight Multiway Transformer structure, greatly reducing model parameters and computation and improving model efficiency and training speed. The embodiments of the disclosure are not limited to a particular structure of the multimodal processing network, which may, for example, include the multimodal foundation model BEiT-3 (also referred to as the BEiT-3 network). In conclusion, an effective and reliable image-text matching result can be obtained in this way.
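As a sketch under stated assumptions, the model reduces to a fused multimodal encoder plus a two-way linear head; the encoder interface below stands in for a BEiT-3-style backbone, whose real API differs by implementation:

```python
import torch
import torch.nn as nn

class MultimodalMatcher(nn.Module):
    """Multimodal processing network followed by a linear classification
    layer, as described above. The encoder is assumed to fuse the image
    and text and return one aligned feature vector per pair."""

    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder                      # e.g. a BEiT-3-style backbone
        self.classifier = nn.Linear(hidden_dim, 2)  # dims: [no-match, match]

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        aligned = self.encoder(image=image, text_ids=text_ids)
        return self.classifier(aligned)             # logits over the two dimensions

# At inference, a softmax over the logits yields the first confidence
# ("no match") and second confidence ("match") used in the matching
# steps described later.
```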
Step S108, obtaining a target correction result according to the initial correction result and the image-text matching result.
That is, considering that the randomness of the user's handwriting, the difficulty of formula recognition, and similar problems can all make the initial correction result wrong, the embodiments of the present disclosure do not directly use the initial correction result as the final correction result; instead, the image-text matching result between the user answer image and the reference answer text is used for further verification, ensuring the reliability of the final target correction result, which may or may not be consistent with the initial correction result. The correction approach provided by the embodiments of the disclosure is more reliable and effectively improves the accuracy of the final correction result.
The embodiments of the disclosure provide a way to train the multimodal matching model, which is obtained by training according to steps one and two below:
step one, obtaining a training sample set; the training sample set comprises a positive image-text sample pair and a negative image-text sample pair; matching answer image samples corresponding to the contained question samples by the positive image-text samples; answer image samples corresponding to the contained question samples are not matched with answer text samples.
The "answer image sample and answer text sample match" indicates that the answer content in the answer image sample is consistent with or similar to the answer text sample content and is greater than a preset threshold, and the "answer image sample and answer text sample do not match" indicates that the answer content in the answer image sample is inconsistent with or similar to the answer text sample content and is less than the preset threshold. In some specific embodiments, the positive text sample pair includes a correct answer image sample corresponding to the question sample and a reference answer text sample; the negative text sample pair comprises a wrong answer image sample and a reference answer text sample corresponding to the question sample, and/or the negative text sample pair comprises a correct answer image sample and a wrong answer text sample corresponding to the question sample.
It can be appreciated that the embodiments of the disclosure use not only positive image-text sample pairs but also negative image-text sample pairs for model training, so that the model learns both the characteristics of matched image-text pairs and those of unmatched ones; this contrastive learning effectively improves the model's ability to judge whether an image and a text match.
It will be appreciated that, trained in the manner described above, the model can usually recognize the matching relationship between a simple image and text. For example, if the user's handwritten content in the answer image is f(x) = x + 3 and the reference answer text is f(x) = x², the model easily recognizes that the user answer image and the reference answer text do not match. However, if the user answer image shows f(x) = x + 3 and the reference answer text is f(x) = x + 8, such similar-looking but substantively different pairs are easily confused. To improve the model's character recognition capability and give it fine-grained feature discrimination, in other words, so that the trained multimodal matching model retains strong discrimination on hard samples that do not actually match yet are highly similar, and to further improve the accuracy of the multimodal matching model's output, the negative image-text samples introduced by the embodiments of the present disclosure may be hard samples that are not easy to distinguish. Illustratively, negative image-text sample pairs are obtained through steps 1 and 2 below:
step 1, obtaining similarity calculation results between N answer image samples and N answer text samples in N positive image-text sample pairs to obtain N 2 A personal graph-text similarity; wherein N is a preset positive integer. Such as N positive text sample pairs comprising: answer image sample 1-answer text sample 1, answer image sample 2-answer text sample 2, answer image sample 3-answer text sample 3, … … answer image sample N-answer text sample N, then for answer image sample 1, respectively calculating the image-text similarity between answer image sample 1 and answer text sample 1, answer text sample 2, answer text sample 3 … … answer text sample N, for answer image sample 2, respectively calculating the image-text similarity between answer image sample 2 and answer text sample 1, answer text sample 2, answer text sample 3 … … answer text sample N, and by analogy, respectively performing similarity calculation between N answer image samples and N answer text samples to obtain N 2 And (5) image-text similarity.
Step 2, based on the N² image-text similarities, extracting unmatched answer image samples and answer text samples from the N answer image samples and N answer text samples, and combining them into negative image-text sample pairs. The embodiments of the disclosure do not limit the number of extracted negative pairs. In some implementation examples, the image-text similarity between the answer image sample and the answer text sample in a negative pair is higher than a preset threshold; that is, the two are similar but do not actually match, and can be regarded as a hard sample pair. In this way, the embodiments of the disclosure not only train on the N mutually matched positive image-text sample pairs but also use those N positive pairs to mine additional highly similar yet unmatched hard negatives, strengthening the model's ability to discriminate hard negative samples. For ease of understanding, some specific implementation examples may refer to (1) and/or (2) below:
(1) For each of the N answer image samples, obtain, based on the N² image-text similarities, the top M target text samples with the highest image-text similarity to that answer image sample, where a target text sample is an answer text sample among the N that does not match the answer image sample, and M is a preset positive integer. In this way, N×M negative image-text sample pairs are obtained.
(2) For each answer text sample in the N positive image-text sample pairs, obtain, based on the N² image-text similarities, the top K target image samples with the highest image-text similarity to that answer text sample, where a target image sample is an answer image sample among the N that does not match the answer text sample, and K is a preset positive integer. In this way, N×K negative image-text sample pairs are obtained.
In practice, (1) and/or (2) can be chosen flexibly as needed; if both are used, N×M + N×K negative image-text sample pairs are obtained, and M and K may be the same or different, which is not limited here. The negative image-text sample pairs obtained this way are hard sample pairs, which help the trained model discriminate more strongly between similar but unmatched images and texts, as sketched below.
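A sketch of this hard-negative mining under stated assumptions: the image and text embeddings for the N positive pairs are already computed and L2-normalized, so a dot product gives the image-text similarity; all names are illustrative.

```python
import torch

def mine_hard_negatives(img_emb: torch.Tensor,  # (N, D), L2-normalized
                        txt_emb: torch.Tensor,  # (N, D), L2-normalized
                        M: int, K: int) -> list[tuple[int, int]]:
    """Return (image_index, text_index) pairs of similar but unmatched
    samples, following modes (1) and (2) above."""
    sim = img_emb @ txt_emb.T                # (N, N): all N^2 similarities
    sim.fill_diagonal_(float("-inf"))        # matched pairs sit on the diagonal
    # (1) For each answer image, the top-M most similar non-matching texts.
    top_txt = sim.topk(M, dim=1).indices     # (N, M)
    # (2) For each answer text, the top-K most similar non-matching images.
    top_img = sim.topk(K, dim=0).indices     # (K, N)
    negatives = [(i, int(j)) for i in range(sim.size(0)) for j in top_txt[i]]
    negatives += [(int(i), t) for t in range(sim.size(1)) for i in top_img[:, t]]
    return negatives                         # up to N*M + N*K hard negative pairs
```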
Step two, training a preset initial model with the training sample set to obtain the multimodal matching model.
In this way, the training sample set contains both positive and negative image-text sample pairs. Trained on such a set in a contrastive language-image pre-training fashion, the model learns the characteristics of matched image-text pairs as well as unmatched ones, and this contrastive learning effectively improves its ability to judge whether an image and a text match, yielding a reliable multimodal matching model. In addition, this approach helps the multimodal matching model learn the features of irregular, free-form handwritten answers and of more complex formulas, so that features can be reliably matched against the corresponding reference answer text and an accurate image-text matching result is ultimately output.
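A minimal training sketch for step two, assuming a DataLoader that yields (image, text_ids, label) batches with label 1 for positive pairs and 0 for the mined negatives; it reuses the MultimodalMatcher sketched earlier and standard cross-entropy, which the disclosure does not prescribe:

```python
import torch
import torch.nn as nn

def train_epoch(model: nn.Module, loader, optimizer,
                device: str = "cuda") -> None:
    criterion = nn.CrossEntropyLoss()
    model.train()
    for image, text_ids, label in loader:
        image = image.to(device)
        text_ids = text_ids.to(device)
        label = label.to(device)             # 1 = matched pair, 0 = unmatched
        logits = model(image, text_ids)      # (B, 2) no-match/match logits
        loss = criterion(logits, label)      # contrasts matched vs. unmatched pairs
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```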
The embodiments of the disclosure further provide a specific way for the multimodal matching model to perform multimodal matching on the user answer image and the reference answer text to obtain the image-text matching result: the multimodal matching model may perform steps a and b below and then determine the image-text matching result through step c1 or step c2:
and a step a, performing feature extraction and alignment operation based on the user answer image and the reference answer text to obtain alignment features. Specifically, feature extraction may be performed based on the user answer image and the reference answer text, respectively, and alignment features between the user answer image and the reference answer text may be obtained based on the extracted features of the user answer image (abbreviated as image features) and features of the reference answer text (abbreviated as text features). Illustratively, the multimodal matching model may include a BEIT3 network, and step a above may be implemented by an encoder in the BEIT3 network. In addition, before feature extraction is performed based on the user answer image and the reference answer text, respectively, preprocessing may be performed for the user answer image and the reference answer text, respectively, in a format required for performing the feature extraction operation, which is not limited herein.
Step b, performing linear classification based on the aligned features to obtain a linear classification result, which can be used to determine the image-text matching result; the image-text matching result indicates whether the user answer image matches the reference answer text. Illustratively, the multimodal matching model may include a BEiT-3 network and a linear classification layer, the latter implementing the linear classification of this step.
Illustratively, the linear classification result includes a first confidence corresponding to the "no match" dimension and a second confidence corresponding to the "match" dimension; that is, the linear classification layer predicts, from the aligned features, a confidence for each of the two dimensions, so that whether the user answer image matches the reference answer text is determined by comparing the two confidences. The "match" dimension can be understood as the user's answer content in the user answer image being consistent with the reference answer text, and the "no match" dimension as it being inconsistent. Specifically, determining the image-text matching result from the linear classification result may refer to step c1 or step c2 below:
Step c1, when the first confidence is greater than the second confidence, determining that the image-text matching result indicates that the user answer image does not match the reference answer text.
Step c2, when the first confidence is less than the second confidence, determining that the image-text matching result indicates that the user answer image matches the reference answer text.
Through this confidence comparison, a reasonable and reliable image-text matching result can be obtained.
On this basis, step S108, that is, obtaining the target correction result according to the initial correction result and the image-text matching result, may be performed according to cases (1) and (2) below:
if the image-text matching result indicates that the user answer image is not matched with the reference answer text and the initial correction result indicates that the user answer is correct, determining that the target correction result indicates that the user answer is wrong if the first confidence coefficient is larger than a first preset threshold; if the first confidence coefficient is not greater than a first preset threshold value, determining that the target correction result indicates that the user answers correctly. That is, if the image-text matching result is inconsistent with the initial correction result, in order to fully ensure the reliability of the final result, the reliability of the image-text matching result is stronger only when the first confidence coefficient corresponding to the image-text matching result is greater than the first preset threshold value, and at this time, the target correction result is determined to be consistent with the image-text matching result, thereby modifying the initial correction result. If the first confidence coefficient is not greater than the first preset threshold value, determining that the target correction result indicates that the user is correct in response, and keeping the finally obtained target correction result consistent with the initial correction result.
Case (2): the image-text matching result indicates that the user answer image matches the reference answer text, while the initial correction result indicates that the user's answer is wrong. If the second confidence is greater than a second preset threshold, the target correction result is determined to indicate that the user's answer is correct; if it is not, the target correction result is determined to indicate that the user's answer is wrong. The values of the first and second preset thresholds are not limited; they may be the same or different, for example 0.9.
In this way, constraining the confidences with preset thresholds fully ensures the reliability of the image-text matching result, which in turn ensures the accuracy of the final target correction result and effectively avoids cases where the image-text matching result wrongly overrides the initial correction result. Moreover, when the initial correction result is wrong because the user's handwriting is non-standard or the content contains hard-to-recognize handwritten mathematical formulas, a high-reliability image-text matching result can effectively correct it, improving correction accuracy and acting as a safety net for correction.
In addition, obtaining the target correction result according to the initial correction result and the image-text matching result may also be implemented according to cases (3) and (4) below:
Case (3): when the image-text matching result indicates that the user answer image matches the reference answer text and the initial correction result indicates that the user's answer is correct, the target correction result is determined to indicate that the user's answer is correct.
Case (4): when the image-text matching result indicates that the user answer image does not match the reference answer text and the initial correction result indicates that the user's answer is wrong, the target correction result is determined to indicate that the user's answer is wrong.
In other words, in this way the image-text matching result further helps confirm whether the initial correction result is accurate, effectively avoiding correction errors. The four cases are combined in the sketch below.
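A sketch of this secondary correction, combining cases (1) through (4); the threshold defaults follow the 0.9 mentioned above as an example, and the function name is illustrative:

```python
def target_correction(initial_correct: bool,
                      first_conf: float,   # confidence of the "no match" dimension
                      second_conf: float,  # confidence of the "match" dimension
                      t1: float = 0.9,
                      t2: float = 0.9) -> bool:
    """Return True if the target correction result is 'answer correct'."""
    matched = second_conf > first_conf       # step c1 / c2
    if matched == initial_correct:
        # Cases (3) and (4): the two results agree; keep the initial result.
        return initial_correct
    if not matched and initial_correct:
        # Case (1): override only when "no match" is confident enough.
        return not (first_conf > t1)
    # Case (2): matched but initially judged wrong; override only when
    # "match" is confident enough.
    return second_conf > t2
```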
For ease of understanding, reference may also be made to the schematic correction flow shown in fig. 2. The acquired user answer image and reference answer text can be processed along two branches. One branch first performs text recognition on the user answer image to obtain the user answer text (the text recognition result above), then performs the initial correction against the reference answer text, for example by computing the similarity between the reference answer text and the user answer text, yielding an initial correction result that indicates whether the user's answer is correct or wrong. The other branch uses the multimodal matching model to directly match the two different modalities, namely the user answer image and the reference answer text, yielding an image-text matching result that indicates whether they match. The initial correction result can then be secondarily corrected based on the image-text matching result (see the details of step S108 above), finally yielding the target correction result, which may or may not be consistent with the initial correction result. In this way, combining the two means of text similarity comparison and image-text matching to judge whether the user's answer is correct gives a more accurate and reliable result.
Further, the embodiments of the present disclosure provide an implementation example of step S102, i.e., obtaining the user answer image corresponding to the target question and the reference answer text corresponding to the target question, which may be performed with reference to steps A through D below:
and step A, acquiring a target page image. The target page image may be, for example, a test paper image, a job image, or the like obtained by photographing. The target page image contains one or more target topics and the user responds to the topic targets.
Step B, performing layout analysis on the target page image to determine the question areas it contains and the user answer area corresponding to each question area.
The embodiments of the disclosure do not limit the algorithm used for layout analysis; for example, the PP-PicoDet detection algorithm, an ultra-lightweight object detection algorithm, can be used to detect regions in the target page image, efficiently and conveniently obtaining the question areas it contains and the user answer areas corresponding to them. Some specific implementation examples may refer to steps B1 to B3 below:
step B1, performing preprocessing operations on the target page image, including but not limited to: one or more of an image rotation operation, an image resolution scaling operation, an image normalization operation, to convert the target page image to an image in a target format.
Step B2, using a layout analysis algorithm to locate, in the preprocessed target page image, the rectangular boxes of all questions to be answered and the rectangular boxes of the answers written by the user.
Step B3, determining each question area and its corresponding user answer area based on the rectangular boxes of all questions and of all user answers. For example, for each question box, compute the center-point distance to each of the user's answer boxes and select the answer box with the smallest distance as the answer belonging to that question; if the smallest center-point distance exceeds a preset distance threshold, the question is considered unanswered. In practice, every question in the target page image can be treated as a target question, or a subset can be selected as needed, which is not limited here; see the sketch after this paragraph.
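A sketch of the center-point matching in step B3, assuming boxes in (x1, y1, x2, y2) pixel coordinates; the distance threshold value is illustrative:

```python
import math

def center(box: tuple) -> tuple:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def match_answers(question_boxes: list, answer_boxes: list,
                  max_dist: float = 200.0) -> dict:
    """Map each question box index to the nearest answer box index,
    or to None when the question is considered unanswered."""
    pairs = {}
    for qi, qbox in enumerate(question_boxes):
        if not answer_boxes:
            pairs[qi] = None
            continue
        qx, qy = center(qbox)
        distances = [math.hypot(qx - ax, qy - ay)
                     for ax, ay in map(center, answer_boxes)]
        best = min(range(len(distances)), key=distances.__getitem__)
        # Even the nearest answer box is too far away: no answer.
        pairs[qi] = best if distances[best] <= max_dist else None
    return pairs
```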
Step C, retrieving, from a preset page image library, the reference page image corresponding to the target page image, where the questions in the reference page image are consistent with those in the target page image. For example, the target page image is a user's answered image of a certain test paper, and the page image library contains images of many test papers. In practice, the page image library contains a number of original reference page images; a reference page image contains questions and their reference answer texts, while the target page image contains questions and the user's answers to them. Compared with searching question by question, the approach adopted by the embodiments of the disclosure is more efficient, since all the questions on a whole page can be retrieved at once. Moreover, textbooks differ across regions, questions emphasize different points, and even the same question may be solved in different ways; because the embodiments of the disclosure retrieve the whole page image directly, the final retrieval result is more accurate: the original paper consistent with the target page image is found directly, better avoiding the retrieval errors that searching for a single question may cause.
In some specific implementation examples, step C may be performed with reference to the following steps:
and step C1, obtaining target image characteristics corresponding to the target page image. In practical application, the preprocessing operation required by the feature extraction algorithm can be executed on the target page image, and then the preprocessed image is input into the feature extraction algorithm, so that the target image feature is obtained. The embodiment of the disclosure does not limit the feature extraction algorithm, and can, for example, perform feature extraction on the preprocessed target page image by means of a MobileNet model (a lightweight deep neural network), so as to extract the image feature vector with distinguishing property.
Step C2, obtaining the reference image features corresponding to each reference page image in the preset page image library. In practice, features can be extracted from each reference page image in the library to form a page feature library corresponding to the page image library, so that features matching the target image features can be searched directly in the page feature library.
Step C3, determining the reference page image corresponding to the target page image based on the similarity between the target image features and each of the reference image features. For example, the cosine similarity between the target image features and each reference image feature can be computed, and the reference page image whose features have the greatest similarity taken as the reference page image corresponding to the target page image, as sketched below.
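A sketch of steps C1 to C3 as nearest-neighbor retrieval over precomputed page features; the feature extractor is abstracted away, and names are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_reference_page(target_feat: np.ndarray,
                            page_feature_library: list) -> int:
    """Return the index of the reference page image whose features are
    most similar to the target page image's features."""
    sims = [cosine(target_feat, ref_feat) for ref_feat in page_feature_library]
    return int(np.argmax(sims))
```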
Step D, determining the target question corresponding to a question area in the target page image and the reference answer text corresponding to that target question based on the question content information, question position information, and reference answer information carried by the reference page image, and obtaining the user answer image corresponding to the target question based on the user answer area corresponding to that question area.
It can be understood that the reference page image generally carries, or is annotated with, label information such as the question content information, question position information, and reference answer information corresponding to each question. Once the reference page image has been retrieved, questions can be matched up accurately and reliably based on the question position information it carries and the positions of the question areas detected in the target page image, for example again by center-point distance calculation between the position of each question area in the target page image and the position information of each question in the reference page image, since in theory the same question occupies consistent positions in the two images. In this way, the target question corresponding to each question area in the target page image and its reference answer text are conveniently obtained; the user answer area corresponding to the target question can then be cropped, and the user answer image obtained from the cropped user answer area.
It should be noted that the above is only one specific implementation example and is not limiting. For example, in practice, text recognition may also be performed directly on the questions in the target page image to obtain each question's information, or the corresponding question and reference answer text may be retrieved directly from a question bank based on the identifier of the recognized target question.
In the manner described above, the user answer image corresponding to the target question and the reference answer text corresponding to the target question are obtained; text recognition is then performed on the acquired user answer image so that an initial correction result is obtained based on the text recognition result and the reference answer text of the target question. On this basis, the pre-trained multimodal matching model is further used to perform multimodal matching on the user answer image and the reference answer text to obtain an image-text matching result, and integrating the initial correction result and the image-text matching result yields a more accurate and reliable target correction result. Even when the initial correction result is wrong due to factors such as non-standard handwritten answers or hard-to-recognize formulas, the secondary correction by the image-text matching result can substantially improve the accuracy of the final correction result.
Corresponding to the foregoing correction method, the embodiments of the disclosure further provide a correction device. Fig. 3 is a schematic structural diagram of the correction device provided by the embodiments of the disclosure; the device may be implemented in software and/or hardware and may generally be integrated in an electronic device. As shown in fig. 3, the correction device 300 includes:
the acquiring module 302 is configured to acquire a user answer image corresponding to a target question and a reference answer text corresponding to the target question;
the initial correction module 304 is configured to perform text recognition on the user answer image, so as to obtain an initial correction result based on the text recognition result and the reference answer text;
the image-text matching module 306 is configured to perform multi-mode matching on the answer image of the user and the reference answer text by using the multi-mode matching model obtained by pre-training, so as to obtain an image-text matching result;
the target correction module 308 is configured to obtain a target correction result according to the initial correction result and the image-text matching result.
With the device provided by the embodiments of the disclosure, text recognition can be performed on the acquired user answer image corresponding to the target question so that an initial correction result is obtained based on the text recognition result and the reference answer text of the target question. On this basis, the pre-trained multimodal matching model is further used to perform multimodal matching on the user answer image and the reference answer text to obtain an image-text matching result, and a more accurate and reliable target correction result is then obtained according to the initial correction result and the image-text matching result. The correction device provided by the embodiments of the disclosure is more reliable and effectively improves the accuracy of the final correction result.
In some embodiments, the apparatus further comprises a model training module for training the multimodal matching model according to the following steps: acquiring a training sample set, wherein the training sample set comprises positive image-text sample pairs and negative image-text sample pairs; a positive image-text sample pair comprises the correct answer image sample and the reference answer text sample corresponding to a question sample; a negative image-text sample pair comprises the wrong answer image sample and the reference answer text sample corresponding to a question sample, and/or the correct answer image sample and a wrong answer text sample corresponding to a question sample; and training a preset initial model with the training sample set to obtain the multimodal matching model.
In some embodiments, the model training module is further specifically configured to: obtain the similarity between the N answer image samples and the N answer text samples in N positive image-text sample pairs, yielding N² image-text similarities, where N is a preset positive integer; and, based on the N² image-text similarities, extract unmatched answer image samples and answer text samples from the N answer image samples and the N answer text samples and combine them into negative image-text sample pairs.
In some embodiments, the model training module is further specifically configured to: for each of the N answer image samples, obtain, based on the N² image-text similarities, the top M target text samples with the highest image-text similarity to that answer image sample, where a target text sample is an answer text sample among the N that does not match the answer image sample and M is a preset positive integer; and/or, for each answer text sample in the N positive image-text sample pairs, obtain, based on the N² image-text similarities, the top K target image samples with the highest image-text similarity to that answer text sample, where a target image sample is an answer image sample among the N that does not match the answer text sample and K is a preset positive integer.
In some embodiments, the image-text matching module 306 is specifically configured to: perform feature extraction and alignment based on the user answer image and the reference answer text to obtain aligned features; perform linear classification based on the aligned features to obtain a linear classification result, which comprises a first confidence corresponding to the "no match" dimension and a second confidence corresponding to the "match" dimension; when the first confidence is greater than the second confidence, determine that the image-text matching result indicates that the user answer image does not match the reference answer text; and when the first confidence is less than the second confidence, determine that the image-text matching result indicates that the user answer image matches the reference answer text.
In some embodiments, the target correction module 308 is specifically configured to: when the image-text matching result indicates that the user answer image does not match the reference answer text and the initial correction result indicates that the user's answer is correct, determine that the target correction result indicates the user's answer is wrong if the first confidence is greater than a first preset threshold, and that it indicates the user's answer is correct otherwise; and, when the image-text matching result indicates that the user answer image matches the reference answer text and the initial correction result indicates that the user's answer is wrong, determine that the target correction result indicates the user's answer is correct if the second confidence is greater than a second preset threshold, and that it indicates the user's answer is wrong otherwise.
In some embodiments, the target correction module 308 is specifically configured to: determine that the target correction result indicates the user's answer is correct when the image-text matching result indicates that the user answer image matches the reference answer text and the initial correction result indicates that the user's answer is correct; and determine that the target correction result indicates the user's answer is wrong when the image-text matching result indicates that the user answer image does not match the reference answer text and the initial correction result indicates that the user's answer is wrong.
In some embodiments, the obtaining module 302 is specifically configured to: acquire a target page image; perform layout analysis on the target page image to determine the question areas it contains and the user answer area corresponding to each question area; retrieve, from a preset page image library, the reference page image corresponding to the target page image, where the questions in the reference page image are consistent with those in the target page image; and determine the target question corresponding to a question area in the target page image and the reference answer text corresponding to the target question based on the question content information, question position information, and reference answer information carried by the reference page image, and obtain the user answer image corresponding to the target question based on the user answer area corresponding to the question area.
In some embodiments, the obtaining module 302 is specifically configured to: acquire the target image features corresponding to the target page image; acquire the reference image features corresponding to each reference page image in the preset page image library; and determine the reference page image corresponding to the target page image based on the similarity between the target image features and each of the reference image features.
In some embodiments, the multimodal matching model includes a multimodal processing network followed by a linear classification layer; the multimodal processing network performs multimodal processing based on the user answer image and the reference answer text, and the linear classification layer performs linear classification based on the processing result output by the multimodal processing network to obtain the image-text matching result.
In some embodiments, the multimodal processing network comprises a BEiT-3 network.
The correction device provided by the embodiments of the disclosure can execute the correction method provided by any embodiment of the disclosure, and has functional modules corresponding to, and the beneficial effects of, that method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described apparatus embodiments may refer to corresponding procedures in the method embodiments, which are not described herein again.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The exemplary embodiments of the present disclosure also provide an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor for causing the electronic device to perform a method according to embodiments of the present disclosure when executed by the at least one processor.
The present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to embodiments of the disclosure.
The computer program product may write program code for performing the operations of embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, on which computer program instructions are stored, which, when being executed by a processor, cause the processor to perform the correction method provided by the embodiments of the present disclosure. The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Referring to fig. 4, a block diagram of an electronic device 400 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the electronic device 400 includes a computing unit 401 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In RAM 403, various programs and data required for the operation of device 400 may also be stored. The computing unit 401, ROM 402, and RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Various components in the electronic device 400 are connected to the I/O interface 405, including: an input unit 406, an output unit 407, a storage unit 408, and a communication unit 409. The input unit 406 may be any type of device capable of inputting information to the electronic device 400; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 407 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 408 may include, but is not limited to, magnetic disks and optical disks. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.
The computing unit 401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 401 performs the respective methods and processes described above. For example, in some embodiments, the correction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. In some embodiments, the computing unit 401 may be configured to perform the correction method by any other suitable means (e.g., by means of firmware).
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A correction method, comprising:
acquiring a user answer image corresponding to a target question and a reference answer text corresponding to the target question;
performing text recognition on the user answer image to obtain an initial correction result based on a text recognition result and the reference answer text;
performing multimodal matching on the user answer image and the reference answer text by using the multimodal matching model obtained by pre-training, to obtain an image-text matching result;
and obtaining a target correction result according to the initial correction result and the image-text matching result.
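For orientation only (this sketch is not part of the claims), the claim-1 flow can be read as the following minimal Python structure. Every callable name here is a hypothetical placeholder injected by the caller, and the exact-match string comparison used for the initial correction is an assumption the disclosure does not prescribe.

```python
# Illustrative sketch of the claim-1 flow; all callables are assumed
# placeholders, not APIs defined by the disclosure.
def correct(answer_image, reference_text, ocr, matcher, fuse):
    """Produce a target correction result for one target question."""
    recognized = ocr(answer_image)                              # text recognition
    initial_ok = recognized.strip() == reference_text.strip()   # initial correction (assumed rule)
    matched, confidences = matcher(answer_image, reference_text)  # multimodal matching
    return fuse(initial_ok, matched, *confidences)              # target correction result
```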
2. The correction method as claimed in claim 1, wherein performing multimodal matching on the user answer image and the reference answer text to obtain an image-text matching result comprises:
performing feature extraction and an alignment operation based on the user answer image and the reference answer text to obtain alignment features;
performing linear classification based on the alignment features to obtain a linear classification result; the linear classification result comprises a first confidence coefficient corresponding to the unmatched dimension and a second confidence coefficient corresponding to the matched dimension;
determining that an image-text matching result indicates that the user answer image is not matched with the reference answer text under the condition that the first confidence coefficient is larger than the second confidence coefficient;
and under the condition that the first confidence coefficient is smaller than the second confidence coefficient, determining that an image-text matching result indicates that the user answer image is matched with the reference answer text.
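One concrete reading of claim 2 is a two-logit linear head over the aligned features, with a softmax producing the unmatched/matched confidences; whichever dominates decides the match. A minimal PyTorch-style sketch under that assumption (the feature extraction and alignment modules are abstracted away, and `MatchHead` is a hypothetical name):

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Two-class linear classification over aligned image-text features (claim-2 sketch)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 2)  # index 0: unmatched, index 1: matched

    def forward(self, aligned_feat: torch.Tensor) -> torch.Tensor:
        # Softmax turns the two logits into the first/second confidence coefficients.
        return torch.softmax(self.classifier(aligned_feat), dim=-1)

def is_match(confidences: torch.Tensor) -> bool:
    """Apply claim 2's comparison to one (first_conf, second_conf) pair."""
    first_conf, second_conf = confidences[0].item(), confidences[1].item()
    return second_conf > first_conf  # matched only if the matched dimension dominates
```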
3. The correction method of claim 2, wherein obtaining a target correction result according to the initial correction result and the image-text matching result comprises:
when the image-text matching result indicates that the user answer image is not matched with the reference answer text and the initial correction result indicates that the user answer is correct, if the first confidence coefficient is larger than a first preset threshold value, determining that the target correction result indicates that the user answer is wrong; if the first confidence coefficient is not greater than a first preset threshold value, determining that the target correction result indicates that the user answer is correct;
when the image-text matching result indicates that the user answer image is matched with the reference answer text and the initial correction result indicates that the user answer is wrong, if the second confidence coefficient is larger than a second preset threshold value, determining that the target correction result indicates that the user answer is correct; and if the second confidence coefficient is not greater than the second preset threshold value, determining that the target correction result indicates that the user answer is wrong.
4. The correction method according to claim 1, wherein obtaining a target correction result according to the initial correction result and the image-text matching result comprises:
when the image-text matching result indicates that the user answer image is matched with the reference answer text and the initial correction result indicates that the user answer is correct, determining that the target correction result indicates that the user answer is correct;
and under the condition that the image-text matching result indicates that the user answer image is not matched with the reference answer text and the initial correction result indicates that the user answer is wrong, determining that the target correction result indicates that the user answer is wrong.
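Claims 3 and 4 together form a complete decision table: when the two signals agree, the shared verdict stands (claim 4); when they disagree, the dominant confidence is compared against a preset threshold before the initial verdict is overturned (claim 3). A sketch of that fusion, where the default threshold values are illustrative assumptions since the disclosure leaves them open:

```python
def fuse(initial_ok: bool, matched: bool,
         first_conf: float, second_conf: float,
         t1: float = 0.9, t2: float = 0.9) -> bool:
    """Return True if the target correction result marks the answer correct.

    t1/t2 are the first/second preset thresholds of claim 3; the default
    values here are assumptions, not values fixed by the disclosure.
    """
    if matched == initial_ok:
        return initial_ok                # claim 4: both signals agree, keep the verdict
    if initial_ok and not matched:
        return not (first_conf > t1)     # claim 3: overturn to wrong only if confident
    return second_conf > t2              # claim 3: overturn to correct only if confident
```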
5. The correction method of claim 1, wherein obtaining the user answer image corresponding to the target question and the reference answer text corresponding to the target question comprises:
acquiring a target page image;
performing layout analysis on the target page image to determine a question area contained in the target page image and a user response area corresponding to the question area;
searching a reference page image corresponding to the target page image from a preset page image library; wherein the questions in the reference page image are consistent with the questions in the target page image;
and determining a target question corresponding to a question area in the target page image and a reference answer text corresponding to the target question based on the question content information, the question position information and the reference answer information carried by the reference page image, and obtaining a user answer image corresponding to the target question based on a user answer area corresponding to the question area.
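Operationally, claim 5 amounts to: detect question and answer regions on the page, retrieve the matching reference page, then map the reference metadata onto the detected regions. The sketch below assumes hypothetical `detect_layout`, `find_reference`, and `lookup` helpers and a PIL-style `crop`; none of these names come from the disclosure.

```python
def acquire_pairs(page_image, page_library, detect_layout, find_reference):
    """Claim-5 sketch; every helper used here is an assumed placeholder."""
    regions = detect_layout(page_image)                   # [(question_box, answer_box), ...]
    ref_page = find_reference(page_image, page_library)   # the claim-6 lookup
    pairs = []
    for question_box, answer_box in regions:
        # The reference page carries question content, position, and answer metadata.
        target_question, reference_answer = ref_page.lookup(question_box)
        user_answer_image = page_image.crop(answer_box)   # user answer image for the question
        pairs.append((target_question, reference_answer, user_answer_image))
    return pairs
```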
6. The correction method of claim 5, wherein searching for a reference page image corresponding to the target page image from a preset page image library comprises:
acquiring target image features corresponding to the target page image;
acquiring reference image features corresponding to each reference page image in the preset page image library;
and determining a reference page image corresponding to the target page image based on the similarity between the target image features and the reference image features.
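Claim 6 only requires some similarity between image features; cosine similarity is a natural but assumed choice. A numpy sketch:

```python
import numpy as np

def find_reference_index(target_feat: np.ndarray, ref_feats: np.ndarray) -> int:
    """Return the index of the most similar reference page image.

    target_feat: (d,) feature of the target page; ref_feats: (n, d) features of
    the preset library. Cosine similarity is an assumption; the claim only asks
    for a similarity between target and reference image features.
    """
    target = target_feat / np.linalg.norm(target_feat)
    refs = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return int(np.argmax(refs @ target))
```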
7. The correction method according to claim 1, wherein the multimodal matching model is trained by:
acquiring a training sample set; wherein the training sample set comprises a positive image-text sample pair and a negative image-text sample pair; the positive image-text sample pair comprises a correct answer image sample and a reference answer text sample corresponding to a question sample; the negative image-text sample pair comprises a wrong answer image sample and a reference answer text sample corresponding to the question sample, and/or the negative image-text sample pair comprises a correct answer image sample and a wrong answer text sample corresponding to the question sample;
and training a preset initial model with the training sample set to obtain the multimodal matching model.
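Training under claim 7 reduces to a binary matched/unmatched objective over mixed positive and negative pairs. A minimal PyTorch-style training step, assuming a model whose head outputs the two raw logits behind claim 2's confidences; the model API is an assumption:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, texts, labels):
    """One step over a batch of image-text pairs (claim-7 sketch).

    labels: LongTensor with 1 for positive (matched) pairs and 0 for
    negative (unmatched) pairs.
    """
    optimizer.zero_grad()
    logits = model(images, texts)            # (batch, 2) raw logits, pre-softmax
    loss = F.cross_entropy(logits, labels)   # two-class matched/unmatched objective
    loss.backward()
    optimizer.step()
    return loss.item()
```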
8. The correction method of claim 7, wherein the negative image-text sample pair is obtained by:
obtaining similarity calculation results between the N answer image samples and the N answer text samples in N positive image-text sample pairs, to obtain N² image-text similarities, wherein N is a preset positive integer;
and extracting, based on the N² image-text similarities, unmatched answer image samples and answer text samples from the N answer image samples and the N answer text samples, and combining them to obtain a negative image-text sample pair.
9. The correction method according to claim 8, wherein extracting unmatched answer image samples and answer text samples from the N answer image samples and the N answer text samples comprises:
for each of the N answer image samples, obtaining, based on the N² image-text similarities, the top M target text samples with the highest image-text similarity to the answer image sample, wherein the target text samples are answer text samples, among the N answer text samples, that do not match the answer image sample, and M is a preset positive integer;
and/or,
for each answer text sample in the N positive image-text sample pairs, obtaining, based on the N² image-text similarities, the top K target image samples with the highest image-text similarity to the answer text sample, wherein the target image samples are answer image samples, among the N answer image samples, that do not match the answer text sample, and K is a preset positive integer.
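Claims 8 and 9 describe in-batch hard negative mining: compute the N×N similarity matrix between the N positive images and texts, then for each image take the top-M most similar non-matching texts, and/or for each text the top-K most similar non-matching images, as negatives. A numpy sketch, assuming the positive pairs lie on the diagonal of the matrix:

```python
import numpy as np

def mine_hard_negatives(sim: np.ndarray, m: int, k: int):
    """sim: (N, N) image-text similarities, with sim[i, i] the positive pair.

    Returns deduplicated (image_index, text_index) negative pairs per
    claims 8-9; m and k are the preset positive integers M and K.
    """
    masked = sim.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)                # exclude the matched pairs
    negatives = set()
    for i in range(masked.shape[0]):                 # top-M hardest texts per image
        for j in np.argsort(masked[i])[::-1][:m]:
            negatives.add((i, int(j)))
    for j in range(masked.shape[1]):                 # top-K hardest images per text
        for i in np.argsort(masked[:, j])[::-1][:k]:
            negatives.add((int(i), j))
    return sorted(negatives)
```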
10. A correction device, comprising:
the acquisition module is used for acquiring a user answer image corresponding to a target question and a reference answer text corresponding to the target question;
the initial correction module is used for performing text recognition on the user answer image to obtain an initial correction result based on a text recognition result and the reference answer text;
the image-text matching module is used for performing multimodal matching on the user answer image and the reference answer text by using the multimodal matching model obtained by pre-training, to obtain an image-text matching result;
and the target correction module is used for obtaining a target correction result according to the initial correction result and the image-text matching result.
11. An electronic device, comprising:
a processor; and
a memory in which a program is stored,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the correction method according to any one of claims 1-9.
12. A computer readable storage medium, wherein the storage medium stores a computer program for executing the correction method according to any one of the preceding claims 1-9.
CN202311665789.8A 2023-12-06 2023-12-06 Correction method, device, equipment and medium Pending CN117636368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311665789.8A CN117636368A (en) 2023-12-06 2023-12-06 Correction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311665789.8A CN117636368A (en) 2023-12-06 2023-12-06 Correction method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117636368A true CN117636368A (en) 2024-03-01

Family

ID=90035505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311665789.8A Pending CN117636368A (en) 2023-12-06 2023-12-06 Correction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117636368A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination