CN113672716A - Geometric question answering method and model based on deep learning and multi-mode numerical reasoning - Google Patents

Geometric question answering method and model based on deep learning and multi-mode numerical reasoning Download PDF

Info

Publication number
CN113672716A
CN113672716A (application number CN202110982368.2A)
Authority
CN
China
Prior art keywords
text
semantics
program
image
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110982368.2A
Other languages
Chinese (zh)
Inventor
梁小丹
李橦
李奇文
陈嘉奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Original Assignee
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University, Sun Yat Sen University Shenzhen Campus filed Critical Sun Yat Sen University
Priority to CN202110982368.2A priority Critical patent/CN113672716A/en
Publication of CN113672716A publication Critical patent/CN113672716A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a geometric question solving method based on deep learning and multi-modal numerical reasoning, together with a text-image bimodal joint neural network model. The method comprises the following steps: respectively acquiring text information and image information about the question content; encoding the text information into corresponding text hidden states to obtain text semantics, and encoding the image information into corresponding image hidden states to obtain visual semantics; fusing and aligning the text semantics and the visual semantics to obtain a solution program; and calculating an answer result according to the operation mode of the solution program. The invention not only increases answer accuracy but also improves processing efficiency, thereby providing a technique that is accurate, practical, and able to autonomously generate the code sequence of the answer through deep learning.

Description

Geometric question answering method and model based on deep learning and multi-mode numerical reasoning
Technical Field
The invention relates to the technical field of intelligent education, and in particular to a geometric question answering method based on deep learning and multi-modal numerical reasoning and a text-image bimodal joint neural network model.
Background
With the development and popularization of artificial intelligence, artificial intelligence has been applied to various industries, and one application field is intelligent education.
At present, one of the most common applications is intelligent question answering. It works as follows: a user photographs a question, the question content in the picture is recognized, and a large question bank built from massive numbers of questions is searched based on that content to find the corresponding answer.
However, the currently used approach has the following technical problems: the number of related questions is huge, and the solution of a question changes whenever its parameters or data change, so a large number of variant answers are derived. If answer searching is performed only by recognizing the image, the answer returned to a single user has to be screened out of a very large candidate set, which is not conducive to students' extended learning; the amount of data to be processed is large, which increases processing time and reduces processing efficiency; and when questions are similar, incorrect screening easily occurs, which reduces screening accuracy and degrades the user experience.
Disclosure of Invention
To address the above problems, the invention provides a geometric question answering method based on deep learning and multi-modal numerical reasoning and a text-image bimodal joint neural network model.
The embodiment of the invention provides a geometric problem solving method based on deep learning and multi-modal numerical reasoning, which is applied to a text and image bimodal combined neural network model and comprises the following steps:
respectively acquiring text information and image information about the question content;
encoding the text information into a corresponding text hidden state to obtain text semantics, and encoding the image information into a corresponding image hidden state to obtain visual semantics;
fusing and aligning the text semantics and the visual semantics to obtain a solution program;
and calculating an answer result according to the operation mode of the answer program.
In a possible implementation manner of the first aspect, the fusing and aligning the text semantics and the visual semantics to obtain a solution program includes:
respectively coding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
aligning the coding text semantics and the coding visual semantics by using an attention mechanism in deep learning to obtain aligned semantic data;
and inputting the alignment semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
In a possible implementation manner of the first aspect, the calculating a solution result according to an operation manner of the solution program includes:
screening program sequences from the solution program;
acquiring operators and operation data contained in the program sequence;
and calculating according to the program sequence, the operator and the operation data to obtain a solution result.
In one possible implementation manner of the first aspect, the screening of a program sequence from the solution program includes:
decoding the text semantics and the visual semantics by using a preset LSTM decoder to obtain decoded information;
inputting the decoding information into a preset fully-connected layer to obtain an initial state, and connecting the initial state and a preset attention mechanism in series to obtain a decoding hidden state of the LSTM decoder;
inputting the decoding hidden state into a preset fully-connected layer followed by a preset softmax function to predict a plurality of preset sequences;
calculating a probability value for each preset sequence by using a preset negative log-likelihood estimation to obtain a plurality of probability values;
and screening the probability value with the maximum value from the plurality of probability values, and taking a preset sequence corresponding to the probability value with the maximum value as a program sequence.
In a possible implementation manner of the first aspect, the encoding the image information into the corresponding image hidden state to obtain a visual semantic includes:
and calling the first three layers of the residual error neural network trained by the model to encode the image information into a corresponding image hidden state to obtain visual semantics.
In one possible implementation manner of the first aspect, the model training includes puzzle position prediction training, geometric element prediction training, and knowledge point classification training;
the jigsaw position prediction training specifically comprises the following steps: cutting the image of the image information into a plurality of image blocks, randomly cutting one image block and predicting the position information of the cut image block in the image;
the geometric element prediction training specifically comprises the following steps: inputting the image information into a residual error neural network, so that the residual error neural network can predict geometric elements contained in the image information;
the knowledge point classification training specifically comprises the following steps: extracting geometric elements from the text information, taking the geometric elements in the preset question answers as model training labels, deploying an N-way classifier, and training with a binary cross-entropy loss function, wherein N is the total number of the geometric elements.
In a possible implementation manner of the first aspect, the loss function of the model training is as follows:
L_g(θ) = -∑_{t=1}^{T} log p_θ(y_t | y_{<t}, x)
wherein the loss function L_g is the negative log-likelihood estimation of the target program sequence {y_t}, 1 ≤ t ≤ T, and is used for calculating the probability value of each preset sequence;
in the above equation, θ denotes the parameters of the entire NGS structure excluding the diagram encoder, and x is the input consisting of the problem text and the extracted diagram features.
In a possible implementation manner of the first aspect, the encoding the text information into a corresponding text hidden state to obtain a text semantic includes:
converting each word in the text information into a word vector;
inputting each word vector into a preset single-layer non-bidirectional LSTM model to obtain a hidden state corresponding to each word;
and coding the sequence of each hidden state to obtain text semantics.
A second aspect of the embodiments of the present invention provides a neural network model for bimodal combination of text and images, the neural network model is suitable for the geometric problem solution method based on deep learning and multimodal numerical reasoning as described above, and the neural network model includes: a text encoder, an image encoder, a joint reasoning module and a program decoder;
wherein the text encoder, the image encoder and the program decoder are respectively connected with the joint reasoning module;
the text encoder is used for acquiring text information related to the question content and encoding the text information into a corresponding text hidden state to obtain text semantics;
the image encoder is used for acquiring image information related to the question content and encoding the image information into a corresponding image hidden state to obtain visual semantics;
the joint reasoning module is used for fusing and aligning the text semantics and the visual semantics to obtain an answer program;
and the program decoder is used for calculating the solution result according to the operation mode of the solution program.
In one possible implementation of the second aspect, the joint reasoning module comprises 12 self-attention units and 6 guided-attention units;
6 of the self-attention units are used for respectively encoding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
the remaining 6 self-attention units and the 6 guided-attention units are used for applying an attention mechanism in deep learning to make the coded text semantics correspond to the coded visual semantics to obtain aligned semantic data, inputting the aligned semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
Compared with the prior art, the geometric problem solving method and model based on deep learning and multi-modal numerical reasoning provided by the embodiments of the invention have the following beneficial effects: the invention acquires the text information and the image information of the question content, extracts the semantics contained in each, and then fuses and aligns the text semantics with the visual semantics of the image; on this basis it generates a corresponding solution program for the question, and finally calculates the solution result according to the operation mode of the solution program. This not only increases the solving accuracy but also improves the processing efficiency, providing a technique that is accurate, practical, and able to autonomously generate the code sequence of the answer through deep learning.
Drawings
Fig. 1 is a schematic flow chart of a geometric question answering method based on deep learning and multi-modal numerical reasoning according to an embodiment of the present invention;
FIG. 2 is a block diagram of an answer program according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a program code composition of an answer program according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a geometric problem solving system based on deep learning and multi-modal numerical reasoning according to an embodiment of the present invention;
FIG. 5 is a structural diagram of a neural network model with bimodal union of texts and images according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The currently common approach has the following technical problems: the number of related questions is huge, and the solution of a question changes whenever its parameters or data change, so a large number of variant answers are derived. If answer searching is performed only by recognizing the image, the answer returned to a single user has to be screened out of a very large candidate set, which is not conducive to students' extended learning; the amount of data to be processed is large, which increases processing time and reduces processing efficiency; and when questions are similar, incorrect screening easily occurs, which reduces screening accuracy and degrades the user experience.
In order to solve the above problem, a geometric problem solving method based on deep learning and multi-modal numerical reasoning provided by the embodiments of the present application will be described and explained in detail by the following specific embodiments.
Referring to fig. 1, a schematic flow chart of a geometric question answering method based on deep learning and multi-modal numerical reasoning according to an embodiment of the present invention is shown.
In one embodiment, the method is applied to a text-image bimodal joint neural network model.
By way of example, the geometric problem solving method based on deep learning and multi-modal numerical reasoning may include:
s11, respectively obtaining text information and image information about the title content.
The text information is the text information of the title, and the image information is the geometric image of the title.
In practice, the present application may be applied to the solution of geometric subjects, and optionally, to planar geometry and solid geometry.
Specifically, the user can directly input the text information and the image information of the title into the neural network model with the bimodal union of the text and the image, so that the neural network model with the bimodal union of the text and the image can perform corresponding solution operation.
S12, encoding the text information into a corresponding text hidden state to obtain text semantics, and encoding the image information into a corresponding image hidden state to obtain visual semantics.
In order to determine the specific content of the question and perform intelligent learning according to the specific content of the question to generate a correct answer, the text information may be encoded into a corresponding text hidden state to obtain text semantics, and the image information may be encoded into a corresponding image hidden state to obtain visual semantics.
In order to accurately derive text semantics, in an alternative embodiment, step S12 may include the following sub-steps:
substep S121, converting each word in the text information into a word vector.
And a substep S122, inputting each word vector into a preset single-layer non-bidirectional LSTM model to obtain a hidden state corresponding to each word.
And S123, coding the sequence of each hidden state to obtain text semantics.
For example, assume that the text sequence of the text information is P = {x_0, x_1, ..., x_n}. Each word x_i in the text information may be treated as a word vector (word embedding), and a single-layer non-bidirectional LSTM (long short-term memory) network is then used to encode each word embedding, giving the corresponding hidden state h_i. The hidden states of all word embeddings produced by the LSTM are then collected to obtain the hidden state of the whole problem sequence H_P = [h_0; ...; h_n], i.e. the corresponding text semantics.
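As an illustration only, such a text encoder could be sketched as follows in PyTorch; the vocabulary size, embedding dimension and hidden dimension are assumed values and not specified by this embodiment.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Single-layer unidirectional LSTM over word embeddings (illustrative sketch)."""
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                            bidirectional=False, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) indices of the words in the question text
        word_vectors = self.embedding(token_ids)            # each word x_i as a word vector
        hidden_states, (h_n, _) = self.lstm(word_vectors)   # hidden state h_i for every word
        # hidden_states corresponds to H_P = [h_0; ...; h_n], i.e. the text semantics
        return hidden_states, h_n.squeeze(0)
```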
In order to accurately obtain the corresponding visual semantics of the image, in an embodiment, the step S12 may further include the following sub-steps:
and a substep S124 of calling the first three layers of the residual error neural network trained by the model to encode the image information into a corresponding image hidden state to obtain visual semantics.
Specifically, the first three layers of the residual neural network may be the first three layers of the ResNet-101 neural network.
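A minimal sketch of such a diagram encoder follows, assuming "first three layers" refers to the stem plus the first three residual stages of torchvision's ResNet-101; this is an illustrative reading, not the exact patented configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DiagramEncoder(nn.Module):
    """Encodes the geometry diagram with the early stages of ResNet-101 (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101()  # weights would come from the jigsaw/element pre-training
        # Keep the stem plus the first three residual stages (layer1-layer3); drop layer4/avgpool/fc.
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )

    def forward(self, images):
        # images: (batch, 3, H, W) geometry diagrams
        feature_map = self.backbone(images)                 # (batch, 1024, H/16, W/16)
        # Flatten spatial positions into a sequence of visual tokens (the image hidden states).
        batch, channels, h, w = feature_map.shape
        visual_semantics = feature_map.view(batch, channels, h * w).permute(0, 2, 1)
        return visual_semantics                             # (batch, h*w, 1024)
```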
In order to improve the reading capability of the neural network, in one embodiment, the neural network may be subjected to corresponding model training.
Specifically, the model training comprises jigsaw position prediction training, geometric element prediction training and knowledge point classification training;
the jigsaw position prediction training specifically comprises the following steps: and cutting the image of the image information into a plurality of image blocks, randomly cutting one image block and predicting the position information of the cut image block in the image.
Specifically, the jigsaw position prediction training may be carried out as follows in order to pre-train the diagram encoder. For the pixel-level patch position prediction task, the image is split into m × m patches and one patch is randomly selected; the diagram encoder is then trained to predict the correct relative position of the selected patch, producing a cross-entropy loss. If the cross-entropy loss is large, the model can be judged to be under-fitted, and the answer accuracy of the questions given by the model is low; conversely, if the cross-entropy loss is too small, the model can be judged to be over-fitted, i.e. it merely "remembers" the answers of the training set; if the cross-entropy loss lies at an acceptable intermediate value, the model's answers are more accurate.
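A minimal sketch of this jigsaw (patch) position prediction objective is given below; the grid size, the mean-pooling of encoder features and the linear position head are illustrative assumptions, and the diagram encoder is assumed to be the one sketched earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def jigsaw_position_loss(encoder, position_head, image, m=3):
    """Splits the diagram into an m x m grid, picks one patch at random and asks the
    encoder (plus a small position head) to predict its position index (illustrative)."""
    b, c, height, width = image.shape
    ph, pw = height // m, width // m
    target = torch.randint(0, m * m, (b,), device=image.device)   # random patch index per image
    row, col = target // m, target % m
    patches = torch.stack([
        image[i, :, row[i] * ph:(row[i] + 1) * ph, col[i] * pw:(col[i] + 1) * pw]
        for i in range(b)
    ])
    # Resize the cut patch so the encoder sees an input of the usual size.
    feat = encoder(F.interpolate(patches, size=(height, width)))  # (b, tokens, channels)
    logits = position_head(feat.mean(dim=1))                      # e.g. nn.Linear(1024, m * m)
    return F.cross_entropy(logits, target)                        # cross-entropy position loss
```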
The geometric element prediction training specifically comprises the following steps: and inputting the image information into a residual error neural network, so that the residual error neural network can predict the geometric elements contained in the image information.
Specifically, since one graph may include a plurality of geometric elements, in order to improve the prediction accuracy of the geometric elements, the geometric element prediction training is trained in the following manner: geometric elements are first extracted as labels, which can be geometric elements in the topic text and geometric elements in the topic answer. The graph encoder is then trained using an N-way classifier with Binary Cross Entropy (BCE) as a loss function, where N is the number of possible geometric elements on the training set.
In this embodiment, the weight of the loss function of the tile position prediction training and the geometric element prediction training may be set to 1.0.
The knowledge point classification training specifically comprises the following steps: extracting geometric elements from the text information, taking the geometric elements in the preset question answers as model training labels, deploying an N-way classifier, and training with a binary cross-entropy loss function, wherein N is the total number of the geometric elements.
Specifically, in order to increase the model's overall perception of the problem, the knowledge point classification is trained as follows: a data set is preset that summarizes and assembles a plurality of knowledge points and marks each question with one or more knowledge points. Next, the knowledge points of each question are predicted based on the solver output. A K-way classifier with binary cross-entropy (BCE) as the loss function can then be deployed to train this multi-label knowledge point prediction task, where K is the total number of possible knowledge points on the training set.
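Both the geometric-element prediction and the knowledge-point classification described above reduce to multi-label classification with a binary cross-entropy loss; the sketch below shows one possible head, with the label count (N geometric elements or K knowledge points) left as an assumed parameter.

```python
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """N-way (or K-way) multi-label classifier trained with binary cross-entropy (sketch)."""
    def __init__(self, feature_dim=1024, num_labels=50):   # num_labels is illustrative
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_labels)
        self.criterion = nn.BCEWithLogitsLoss()

    def forward(self, pooled_features, labels):
        # pooled_features: (batch, feature_dim) pooled output of the diagram or text encoder
        # labels: (batch, num_labels) multi-hot vector of geometric elements or knowledge points
        logits = self.classifier(pooled_features)
        return self.criterion(logits, labels.float())
```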
After the training tasks are completed, the auxiliary model can be better trained, each training task respectively improves the capability of each part of the model, and the capability of the model for searching geometric elements in the picture can be enhanced through jigsaw position prediction training; the geometric element prediction training can enhance the comprehension capability of the model on the topic; knowledge point classification training can enhance the ability of the model to use correct problem solving formulas when solving problems.
Specifically, the loss function of the model training is shown as follows:
L_g(θ) = -∑_{t=1}^{T} log p_θ(y_t | y_{<t}, x)
wherein the loss function L_g is the negative log-likelihood estimation of the target program sequence {y_t}, 1 ≤ t ≤ T, and is used for subsequently calculating the probability value of each preset sequence;
in the above equation, θ denotes the parameters of the entire NGS structure excluding the diagram encoder, and x is the input consisting of the problem text and the extracted diagram features.
And S13, fusing and aligning the text semantics and the visual semantics to obtain a solution program.
For solving multiple-choice geometry problems, it is important to jointly understand the semantics of the problem text and of the diagram and to align the semantic information. Aligning the two kinds of semantics makes it possible to determine the content of the question, so that a corresponding solution program can be generated to obtain the correct answer.
In an alternative embodiment, an attention mechanism can be adopted to perform the propagation and aggregation of the two kinds of semantics, and the reasoning module finally outputs the operation program of the solution by combining the text information and the image information.
To facilitate the alignment and fusion of the two semantics, in one embodiment, step S13 may include the following sub-steps:
and a substep S131, respectively encoding the text semantics and the visual semantics and outputting encoded text semantics and encoded visual semantics representing a hidden state.
And a substep S132, aligning the coding text semantics and the coding visual semantics by using an attention mechanism in deep learning to obtain alignment semantic data.
And a substep S133 of inputting the alignment semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
In actual practice, 12 self-attention units and 6 guided-attention units may be provided. First, 6 self-attention units (i.e., standard Transformer blocks) encode the text semantics and the visual semantics; the coded text semantics and coded visual semantics in the final hidden state output by the 6th self-attention unit are then taken as guidance information, which enables full fusion and alignment of the semantic representation of the problem text with the visual semantics of its diagram.
In an optional embodiment, an attention mechanism in deep learning can be used by these 6 self-attention units in the process of fusing and aligning the text semantics and the visual semantics; the mechanism makes the input text representation correspond to the input visual representation, so that the text semantic representation and the visual semantic representation are aligned, further improving the model's capability and accuracy in solving problems.
Specifically, the remaining 6 self-attention units and the 6 guided-attention units can be stacked with each other and jointly output the aligned and fused data, which strengthens the alignment and fusion of the text semantic representation and the visual semantic representation, enables the model to better understand the textual and visual information of the question, and yields a higher accuracy when the two representations are combined to solve the question.
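For illustration, one self-attention unit and one guided-attention unit could look as follows, built from standard Transformer components; the hidden size, number of heads and normalization placement are assumptions rather than values fixed by this embodiment.

```python
import torch
import torch.nn as nn

class SelfAttentionUnit(nn.Module):
    """Standard Transformer-style self-attention block (illustrative sketch)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.ffn(x))

class GuidedAttentionUnit(nn.Module):
    """Guided attention: visual tokens attend to the encoded text semantics (illustrative sketch)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, visual, text):
        visual = self.norm1(visual + self.self_attn(visual, visual, visual)[0])
        # Queries come from the visual tokens, keys/values from the encoded text semantics.
        visual = self.norm2(visual + self.cross_attn(visual, text, text)[0])
        return self.norm3(visual + self.ffn(visual))
```

Stacking six such guided-attention units on top of six self-attention units would mirror the arrangement described above.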
In one embodiment, the present application also introduces two multilayer perceptrons: an attentional reduction network with two multilayer perceptrons may be applied to aggregate the features.
In particular, the aligned semantic data may be input into the two multilayer perceptrons to output an aggregated feature F_D, which is a multimodal feature vector. Optionally, the aggregated feature F_D may be concatenated with the last hidden state h_n of the text encoder to obtain the final encoder state [F_D; h_n], which is used as the final aggregated multimodal feature vector in the subsequent solving procedure.
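A minimal sketch of this attentional reduction step follows: the first MLP scores each fused token to form an attention distribution, and the second projects the pooled vector; this split of roles between the two multilayer perceptrons is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalReduction(nn.Module):
    """Aggregates a token sequence into one vector with an MLP-predicted attention
    distribution, then projects it with a second MLP (illustrative sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.score_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.project_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, fused_tokens):
        # fused_tokens: (batch, num_tokens, dim) aligned semantic data from the attention stack
        weights = F.softmax(self.score_mlp(fused_tokens), dim=1)   # (batch, num_tokens, 1)
        pooled = (weights * fused_tokens).sum(dim=1)               # (batch, dim)
        return self.project_mlp(pooled)                            # aggregated feature F_D
```

The resulting vector corresponds to the aggregated feature F_D described above, which may then be concatenated with the last text-encoder hidden state h_n before being handed to the program decoder.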
Referring to fig. 2, a schematic diagram of the components of the solution program according to an embodiment of the present invention is shown.
In an alternative embodiment, the guidance information, when output, carries rich information relating words in the question to elements of the diagram; the output information contains corresponding textual and visual information, for example, the word "triangle" in the question text corresponds to a triangle element in the question image.
It should be noted that, besides common mathematical operations, the solution program may also include operations representing theorems and formula knowledge, such as the Pythagorean theorem or the area of a circular ring, so as to better solve geometric problems. Some common or simple geometric formulas need not be additionally defined; for example, for a square with side length a, its area can be directly calculated by Multiply(a, a).
The interpretability may also be reflected in the sequential course of operations, selected constants and variables, and the application of theorems and formulas when outputting the solution program. As shown in fig. 2, the user can have a rough understanding of the entire problem solving process after reading the program.
Referring to fig. 3, a schematic diagram of the program symbol composition of the solution program according to an embodiment of the present invention is shown. In one exemplary implementation, a new domain-specific language can be designed to model the precise operation program corresponding to a geometric problem. For example, the vocabulary of the solution program may include operators OP, constants Const, problem variables N appearing in the text and image of the geometry question, and process variables V produced during execution. As shown in fig. 3, the operators OP are divided into several categories, including basic operations, arithmetic operations, trigonometric functions, and theorems and formulas. Each operator OP takes n constant or variable elements. The constants Const are predefined values commonly used in geometric problems, such as pi or the 90 degrees of a right angle. The problem variables N depend on the specific question, while the process variables V depend on the specific operation procedure.
Referring to fig. 3, the operators and constants defined herein may include: basic operators: assignment, multiplication by 2, and division by 2; arithmetic operators: addition, subtraction, multiplication, and division; trigonometric functions: sin, cos, tan, arcsin, arccos; theorems and formulas: the Pythagorean theorem formulas for solving the hypotenuse and the shorter leg, the area of a circle, the circumference, the area of a cone, and proportions; constants: 30°, 60°, 90°, 180°, 360°, pi, 0.618. Through the combination of these operators and constants, computation can be carried out directly according to the solution program, so that the solution result can be obtained quickly and conveniently.
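To make the operator/constant/variable vocabulary concrete, the following sketch interprets a flat program sequence; the operator names, constant names and the prefix-style program layout are illustrative assumptions, not the exact symbols of fig. 3.

```python
import math

# Illustrative subset of the operator and constant vocabulary.
OPERATORS = {
    "Add":      lambda a, b: a + b,
    "Minus":    lambda a, b: a - b,
    "Multiply": lambda a, b: a * b,
    "Divide":   lambda a, b: a / b,
    "Half":     lambda a: a / 2,
    "Double":   lambda a: a * 2,
    "Sin":      lambda a: math.sin(math.radians(a)),
    # Theorem/formula operators beyond basic arithmetic:
    "PythagoreanHypotenuse": lambda a, b: math.hypot(a, b),
    "CircleArea":            lambda r: math.pi * r * r,
}
CONSTANTS = {"pi": math.pi, "90": 90.0, "0.618": 0.618}

def execute_program(tokens, problem_numbers):
    """Executes a flat prefix-style program such as ["Multiply", "N_0", "N_0"]
    (area of a square with side N_0). N_i refers to the i-th number in the question,
    V_i to an intermediate result produced while running the program."""
    memory = []                                   # process variables V_0, V_1, ...
    i = 0
    while i < len(tokens):
        op = OPERATORS[tokens[i]]
        arity = op.__code__.co_argcount
        args = []
        for tok in tokens[i + 1:i + 1 + arity]:
            if tok.startswith("N_"):
                args.append(problem_numbers[int(tok[2:])])
            elif tok.startswith("V_"):
                args.append(memory[int(tok[2:])])
            else:
                args.append(CONSTANTS[tok])
        memory.append(op(*args))
        i += 1 + arity
    return memory[-1]

# Example: a square with side 3 -> Multiply(N_0, N_0) = 9.0
print(execute_program(["Multiply", "N_0", "N_0"], [3.0]))
```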
And S14, calculating the answer result according to the operation mode of the answer program.
Referring to fig. 3, in actual operation, the final solution result may be calculated from the solution program according to its operators and values.
In order to quickly calculate the solution result, step S14 may include the following sub-steps, as an example:
substep S141, filter the program sequence from the solution program.
Since a question can usually be solved in several ways, multiple candidate solution programs may exist. In order to select the most convenient and accurate one, in an embodiment, the sub-step S141 may include the following sub-steps:
and a substep S1411, decoding the text semantics and the visual semantics by using a preset LSTM decoder to obtain decoded information.
And a substep S1412, inputting the decoding information to a preset full connection layer to obtain an initial state, and obtaining a decoding hidden state of the LSTM decoder by using the initial state and a preset attention mechanism in series.
Sub-step S1413, inputting the decoding hidden state into a preset fully-connected layer followed by a preset softmax function to predict a plurality of preset sequences.
And a substep S1414, calculating a probability value for each preset sequence by using a preset negative log-likelihood estimation to obtain a plurality of probability values.
And a substep S1415 of screening the probability value with the maximum value from the plurality of probability values, and taking a preset sequence corresponding to the probability value with the maximum value as a program sequence.
Specifically, the decoding procedure uses an LSTM decoder. Denote by {y_t}, 1 ≤ t ≤ T, the target program to be generated, and by s_t the hidden state of the LSTM at time t. The multimodal feature vector is input to a linear layer to obtain the initial state s_0. At each step, s_t is concatenated with the above fused result and input into a linear layer followed by a softmax function to predict the distribution of the next program token P_t. The linear layer here is a fully-connected layer, P denotes a program sequence, and P_t is its t-th token. During training, the prediction is constrained by the negative log-likelihood (NLL) loss of the target program; during testing, the probability distribution over all candidate program sequences is computed and the one with the highest probability is selected, thereby obtaining the corresponding program sequence.
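The decoding step described above could be sketched as follows: an LSTM cell whose initial state s_0 comes from the multimodal feature vector, whose hidden state is combined with attention over the fused features, and whose output passes through a fully-connected layer and softmax (applied implicitly inside the cross-entropy) to score the next program token. All dimensions and the single-head attention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgramDecoder(nn.Module):
    """LSTM decoder that predicts the solution-program token sequence (illustrative sketch)."""
    def __init__(self, vocab_size=100, embed_dim=256, hidden_dim=512, feature_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.init_fc = nn.Linear(feature_dim, hidden_dim)     # multimodal feature -> initial state s_0
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)
        self.output_fc = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, multimodal_feature, fused_tokens, target_tokens):
        # multimodal_feature: (batch, feature_dim); fused_tokens: (batch, num_tokens, hidden_dim)
        # target_tokens: (batch, T) gold program sequence {y_t}, starting with a BOS token
        batch, T = target_tokens.shape
        h = self.init_fc(multimodal_feature)                  # s_0
        c = torch.zeros_like(h)
        nll = 0.0
        for t in range(T):
            prev = self.embedding(target_tokens[:, t])        # teacher forcing
            h, c = self.cell(prev, (h, c))
            context, _ = self.attn(h.unsqueeze(1), fused_tokens, fused_tokens)
            logits = self.output_fc(torch.cat([h, context.squeeze(1)], dim=-1))
            if t + 1 < T:
                # negative log-likelihood of the next gold token y_{t+1}
                nll = nll + F.cross_entropy(logits, target_tokens[:, t + 1])
        return nll
```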
And a substep S142, obtaining operators and operational data contained in the program sequence.
And a substep S143 of calculating according to the program sequence, the operator and the operational data to obtain a solution result.
In addition, it should be noted that once a complete solution program has been decoded, each operator in the program is executed in sequence to obtain a numerical result. After beam search generates the top N candidate programs {g_1, ..., g_n}, each program is executed and computed step by step. If g_i contains a grammatical error (e.g., the number of parameters does not match the current operator) or the executed value does not match any option of the current question, the execution fails. The first successfully executed program is taken as the predicted solution, and the corresponding option is selected. If all N programs fail, the executor reports "no result" directly rather than guessing an option.
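Assuming an execute_program helper like the one sketched earlier and numeric answer options, this candidate-selection logic could look as follows; the matching tolerance is an illustrative choice.

```python
def select_answer(candidate_programs, problem_numbers, choices, tol=1e-2):
    """Executes the top-N beam candidates in order and returns the first one whose
    result matches an answer option; reports 'no result' if all candidates fail."""
    for program in candidate_programs:                 # ordered by decreasing probability
        try:
            value = execute_program(program, problem_numbers)
        except Exception:                              # grammatical error, e.g. arity mismatch
            continue
        for option, option_value in choices.items():
            if abs(value - option_value) < tol:
                return option, program                 # first successfully executed program
    return "no result", None

# Example: candidates for "area of a square with side 3", options A and B.
print(select_answer([["Multiply", "N_0", "N_0"]], [3.0], {"A": 6.0, "B": 9.0}))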
In summary, the embodiment of the present invention provides a geometric problem solving method based on deep learning and multi-modal numerical reasoning, which has the following beneficial effects: the invention acquires the text information and the image information of the question content, extracts the semantics contained in each, and then fuses and aligns the text semantics with the visual semantics of the image; on this basis it generates a corresponding solution program for the question, and finally calculates the solution result according to the operation mode of the solution program. This not only increases the solving accuracy but also improves the processing efficiency, providing a technique that is accurate, practical, and able to autonomously generate the code sequence of the answer through deep learning.
Referring to fig. 4, a schematic structural diagram of a geometric problem solving system based on deep learning and multi-modal numerical reasoning according to an embodiment of the present invention is shown.
The system is applied to a neural network model with bimodal union of text and images.
By way of example, the geometric problem solving system based on deep learning and multi-modal numerical reasoning can comprise:
an obtaining module 401, configured to obtain text information and image information about the question content respectively;
an encoding module 402, configured to encode the text information into a corresponding text hidden state to obtain a text semantic, and encode the image information into a corresponding image hidden state to obtain a visual semantic;
a fusion and alignment module 403, configured to fuse and align the text semantics and the visual semantics to obtain an answer program;
and a calculating module 404, configured to calculate an answer result according to an operation manner of the answer program.
Optionally, the fusion and alignment module is further configured to:
respectively coding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
aligning the coding text semantics and the coding visual semantics by using an attention mechanism in deep learning to obtain aligned semantic data;
and inputting the alignment semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
Optionally, the computing module is further configured to:
screening program sequences from the solution program;
acquiring operators and operation data contained in the program sequence;
and calculating according to the program sequence, the operator and the operation data to obtain a solution result.
Optionally, the computing module is further configured to:
decoding the text semantics and the visual semantics by using a preset LSTM decoder to obtain decoded information;
inputting the decoding information into a preset fully-connected layer to obtain an initial state, and connecting the initial state and a preset attention mechanism in series to obtain a decoding hidden state of the LSTM decoder;
inputting the decoding hidden state into a preset fully-connected layer followed by a preset softmax function to predict a plurality of preset sequences;
calculating a probability value for each preset sequence by using a preset negative log-likelihood estimation to obtain a plurality of probability values;
and screening the probability value with the maximum value from the plurality of probability values, and taking a preset sequence corresponding to the probability value with the maximum value as a program sequence.
Optionally, the encoding module is further configured to:
and calling the first three layers of the model-trained residual neural network to encode the image information into a corresponding image hidden state to obtain visual semantics.
Optionally, the model training comprises puzzle position prediction training, geometric element prediction training and knowledge point classification training;
the jigsaw position prediction training specifically comprises the following steps: cutting the image of the image information into a plurality of image blocks, randomly cutting one image block and predicting the position information of the cut image block in the image;
the geometric element prediction training specifically comprises the following steps: inputting the image information into a residual error neural network, so that the residual error neural network can predict geometric elements contained in the image information;
the knowledge point classification training specifically comprises the following steps: extracting geometric elements from the text information, taking the geometric elements in the preset question answers as model training labels, deploying an N-way classifier, and training with a binary cross-entropy loss function, wherein N is the total number of the geometric elements.
Optionally, the loss function of the model training is as follows:
L_g(θ) = -∑_{t=1}^{T} log p_θ(y_t | y_{<t}, x)
wherein the loss function L_g is the negative log-likelihood estimation of the target program sequence {y_t}, 1 ≤ t ≤ T, and is used for calculating the probability value of each preset sequence;
in the above equation, θ denotes the parameters of the entire NGS structure excluding the diagram encoder, and x is the input consisting of the problem text and the extracted diagram features.
Optionally, the encoding module is further configured to:
converting each word in the text information into a word vector;
inputting each word vector into a preset single-layer non-bidirectional LSTM model to obtain a hidden state corresponding to each word;
and coding the sequence of each hidden state to obtain text semantics.
The embodiment of the invention also provides a neural network model with bimodal combination of texts and images, and referring to fig. 5, a schematic structural diagram of the neural network model with bimodal combination of texts and images provided by the embodiment of the invention is shown.
The neural network model is suitable for the geometric problem solving method based on deep learning and multi-modal numerical reasoning as described above,
by way of example, the neural network model of bimodal union of text and image may include: a text encoder, an image encoder, a joint reasoning module and a program decoder;
wherein the text encoder, the image encoder and the program decoder are respectively connected with the joint reasoning module;
the text encoder is used for acquiring text information related to the question content and encoding the text information into a corresponding text hidden state to obtain text semantics;
the image encoder is used for acquiring image information related to the question content and encoding the image information into a corresponding image hidden state to obtain visual semantics;
the joint reasoning module is used for fusing and aligning the text semantics and the visual semantics to obtain an answer program;
and the program decoder is used for calculating the solution result according to the operation mode of the solution program.
Optionally, the joint reasoning module comprises 12 self-attention units and 6 guided-attention units;
6 of the self-attention units are used for respectively encoding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
the remaining 6 self-attention units and the 6 guided-attention units are used for applying an attention mechanism in deep learning to make the coded text semantics correspond to the coded visual semantics to obtain aligned semantic data, inputting the aligned semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
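For orientation only, the four components could be wired together roughly as follows, reusing the illustrative classes sketched in the method description; the exact stacking of the 12 self-attention and 6 guided-attention units in this embodiment may differ from this simplified composition.

```python
import torch.nn as nn

class GeometryQAModel(nn.Module):
    """Text-image bimodal joint model: encoders -> joint reasoning -> program decoder (sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.text_encoder = TextEncoder(hidden_dim=dim)
        self.image_encoder = DiagramEncoder()
        self.visual_proj = nn.Linear(1024, dim)                 # project ResNet features to dim
        self.self_attn_stack = nn.ModuleList([SelfAttentionUnit(dim) for _ in range(6)])
        self.guided_stack = nn.ModuleList([GuidedAttentionUnit(dim) for _ in range(6)])
        self.reduce = AttentionalReduction(dim)
        self.decoder = ProgramDecoder(hidden_dim=dim, feature_dim=dim)

    def forward(self, token_ids, images, target_program):
        text, _ = self.text_encoder(token_ids)                  # text semantics
        visual = self.visual_proj(self.image_encoder(images))   # visual semantics
        for unit in self.self_attn_stack:                       # encode the text semantics
            text = unit(text)
        for unit in self.guided_stack:                          # fuse/align with guided attention
            visual = unit(visual, text)
        multimodal = self.reduce(visual)                        # aggregated multimodal feature
        return self.decoder(multimodal, visual, target_program) # NLL loss of the solution program
```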
Further, an embodiment of the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the geometric problem solving method based on deep learning and multi-modal numerical reasoning as described above.
Further, the present application provides a computer-readable storage medium, which stores computer-executable instructions for causing a computer to execute the geometric problem solving method based on deep learning and multi-modal numerical reasoning as described in the above embodiments.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A geometric problem solving method based on deep learning and multi-modal numerical reasoning is characterized in that the method is applied to a neural network model with bimodal combination of texts and images, and the method comprises the following steps:
respectively acquiring text information and image information about the question content;
encoding the text information into a corresponding text hidden state to obtain text semantics, and encoding the image information into a corresponding image hidden state to obtain visual semantics;
fusing and aligning the text semantics and the visual semantics to obtain a solution program;
and calculating an answer result according to the operation mode of the answer program.
2. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 1, wherein the fusing and aligning the text semantics and the visual semantics to obtain a solving program comprises:
respectively coding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
aligning the coding text semantics and the coding visual semantics by using an attention mechanism in deep learning to obtain aligned semantic data;
and inputting the alignment semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
3. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 1, wherein the calculating a solution result according to the operation manner of the solution program comprises:
screening program sequences from the solution program;
acquiring operators and operation data contained in the program sequence;
and calculating according to the program sequence, the operator and the operation data to obtain a solution result.
4. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 3, wherein the screening of a program sequence from the solution program comprises:
decoding the text semantics and the visual semantics by using a preset LSTM decoder to obtain decoded information;
inputting the decoding information into a preset fully-connected layer to obtain an initial state, and connecting the initial state and a preset attention mechanism in series to obtain a decoding hidden state of the LSTM decoder;
inputting the decoding hidden state into a preset fully-connected layer followed by a preset softmax function to predict a plurality of preset sequences;
calculating a probability value for each preset sequence by using a preset negative log-likelihood estimation to obtain a plurality of probability values;
and screening the probability value with the maximum value from the plurality of probability values, and taking a preset sequence corresponding to the probability value with the maximum value as a program sequence.
5. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 1, wherein the encoding the image information into the corresponding image hidden state to obtain visual semantics comprises:
and calling the first three layers of the model-trained residual neural network to encode the image information into a corresponding image hidden state to obtain visual semantics.
6. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 5, wherein the model training comprises jigsaw position prediction training, geometric element prediction training and knowledge point classification training;
the jigsaw position prediction training specifically comprises the following steps: cutting the image of the image information into a plurality of image blocks, randomly cutting one image block and predicting the position information of the cut image block in the image;
the geometric element prediction training specifically comprises the following steps: inputting the image information into a residual error neural network, so that the residual error neural network can predict geometric elements contained in the image information;
the knowledge point classification training specifically comprises the following steps: extracting geometric elements from the text information, taking the geometric elements in the preset question answers as model training labels, deploying an N-way classifier, and training with a binary cross-entropy loss function, wherein N is the total number of the geometric elements.
7. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 6, wherein the loss function of the model training is as follows:
L_g(θ) = -∑_{t=1}^{T} log p_θ(y_t | y_{<t}, x)
wherein the loss function L_g is the negative log-likelihood estimation of the target program sequence {y_t}, 1 ≤ t ≤ T, and is used for calculating the probability value of each preset sequence;
in the above equation, θ denotes the parameters of the entire NGS structure excluding the diagram encoder, and x is the input consisting of the problem text and the extracted diagram features.
8. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 1, wherein the encoding the text information into the corresponding text hidden state to obtain the text semantics comprises:
converting each word in the text information into a word vector;
inputting each word vector into a preset single-layer non-bidirectional LSTM model to obtain a hidden state corresponding to each word;
and coding the sequence of each hidden state to obtain text semantics.
9. A neural network model for bimodal union of text and images, the neural network model being suitable for the method for solving geometric problems based on deep learning and multimodal numerical reasoning according to any one of claims 1 to 8, the neural network model comprising: a text encoder, an image encoder, a joint reasoning module and a program decoder;
wherein the text encoder, the image encoder and the program decoder are respectively connected with the joint reasoning module;
the text encoder is used for acquiring text information related to the question content and encoding the text information into a corresponding text hidden state to obtain text semantics;
the image encoder is used for acquiring image information related to the question content and encoding the image information into a corresponding image hidden state to obtain visual semantics;
the joint reasoning module is used for fusing and aligning the text semantics and the visual semantics to obtain an answer program;
and the program decoder is used for calculating the solution result according to the operation mode of the solution program.
10. The text and image bimodal joint neural network model of claim 9, wherein the joint reasoning module includes 12 self-attention units and 6 guided-attention units;
6 of the self-attention units are used for respectively encoding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
the remaining 6 self-attention units and the 6 guided-attention units are used for applying an attention mechanism in deep learning to make the coded text semantics correspond to the coded visual semantics to obtain aligned semantic data, inputting the aligned semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
CN202110982368.2A 2021-08-25 2021-08-25 Geometric question answering method and model based on deep learning and multi-mode numerical reasoning Pending CN113672716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110982368.2A CN113672716A (en) 2021-08-25 2021-08-25 Geometric question answering method and model based on deep learning and multi-mode numerical reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110982368.2A CN113672716A (en) 2021-08-25 2021-08-25 Geometric question answering method and model based on deep learning and multi-mode numerical reasoning

Publications (1)

Publication Number Publication Date
CN113672716A true CN113672716A (en) 2021-11-19

Family

ID=78546252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110982368.2A Pending CN113672716A (en) 2021-08-25 2021-08-25 Geometric question answering method and model based on deep learning and multi-mode numerical reasoning

Country Status (1)

Country Link
CN (1) CN113672716A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780775A (en) * 2022-04-24 2022-07-22 西安交通大学 Image description text generation method based on content selection and guide mechanism
CN114861889A (en) * 2022-07-04 2022-08-05 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN116071835A (en) * 2023-04-07 2023-05-05 平安银行股份有限公司 Face recognition attack post screening method and device and electronic equipment
CN117633643A (en) * 2024-01-26 2024-03-01 江西师范大学 Automatic middle school geometric problem solving method based on contrast learning
CN117726721A (en) * 2024-02-08 2024-03-19 湖南君安科技有限公司 Image generation method, device and medium based on theme drive and multi-mode fusion
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data
CN113656570A (en) * 2021-08-25 2021-11-16 平安科技(深圳)有限公司 Visual question answering method and device based on deep learning model, medium and equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data
CN113656570A (en) * 2021-08-25 2021-11-16 平安科技(深圳)有限公司 Visual question answering method and device based on deep learning model, medium and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIAQI CHEN等: "GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning", 《HTTPS://ARXIV.ORG/PDF/2105.14517V1.PDF》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780775A (en) * 2022-04-24 2022-07-22 西安交通大学 Image description text generation method based on content selection and guide mechanism
CN114861889A (en) * 2022-07-04 2022-08-05 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN114861889B (en) * 2022-07-04 2022-09-27 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN116071835A (en) * 2023-04-07 2023-05-05 平安银行股份有限公司 Face recognition attack post screening method and device and electronic equipment
CN116071835B (en) * 2023-04-07 2023-06-20 平安银行股份有限公司 Face recognition attack post screening method and device and electronic equipment
CN117633643A (en) * 2024-01-26 2024-03-01 江西师范大学 Automatic middle school geometric problem solving method based on contrast learning
CN117633643B (en) * 2024-01-26 2024-05-14 江西师范大学 Automatic middle school geometric problem solving method based on contrast learning
CN117726721A (en) * 2024-02-08 2024-03-19 湖南君安科技有限公司 Image generation method, device and medium based on theme drive and multi-mode fusion
CN117726721B (en) * 2024-02-08 2024-04-30 湖南君安科技有限公司 Image generation method, device and medium based on theme drive and multi-mode fusion
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113672716A (en) Geometric question answering method and model based on deep learning and multi-mode numerical reasoning
CN112613303B (en) Knowledge distillation-based cross-modal image aesthetic quality evaluation method
Chen et al. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning
CN110825875B (en) Text entity type identification method and device, electronic equipment and storage medium
CN114511860B (en) Difference description statement generation method, device, equipment and medium
CN113656570A (en) Visual question answering method and device based on deep learning model, medium and equipment
CN110825867B (en) Similar text recommendation method and device, electronic equipment and storage medium
CN115829033B (en) Mathematic application question knowledge construction and solution method, system, equipment and storage medium
CN111160606B (en) Test question difficulty prediction method and related device
CN113282713A (en) Event trigger detection method based on difference neural representation model
CN111694935A (en) Multi-turn question and answer emotion determining method and device, computer equipment and storage medium
CN111126610A (en) Topic analysis method, topic analysis device, electronic device and storage medium
CN110765241B (en) Super-outline detection method and device for recommendation questions, electronic equipment and storage medium
CN114297399A (en) Knowledge graph generation method, knowledge graph generation system, storage medium and electronic equipment
CN111784048B (en) Test question difficulty prediction method and device, electronic equipment and storage medium
CN117421410A (en) Text matching method and device in question-answering system
US20240037336A1 (en) Methods, systems, and media for bi-modal understanding of natural languages and neural architectures
CN114358579A (en) Evaluation method, evaluation device, electronic device, and computer-readable storage medium
CN113010662B (en) Hierarchical conversational machine reading understanding system and method
CN114707518A (en) Semantic fragment-oriented target emotion analysis method, device, equipment and medium
CN112818688A (en) Text processing method, device, equipment and storage medium
CN114238587A (en) Reading understanding method and device, storage medium and computer equipment
CN115510199A (en) Data processing method, device and system
CN113505602A (en) Intelligent marking method and device suitable for judicial examination subjective questions and electronic equipment
CN117633643B (en) Automatic middle school geometric problem solving method based on contrast learning

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination
RJ01 - Rejection of invention patent application after publication
Application publication date: 20211119