CN113672716A - Geometric question answering method and model based on deep learning and multi-mode numerical reasoning - Google Patents

Geometric question answering method and model based on deep learning and multi-mode numerical reasoning Download PDF

Info

Publication number
CN113672716A
CN113672716A (application number CN202110982368.2A)
Authority
CN
China
Prior art keywords
text
semantics
program
image
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110982368.2A
Other languages
Chinese (zh)
Inventor
梁小丹
李橦
李奇文
陈嘉奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Original Assignee
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University, Sun Yat Sen University Shenzhen Campus filed Critical Sun Yat Sen University
Priority to CN202110982368.2A priority Critical patent/CN113672716A/en
Publication of CN113672716A publication Critical patent/CN113672716A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a geometric question solving method based on deep learning and multi-modal numerical reasoning, together with a text-image bimodal joint neural network model. The method comprises the following steps: respectively acquiring text information and image information about the question content; encoding the text information into corresponding text hidden states to obtain text semantics, and encoding the image information into corresponding image hidden states to obtain visual semantics; fusing and aligning the text semantics and the visual semantics to obtain a solution program; and calculating an answer result according to the operation mode of the solution program. The invention not only increases answer accuracy but also improves processing efficiency, thereby providing a technique that is accurate, practical, and able to autonomously generate the code sequence of the answer through deep learning.

Description

Geometric question answering method and model based on deep learning and multi-mode numerical reasoning
Technical Field
The invention relates to the technical field of intelligent education, and in particular to a geometric question answering method based on deep learning and multi-modal numerical reasoning and a text-image bimodal joint neural network model.
Background
With the development and popularization of artificial intelligence, artificial intelligence has been applied to various industries, and one application field is intelligent education.
At present, one of the most common applications is intelligent question answering. It works as follows: a user photographs a question, the question content in the picture is recognized, and a large question bank built from massive numbers of questions is searched based on that content to find the corresponding answer.
However, the currently used approach has the following technical problems: the number of related questions is huge, and the solution of a question changes whenever its parameters or data change, so a large number of variant answers are derived. If answer searching is performed only by recognizing the image, the answer returned to a single user has to be screened out of a very large candidate set, which is not conducive to students' extended learning; the amount of data to be processed is large, which increases processing time and reduces processing efficiency; and when questions are similar, incorrect screening easily occurs, which reduces screening accuracy and degrades the user experience.
Disclosure of Invention
To address the above problems, the invention provides a geometric question answering method based on deep learning and multi-modal numerical reasoning and a text-image bimodal joint neural network model.
The embodiment of the invention provides a geometric problem solving method based on deep learning and multi-modal numerical reasoning, which is applied to a text and image bimodal combined neural network model and comprises the following steps:
respectively acquiring text information and image information about the question content;
encoding the text information into a corresponding text hidden state to obtain text semantics, and encoding the image information into a corresponding image hidden state to obtain visual semantics;
fusing and aligning the text semantics and the visual semantics to obtain a solution program;
and calculating an answer result according to the operation mode of the answer program.
In a possible implementation manner of the first aspect, the fusing and aligning the text semantics and the visual semantics to obtain a solution program includes:
respectively coding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
aligning the coding text semantics and the coding visual semantics by using an attention mechanism in deep learning to obtain aligned semantic data;
and inputting the alignment semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
In a possible implementation manner of the first aspect, the calculating a solution result according to an operation manner of the solution program includes:
screening program sequences from the solution program;
acquiring operators and operation data contained in the program sequence;
and calculating according to the program sequence, the operator and the operation data to obtain a solution result.
In one possible implementation manner of the first aspect, the screening of a program sequence from the solution program includes:
decoding the text semantics and the visual semantics by using a preset LSTM decoder to obtain decoded information;
inputting the decoding information into a preset fully-connected layer to obtain an initial state, and connecting the initial state and a preset attention mechanism in series to obtain a decoding hidden state of the LSTM decoder;
inputting the decoding hidden state into a preset fully-connected layer followed by a preset softmax function to predict a plurality of preset sequences;
calculating a probability value for each preset sequence by using a preset negative log-likelihood estimation to obtain a plurality of probability values;
and screening the probability value with the maximum value from the plurality of probability values, and taking a preset sequence corresponding to the probability value with the maximum value as a program sequence.
In a possible implementation manner of the first aspect, the encoding the image information into the corresponding image hidden state to obtain a visual semantic includes:
and calling the first three layers of the residual error neural network trained by the model to encode the image information into a corresponding image hidden state to obtain visual semantics.
In one possible implementation manner of the first aspect, the model training includes puzzle position prediction training, geometric element prediction training, and knowledge point classification training;
the jigsaw position prediction training specifically comprises the following steps: cutting the image of the image information into a plurality of image blocks, randomly cutting one image block and predicting the position information of the cut image block in the image;
the geometric element prediction training specifically comprises the following steps: inputting the image information into a residual error neural network, so that the residual error neural network can predict geometric elements contained in the image information;
the knowledge point classification training specifically comprises the following steps: extracting geometric elements from the text information, taking the geometric elements in the preset question answers as model training labels, deploying an N-way classifier, and training with a binary cross-entropy loss function, wherein N is the total number of the geometric elements.
In a possible implementation manner of the first aspect, the loss function of the model training is as follows:
L_g(θ) = -∑_{t=1}^{T} log p_θ(y_t | y_{<t}, x)
wherein the loss function L_g is the negative log-likelihood estimation of the target program sequence {y_t}, 1 ≤ t ≤ T, and is used for calculating the probability value of each preset sequence;
in the above equation, θ denotes the parameters of the entire NGS structure excluding the diagram encoder, and x is the input consisting of the problem text and the extracted diagram features.
In a possible implementation manner of the first aspect, the encoding the text information into a corresponding text hidden state to obtain a text semantic includes:
converting each word in the text information into a word vector;
inputting each word vector into a preset single-layer non-bidirectional LSTM model to obtain a hidden state corresponding to each word;
and coding the sequence of each hidden state to obtain text semantics.
A second aspect of the embodiments of the present invention provides a neural network model for bimodal combination of text and images, the neural network model is suitable for the geometric problem solution method based on deep learning and multimodal numerical reasoning as described above, and the neural network model includes: a text encoder, an image encoder, a joint reasoning module and a program decoder;
wherein the text encoder, the image encoder and the program decoder are respectively connected with the joint reasoning module;
the text encoder is used for acquiring text information related to the question content and encoding the text information into a corresponding text hidden state to obtain text semantics;
the image encoder is used for acquiring image information related to the question content and encoding the image information into a corresponding image hidden state to obtain visual semantics;
the joint reasoning module is used for fusing and aligning the text semantics and the visual semantics to obtain an answer program;
and the program decoder is used for calculating the solution result according to the operation mode of the solution program.
In one possible implementation of the second aspect, the joint reasoning module comprises 12 self-attention units and 6 guided-attention units;
6 of the self-attention units are used for respectively encoding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
the remaining 6 self-attention units and the 6 guided-attention units are used for applying an attention mechanism in deep learning to make the coded text semantics correspond to the coded visual semantics to obtain aligned semantic data, inputting the aligned semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
Compared with the prior art, the geometric problem solving method and model based on deep learning and multi-modal numerical reasoning provided by the embodiments of the invention have the following beneficial effects: the invention acquires the text information and the image information of the question content, extracts the semantics contained in each, and then fuses and aligns the text semantics with the visual semantics of the image; on this basis it generates a corresponding solution program for the question, and finally calculates the solution result according to the operation mode of the solution program. This not only increases the solving accuracy but also improves the processing efficiency, providing a technique that is accurate, practical, and able to autonomously generate the code sequence of the answer through deep learning.
Drawings
Fig. 1 is a schematic flow chart of a geometric question answering method based on deep learning and multi-modal numerical reasoning according to an embodiment of the present invention;
FIG. 2 is a block diagram of an answer program according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a program code composition of an answer program according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a geometric problem solving system based on deep learning and multi-modal numerical reasoning according to an embodiment of the present invention;
FIG. 5 is a structural diagram of a neural network model with bimodal union of texts and images according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The currently common approach has the following technical problems: the number of related questions is huge, and the solution of a question changes whenever its parameters or data change, so a large number of variant answers are derived. If answer searching is performed only by recognizing the image, the answer returned to a single user has to be screened out of a very large candidate set, which is not conducive to students' extended learning; the amount of data to be processed is large, which increases processing time and reduces processing efficiency; and when questions are similar, incorrect screening easily occurs, which reduces screening accuracy and degrades the user experience.
In order to solve the above problem, a geometric problem solving method based on deep learning and multi-modal numerical reasoning provided by the embodiments of the present application will be described and explained in detail by the following specific embodiments.
Referring to fig. 1, a schematic flow chart of a geometric question answering method based on deep learning and multi-modal numerical reasoning according to an embodiment of the present invention is shown.
In one embodiment, the method is applied to a text-image bimodal joint neural network model.
By way of example, the geometric problem solving method based on deep learning and multi-modal numerical reasoning may include:
s11, respectively obtaining text information and image information about the title content.
The text information is the text information of the title, and the image information is the geometric image of the title.
In practice, the present application may be applied to the solution of geometric subjects, and optionally, to planar geometry and solid geometry.
Specifically, the user can directly input the text information and the image information of the title into the neural network model with the bimodal union of the text and the image, so that the neural network model with the bimodal union of the text and the image can perform corresponding solution operation.
S12, encoding the text information into a corresponding text hidden state to obtain text semantics, and encoding the image information into a corresponding image hidden state to obtain visual semantics.
In order to determine the specific content of the question and perform intelligent learning according to the specific content of the question to generate a correct answer, the text information may be encoded into a corresponding text hidden state to obtain text semantics, and the image information may be encoded into a corresponding image hidden state to obtain visual semantics.
In order to accurately derive text semantics, in an alternative embodiment, step S12 may include the following sub-steps:
substep S121, converting each word in the text information into a word vector.
And a substep S122, inputting each word vector into a preset single-layer non-bidirectional LSTM model to obtain a hidden state corresponding to each word.
And S123, coding the sequence of each hidden state to obtain text semantics.
For example, assume that the text sequence of the text information is P = {x_0, x_1, ..., x_n}. Each word x_i in the text information may be treated as a word vector (word embedding), and a single-layer non-bidirectional LSTM (long short-term memory) network is then used to encode each word embedding, giving the corresponding hidden state h_i. The hidden states of all word embeddings produced by the LSTM are then collected to obtain the hidden state of the whole problem sequence H_P = [h_0; ...; h_n], i.e. the corresponding text semantics.
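As an illustration only, such a text encoder could be sketched as follows in PyTorch; the vocabulary size, embedding dimension and hidden dimension are assumed values and not specified by this embodiment.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Single-layer unidirectional LSTM over word embeddings (illustrative sketch)."""
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                            bidirectional=False, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) indices of the words in the question text
        word_vectors = self.embedding(token_ids)            # each word x_i as a word vector
        hidden_states, (h_n, _) = self.lstm(word_vectors)   # hidden state h_i for every word
        # hidden_states corresponds to H_P = [h_0; ...; h_n], i.e. the text semantics
        return hidden_states, h_n.squeeze(0)
```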
In order to accurately obtain the corresponding visual semantics of the image, in an embodiment, the step S12 may further include the following sub-steps:
and a substep S124 of calling the first three layers of the residual error neural network trained by the model to encode the image information into a corresponding image hidden state to obtain visual semantics.
Specifically, the first three layers of the residual neural network may be the first three layers of the ResNet-101 neural network.
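A minimal sketch of such a diagram encoder follows, assuming "first three layers" refers to the stem plus the first three residual stages of torchvision's ResNet-101; this is an illustrative reading, not the exact patented configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DiagramEncoder(nn.Module):
    """Encodes the geometry diagram with the early stages of ResNet-101 (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101()  # weights would come from the jigsaw/element pre-training
        # Keep the stem plus the first three residual stages (layer1-layer3); drop layer4/avgpool/fc.
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )

    def forward(self, images):
        # images: (batch, 3, H, W) geometry diagrams
        feature_map = self.backbone(images)                 # (batch, 1024, H/16, W/16)
        # Flatten spatial positions into a sequence of visual tokens (the image hidden states).
        batch, channels, h, w = feature_map.shape
        visual_semantics = feature_map.view(batch, channels, h * w).permute(0, 2, 1)
        return visual_semantics                             # (batch, h*w, 1024)
```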
In order to improve the reading capability of the neural network, in one embodiment, the neural network may be subjected to corresponding model training.
Specifically, the model training comprises jigsaw position prediction training, geometric element prediction training and knowledge point classification training;
the jigsaw position prediction training specifically comprises the following steps: and cutting the image of the image information into a plurality of image blocks, randomly cutting one image block and predicting the position information of the cut image block in the image.
Specifically, the jigsaw position prediction training may be carried out as follows in order to pre-train the diagram encoder. For the pixel-level patch position prediction task, the image is split into m × m patches and one patch is randomly selected; the diagram encoder is then trained to predict the correct relative position of the selected patch, producing a cross-entropy loss. If the cross-entropy loss is large, the model can be judged to be under-fitted, and the answer accuracy of the questions given by the model is low; conversely, if the cross-entropy loss is too small, the model can be judged to be over-fitted, i.e. it merely "remembers" the answers of the training set; if the cross-entropy loss lies at an acceptable intermediate value, the model's answers are more accurate.
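A minimal sketch of this jigsaw (patch) position prediction objective is given below; the grid size, the mean-pooling of encoder features and the linear position head are illustrative assumptions, and the diagram encoder is assumed to be the one sketched earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def jigsaw_position_loss(encoder, position_head, image, m=3):
    """Splits the diagram into an m x m grid, picks one patch at random and asks the
    encoder (plus a small position head) to predict its position index (illustrative)."""
    b, c, height, width = image.shape
    ph, pw = height // m, width // m
    target = torch.randint(0, m * m, (b,), device=image.device)   # random patch index per image
    row, col = target // m, target % m
    patches = torch.stack([
        image[i, :, row[i] * ph:(row[i] + 1) * ph, col[i] * pw:(col[i] + 1) * pw]
        for i in range(b)
    ])
    # Resize the cut patch so the encoder sees an input of the usual size.
    feat = encoder(F.interpolate(patches, size=(height, width)))  # (b, tokens, channels)
    logits = position_head(feat.mean(dim=1))                      # e.g. nn.Linear(1024, m * m)
    return F.cross_entropy(logits, target)                        # cross-entropy position loss
```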
The geometric element prediction training specifically comprises the following steps: and inputting the image information into a residual error neural network, so that the residual error neural network can predict the geometric elements contained in the image information.
Specifically, since one graph may include a plurality of geometric elements, in order to improve the prediction accuracy of the geometric elements, the geometric element prediction training is trained in the following manner: geometric elements are first extracted as labels, which can be geometric elements in the topic text and geometric elements in the topic answer. The graph encoder is then trained using an N-way classifier with Binary Cross Entropy (BCE) as a loss function, where N is the number of possible geometric elements on the training set.
In this embodiment, the weight of the loss function of the tile position prediction training and the geometric element prediction training may be set to 1.0.
The knowledge point classification training specifically comprises the following steps: extracting geometric elements from the text information, taking the geometric elements in the preset question answers as model training labels, deploying an N-way classifier, and training with a binary cross-entropy loss function, wherein N is the total number of the geometric elements.
Specifically, in order to increase the model's overall perception of the problem, the knowledge point classification is trained as follows: a data set is preset that summarizes and assembles a plurality of knowledge points and marks each question with one or more knowledge points. Next, the knowledge points of each question are predicted based on the solver output. A K-way classifier with binary cross-entropy (BCE) as the loss function can then be deployed to train this multi-label knowledge point prediction task, where K is the total number of possible knowledge points on the training set.
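Both the geometric-element prediction and the knowledge-point classification described above reduce to multi-label classification with a binary cross-entropy loss; the sketch below shows one possible head, with the label count (N geometric elements or K knowledge points) left as an assumed parameter.

```python
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """N-way (or K-way) multi-label classifier trained with binary cross-entropy (sketch)."""
    def __init__(self, feature_dim=1024, num_labels=50):   # num_labels is illustrative
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_labels)
        self.criterion = nn.BCEWithLogitsLoss()

    def forward(self, pooled_features, labels):
        # pooled_features: (batch, feature_dim) pooled output of the diagram or text encoder
        # labels: (batch, num_labels) multi-hot vector of geometric elements or knowledge points
        logits = self.classifier(pooled_features)
        return self.criterion(logits, labels.float())
```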
After the training tasks are completed, the auxiliary model can be better trained, each training task respectively improves the capability of each part of the model, and the capability of the model for searching geometric elements in the picture can be enhanced through jigsaw position prediction training; the geometric element prediction training can enhance the comprehension capability of the model on the topic; knowledge point classification training can enhance the ability of the model to use correct problem solving formulas when solving problems.
Specifically, the loss function of the model training is shown as follows:
L_g(θ) = -∑_{t=1}^{T} log p_θ(y_t | y_{<t}, x)
wherein the loss function L_g is the negative log-likelihood estimation of the target program sequence {y_t}, 1 ≤ t ≤ T, and is used for subsequently calculating the probability value of each preset sequence;
in the above equation, θ denotes the parameters of the entire NGS structure excluding the diagram encoder, and x is the input consisting of the problem text and the extracted diagram features.
And S13, fusing and aligning the text semantics and the visual semantics to obtain a solution program.
For solving multiple-choice geometry problems, it is important to jointly understand the semantics of the problem text and of the diagram and to align the semantic information. Aligning the two kinds of semantics makes it possible to determine the content of the question, so that a corresponding solution program can be generated to obtain the correct answer.
In an alternative embodiment, an attention mechanism can be adopted to perform the propagation and aggregation of the two kinds of semantics, and the reasoning module finally outputs the operation program of the solution by combining the text information and the image information.
To facilitate the alignment and fusion of the two semantics, in one embodiment, step S13 may include the following sub-steps:
and a substep S131, respectively encoding the text semantics and the visual semantics and outputting encoded text semantics and encoded visual semantics representing a hidden state.
And a substep S132, aligning the coding text semantics and the coding visual semantics by using an attention mechanism in deep learning to obtain alignment semantic data.
And a substep S133 of inputting the alignment semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
In actual practice, 12 self-attention units and 6 guided-attention units may be provided. First, 6 self-attention units (i.e., standard Transformer blocks) encode the text semantics and the visual semantics; the coded text semantics and coded visual semantics in the final hidden state output by the 6th self-attention unit are then taken as guidance information, which enables full fusion and alignment of the semantic representation of the problem text with the visual semantics of its diagram.
In an optional embodiment, an attention mechanism in deep learning can be used by these 6 self-attention units in the process of fusing and aligning the text semantics and the visual semantics; the mechanism makes the input text representation correspond to the input visual representation, so that the text semantic representation and the visual semantic representation are aligned, further improving the model's capability and accuracy in solving problems.
Specifically, the remaining 6 self-attention units and the 6 guided-attention units can be stacked with each other and jointly output the aligned and fused data, which strengthens the alignment and fusion of the text semantic representation and the visual semantic representation, enables the model to better understand the textual and visual information of the question, and yields a higher accuracy when the two representations are combined to solve the question.
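For illustration, one self-attention unit and one guided-attention unit could look as follows, built from standard Transformer components; the hidden size, number of heads and normalization placement are assumptions rather than values fixed by this embodiment.

```python
import torch
import torch.nn as nn

class SelfAttentionUnit(nn.Module):
    """Standard Transformer-style self-attention block (illustrative sketch)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.ffn(x))

class GuidedAttentionUnit(nn.Module):
    """Guided attention: visual tokens attend to the encoded text semantics (illustrative sketch)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, visual, text):
        visual = self.norm1(visual + self.self_attn(visual, visual, visual)[0])
        # Queries come from the visual tokens, keys/values from the encoded text semantics.
        visual = self.norm2(visual + self.cross_attn(visual, text, text)[0])
        return self.norm3(visual + self.ffn(visual))
```

Stacking six such guided-attention units on top of six self-attention units would mirror the arrangement described above.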
In one embodiment, the present application also introduces two multilayer perceptrons: an attentional reduction network with two multilayer perceptrons may be applied to aggregate the features.
In particular, the aligned semantic data may be input into the two multilayer perceptrons to output an aggregated feature F_D, which is a multimodal feature vector. Optionally, the aggregated feature F_D may be concatenated with the last hidden state h_n of the text encoder to obtain the final encoder state [F_D; h_n], which is used as the final aggregated multimodal feature vector in the subsequent solving procedure.
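A minimal sketch of this attentional reduction step follows: the first MLP scores each fused token to form an attention distribution, and the second projects the pooled vector; this split of roles between the two multilayer perceptrons is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalReduction(nn.Module):
    """Aggregates a token sequence into one vector with an MLP-predicted attention
    distribution, then projects it with a second MLP (illustrative sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.score_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.project_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, fused_tokens):
        # fused_tokens: (batch, num_tokens, dim) aligned semantic data from the attention stack
        weights = F.softmax(self.score_mlp(fused_tokens), dim=1)   # (batch, num_tokens, 1)
        pooled = (weights * fused_tokens).sum(dim=1)               # (batch, dim)
        return self.project_mlp(pooled)                            # aggregated feature F_D
```

The resulting vector corresponds to the aggregated feature F_D described above, which may then be concatenated with the last text-encoder hidden state h_n before being handed to the program decoder.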
Referring to fig. 2, a schematic diagram of the components of the solution program according to an embodiment of the present invention is shown.
In an alternative embodiment, the guidance information, when output, carries rich information relating words in the question to elements of the diagram; the output information contains corresponding textual and visual information, for example, the word "triangle" in the question text corresponds to a triangle element in the question image.
It should be noted that, besides common mathematical operations, the solution program may also include operations representing theorems and formula knowledge, such as the Pythagorean theorem or the area of a circular ring, so as to better solve geometric problems. Some common or simple geometric formulas need not be additionally defined; for example, for a square with side length a, its area can be directly calculated by Multiply(a, a).
The interpretability may also be reflected in the sequential course of operations, selected constants and variables, and the application of theorems and formulas when outputting the solution program. As shown in fig. 2, the user can have a rough understanding of the entire problem solving process after reading the program.
Referring to fig. 3, a schematic diagram of the program symbol composition of the solution program according to an embodiment of the present invention is shown. In one exemplary implementation, a new domain-specific language can be designed to model the precise operation program corresponding to a geometric problem. For example, the vocabulary of the solution program may include operators OP, constants Const, problem variables N appearing in the text and image of the geometry question, and process variables V produced during execution. As shown in fig. 3, the operators OP are divided into several categories, including basic operations, arithmetic operations, trigonometric functions, and theorems and formulas. Each operator OP takes n constant or variable elements. The constants Const are predefined values commonly used in geometric problems, such as pi or the 90 degrees of a right angle. The problem variables N depend on the specific question, while the process variables V depend on the specific operation procedure.
Referring to fig. 3, the operators and constants defined herein may include: basic operators: assignment, multiplication by 2, and division by 2; arithmetic operators: addition, subtraction, multiplication, and division; trigonometric functions: sin, cos, tan, arcsin, arccos; theorems and formulas: the Pythagorean theorem formulas for solving the hypotenuse and the shorter leg, the area of a circle, the circumference, the area of a cone, and proportions; constants: 30°, 60°, 90°, 180°, 360°, pi, 0.618. Through the combination of these operators and constants, computation can be carried out directly according to the solution program, so that the solution result can be obtained quickly and conveniently.
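To make the operator/constant/variable vocabulary concrete, the following sketch interprets a flat program sequence; the operator names, constant names and the prefix-style program layout are illustrative assumptions, not the exact symbols of fig. 3.

```python
import math

# Illustrative subset of the operator and constant vocabulary.
OPERATORS = {
    "Add":      lambda a, b: a + b,
    "Minus":    lambda a, b: a - b,
    "Multiply": lambda a, b: a * b,
    "Divide":   lambda a, b: a / b,
    "Half":     lambda a: a / 2,
    "Double":   lambda a: a * 2,
    "Sin":      lambda a: math.sin(math.radians(a)),
    # Theorem/formula operators beyond basic arithmetic:
    "PythagoreanHypotenuse": lambda a, b: math.hypot(a, b),
    "CircleArea":            lambda r: math.pi * r * r,
}
CONSTANTS = {"pi": math.pi, "90": 90.0, "0.618": 0.618}

def execute_program(tokens, problem_numbers):
    """Executes a flat prefix-style program such as ["Multiply", "N_0", "N_0"]
    (area of a square with side N_0). N_i refers to the i-th number in the question,
    V_i to an intermediate result produced while running the program."""
    memory = []                                   # process variables V_0, V_1, ...
    i = 0
    while i < len(tokens):
        op = OPERATORS[tokens[i]]
        arity = op.__code__.co_argcount
        args = []
        for tok in tokens[i + 1:i + 1 + arity]:
            if tok.startswith("N_"):
                args.append(problem_numbers[int(tok[2:])])
            elif tok.startswith("V_"):
                args.append(memory[int(tok[2:])])
            else:
                args.append(CONSTANTS[tok])
        memory.append(op(*args))
        i += 1 + arity
    return memory[-1]

# Example: a square with side 3 -> Multiply(N_0, N_0) = 9.0
print(execute_program(["Multiply", "N_0", "N_0"], [3.0]))
```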
And S14, calculating the answer result according to the operation mode of the answer program.
Referring to fig. 3, in actual operation, the final solution result may be calculated from the solution program according to its operators and values.
In order to quickly calculate the solution result, step S14 may include the following sub-steps, as an example:
substep S141, filter the program sequence from the solution program.
Since a question can usually be solved in several ways, multiple candidate solution programs may exist. In order to select the most convenient and accurate one, in an embodiment, the sub-step S141 may include the following sub-steps:
and a substep S1411, decoding the text semantics and the visual semantics by using a preset LSTM decoder to obtain decoded information.
And a substep S1412, inputting the decoding information to a preset full connection layer to obtain an initial state, and obtaining a decoding hidden state of the LSTM decoder by using the initial state and a preset attention mechanism in series.
Sub-step S1413, inputting the decoding hidden state into a preset fully-connected layer followed by a preset softmax function to predict a plurality of preset sequences.
And a substep S1414, calculating a probability value for each preset sequence by using a preset negative log-likelihood estimation to obtain a plurality of probability values.
And a substep S1415 of screening the probability value with the maximum value from the plurality of probability values, and taking a preset sequence corresponding to the probability value with the maximum value as a program sequence.
Specifically, the decoding procedure uses an LSTM decoder. Denote by {y_t}, 1 ≤ t ≤ T, the target program to be generated, and by s_t the hidden state of the LSTM at time t. The multimodal feature vector is input to a linear layer to obtain the initial state s_0. At each step, s_t is concatenated with the above fused result and input into a linear layer followed by a softmax function to predict the distribution of the next program token P_t. The linear layer here is a fully-connected layer, P denotes a program sequence, and P_t is its t-th token. During training, the prediction is constrained by the negative log-likelihood (NLL) loss of the target program; during testing, the probability distribution over all candidate program sequences is computed and the one with the highest probability is selected, thereby obtaining the corresponding program sequence.
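The decoding step described above could be sketched as follows: an LSTM cell whose initial state s_0 comes from the multimodal feature vector, whose hidden state is combined with attention over the fused features, and whose output passes through a fully-connected layer and softmax (applied implicitly inside the cross-entropy) to score the next program token. All dimensions and the single-head attention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgramDecoder(nn.Module):
    """LSTM decoder that predicts the solution-program token sequence (illustrative sketch)."""
    def __init__(self, vocab_size=100, embed_dim=256, hidden_dim=512, feature_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.init_fc = nn.Linear(feature_dim, hidden_dim)     # multimodal feature -> initial state s_0
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)
        self.output_fc = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, multimodal_feature, fused_tokens, target_tokens):
        # multimodal_feature: (batch, feature_dim); fused_tokens: (batch, num_tokens, hidden_dim)
        # target_tokens: (batch, T) gold program sequence {y_t}, starting with a BOS token
        batch, T = target_tokens.shape
        h = self.init_fc(multimodal_feature)                  # s_0
        c = torch.zeros_like(h)
        nll = 0.0
        for t in range(T):
            prev = self.embedding(target_tokens[:, t])        # teacher forcing
            h, c = self.cell(prev, (h, c))
            context, _ = self.attn(h.unsqueeze(1), fused_tokens, fused_tokens)
            logits = self.output_fc(torch.cat([h, context.squeeze(1)], dim=-1))
            if t + 1 < T:
                # negative log-likelihood of the next gold token y_{t+1}
                nll = nll + F.cross_entropy(logits, target_tokens[:, t + 1])
        return nll
```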
And a substep S142, obtaining operators and operational data contained in the program sequence.
And a substep S143 of calculating according to the program sequence, the operator and the operational data to obtain a solution result.
In addition, it should be noted that once a complete solution program has been decoded, each operator in the program is executed in sequence to obtain a numerical result. After beam search generates the top N candidate programs {g_1, ..., g_n}, each program is executed and computed step by step. If g_i contains a grammatical error (e.g., the number of parameters does not match the current operator) or the executed value does not match any option of the current question, the execution fails. The first successfully executed program is taken as the predicted solution, and the corresponding option is selected. If all N programs fail, the executor reports "no result" directly rather than guessing an option.
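Assuming an execute_program helper like the one sketched earlier and numeric answer options, this candidate-selection logic could look as follows; the matching tolerance is an illustrative choice.

```python
def select_answer(candidate_programs, problem_numbers, choices, tol=1e-2):
    """Executes the top-N beam candidates in order and returns the first one whose
    result matches an answer option; reports 'no result' if all candidates fail."""
    for program in candidate_programs:                 # ordered by decreasing probability
        try:
            value = execute_program(program, problem_numbers)
        except Exception:                              # grammatical error, e.g. arity mismatch
            continue
        for option, option_value in choices.items():
            if abs(value - option_value) < tol:
                return option, program                 # first successfully executed program
    return "no result", None

# Example: candidates for "area of a square with side 3", options A and B.
print(select_answer([["Multiply", "N_0", "N_0"]], [3.0], {"A": 6.0, "B": 9.0}))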
In summary, the embodiment of the present invention provides a geometric problem solving method based on deep learning and multi-modal numerical reasoning, which has the following beneficial effects: the invention acquires the text information and the image information of the question content, extracts the semantics contained in each, and then fuses and aligns the text semantics with the visual semantics of the image; on this basis it generates a corresponding solution program for the question, and finally calculates the solution result according to the operation mode of the solution program. This not only increases the solving accuracy but also improves the processing efficiency, providing a technique that is accurate, practical, and able to autonomously generate the code sequence of the answer through deep learning.
Referring to fig. 4, a schematic structural diagram of a geometric problem solving system based on deep learning and multi-modal numerical reasoning according to an embodiment of the present invention is shown.
The system is applied to a neural network model with bimodal union of text and images.
By way of example, the geometric problem solving system based on deep learning and multi-modal numerical reasoning can comprise:
an obtaining module 401, configured to obtain text information and image information about the question content respectively;
an encoding module 402, configured to encode the text information into a corresponding text hidden state to obtain a text semantic, and encode the image information into a corresponding image hidden state to obtain a visual semantic;
a fusion and alignment module 403, configured to fuse and align the text semantics and the visual semantics to obtain an answer program;
and a calculating module 404, configured to calculate an answer result according to an operation manner of the answer program.
Optionally, the fusion and alignment module is further configured to:
respectively coding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
aligning the coding text semantics and the coding visual semantics by using an attention mechanism in deep learning to obtain aligned semantic data;
and inputting the alignment semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
Optionally, the computing module is further configured to:
screening program sequences from the solution program;
acquiring operators and operation data contained in the program sequence;
and calculating according to the program sequence, the operator and the operation data to obtain a solution result.
Optionally, the computing module is further configured to:
decoding the text semantics and the visual semantics by using a preset LSTM decoder to obtain decoded information;
inputting the decoding information into a preset fully-connected layer to obtain an initial state, and connecting the initial state and a preset attention mechanism in series to obtain a decoding hidden state of the LSTM decoder;
inputting the decoding hidden state into a preset fully-connected layer followed by a preset softmax function to predict a plurality of preset sequences;
calculating a probability value for each preset sequence by using a preset negative log-likelihood estimation to obtain a plurality of probability values;
and screening the probability value with the maximum value from the plurality of probability values, and taking a preset sequence corresponding to the probability value with the maximum value as a program sequence.
Optionally, the encoding module is further configured to:
and calling the first three layers of the model-trained residual neural network to encode the image information into a corresponding image hidden state to obtain visual semantics.
Optionally, the model training comprises puzzle position prediction training, geometric element prediction training and knowledge point classification training;
the jigsaw position prediction training specifically comprises the following steps: cutting the image of the image information into a plurality of image blocks, randomly cutting one image block and predicting the position information of the cut image block in the image;
the geometric element prediction training specifically comprises the following steps: inputting the image information into a residual error neural network, so that the residual error neural network can predict geometric elements contained in the image information;
the knowledge point classification training specifically comprises the following steps: extracting geometric elements from the text information, taking the geometric elements in the preset question answers as model training labels, deploying an N-way classifier, and training with a binary cross-entropy loss function, wherein N is the total number of the geometric elements.
Optionally, the loss function of the model training is as follows:
L_g(θ) = -∑_{t=1}^{T} log p_θ(y_t | y_{<t}, x)
wherein the loss function L_g is the negative log-likelihood estimation of the target program sequence {y_t}, 1 ≤ t ≤ T, and is used for calculating the probability value of each preset sequence;
in the above equation, θ denotes the parameters of the entire NGS structure excluding the diagram encoder, and x is the input consisting of the problem text and the extracted diagram features.
Optionally, the encoding module is further configured to:
converting each word in the text information into a word vector;
inputting each word vector into a preset single-layer non-bidirectional LSTM model to obtain a hidden state corresponding to each word;
and coding the sequence of each hidden state to obtain text semantics.
The embodiment of the invention also provides a neural network model with bimodal combination of texts and images, and referring to fig. 5, a schematic structural diagram of the neural network model with bimodal combination of texts and images provided by the embodiment of the invention is shown.
The neural network model is suitable for the geometric problem solving method based on deep learning and multi-modal numerical reasoning as described above,
by way of example, the neural network model of bimodal union of text and image may include: a text encoder, an image encoder, a joint reasoning module and a program decoder;
wherein the text encoder, the image encoder and the program decoder are respectively connected with the joint reasoning module;
the text encoder is used for acquiring text information related to the question content and encoding the text information into a corresponding text hidden state to obtain text semantics;
the image encoder is used for acquiring image information related to the question content and encoding the image information into a corresponding image hidden state to obtain visual semantics;
the joint reasoning module is used for fusing and aligning the text semantics and the visual semantics to obtain an answer program;
and the program decoder is used for calculating the solution result according to the operation mode of the solution program.
Optionally, the joint reasoning module comprises 12 self-attention units and 6 guided-attention units;
6 of the self-attention units are used for respectively encoding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
the remaining 6 self-attention units and the 6 guided-attention units are used for applying an attention mechanism in deep learning to make the coded text semantics correspond to the coded visual semantics to obtain aligned semantic data, inputting the aligned semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
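For orientation only, the four components could be wired together roughly as follows, reusing the illustrative classes sketched in the method description; the exact stacking of the 12 self-attention and 6 guided-attention units in this embodiment may differ from this simplified composition.

```python
import torch.nn as nn

class GeometryQAModel(nn.Module):
    """Text-image bimodal joint model: encoders -> joint reasoning -> program decoder (sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.text_encoder = TextEncoder(hidden_dim=dim)
        self.image_encoder = DiagramEncoder()
        self.visual_proj = nn.Linear(1024, dim)                 # project ResNet features to dim
        self.self_attn_stack = nn.ModuleList([SelfAttentionUnit(dim) for _ in range(6)])
        self.guided_stack = nn.ModuleList([GuidedAttentionUnit(dim) for _ in range(6)])
        self.reduce = AttentionalReduction(dim)
        self.decoder = ProgramDecoder(hidden_dim=dim, feature_dim=dim)

    def forward(self, token_ids, images, target_program):
        text, _ = self.text_encoder(token_ids)                  # text semantics
        visual = self.visual_proj(self.image_encoder(images))   # visual semantics
        for unit in self.self_attn_stack:                       # encode the text semantics
            text = unit(text)
        for unit in self.guided_stack:                          # fuse/align with guided attention
            visual = unit(visual, text)
        multimodal = self.reduce(visual)                        # aggregated multimodal feature
        return self.decoder(multimodal, visual, target_program) # NLL loss of the solution program
```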
Further, an embodiment of the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the geometric problem solving method based on deep learning and multi-modal numerical reasoning as described above.
Further, the present application provides a computer-readable storage medium, which stores computer-executable instructions for causing a computer to execute the geometric problem solving method based on deep learning and multi-modal numerical reasoning as described in the above embodiments.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A geometric problem solving method based on deep learning and multi-modal numerical reasoning is characterized in that the method is applied to a neural network model with bimodal combination of texts and images, and the method comprises the following steps:
respectively acquiring text information and image information about the question content;
encoding the text information into a corresponding text hidden state to obtain text semantics, and encoding the image information into a corresponding image hidden state to obtain visual semantics;
fusing and aligning the text semantics and the visual semantics to obtain a solution program;
and calculating an answer result according to the operation mode of the answer program.
2. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 1, wherein the fusing and aligning the text semantics and the visual semantics to obtain a solving program comprises:
respectively coding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
aligning the coding text semantics and the coding visual semantics by using an attention mechanism in deep learning to obtain aligned semantic data;
and inputting the alignment semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
3. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 1, wherein the calculating a solution result according to the operation manner of the solution program comprises:
screening program sequences from the solution program;
acquiring operators and operation data contained in the program sequence;
and calculating according to the program sequence, the operator and the operation data to obtain a solution result.
4. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 3, wherein the screening of a program sequence from the solution program comprises:
decoding the text semantics and the visual semantics by using a preset LSTM decoder to obtain decoded information;
inputting the decoding information into a preset fully-connected layer to obtain an initial state, and connecting the initial state and a preset attention mechanism in series to obtain a decoding hidden state of the LSTM decoder;
inputting the decoding hidden state into a preset fully-connected layer followed by a preset softmax function to predict a plurality of preset sequences;
calculating a probability value for each preset sequence by using a preset negative log-likelihood estimation to obtain a plurality of probability values;
and screening the probability value with the maximum value from the plurality of probability values, and taking a preset sequence corresponding to the probability value with the maximum value as a program sequence.
5. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 1, wherein the encoding the image information into the corresponding image hidden state to obtain visual semantics comprises:
and calling the first three layers of the model-trained residual neural network to encode the image information into a corresponding image hidden state to obtain visual semantics.
6. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 5, wherein the model training comprises jigsaw position prediction training, geometric element prediction training and knowledge point classification training;
the jigsaw position prediction training specifically comprises the following steps: cutting the image of the image information into a plurality of image blocks, randomly cutting one image block and predicting the position information of the cut image block in the image;
the geometric element prediction training specifically comprises the following steps: inputting the image information into a residual error neural network, so that the residual error neural network can predict geometric elements contained in the image information;
the knowledge point classification training specifically comprises the following steps: extracting geometric elements from the text information, taking the geometric elements in the preset question answers as model training labels, deploying an N-way classifier, and training with a binary cross-entropy loss function, wherein N is the total number of the geometric elements.
7. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 6, wherein the loss function of the model training is as follows:
L_g(θ) = -∑_{t=1}^{T} log p_θ(y_t | y_{<t}, x)
wherein the loss function L_g is the negative log-likelihood estimation of the target program sequence {y_t}, 1 ≤ t ≤ T, and is used for calculating the probability value of each preset sequence;
in the above equation, θ denotes the parameters of the entire NGS structure excluding the diagram encoder, and x is the input consisting of the problem text and the extracted diagram features.
8. The geometric problem solving method based on deep learning and multi-modal numerical reasoning according to claim 1, wherein the encoding the text information into the corresponding text hidden state to obtain the text semantics comprises:
converting each word in the text information into a word vector;
inputting each word vector into a preset single-layer non-bidirectional LSTM model to obtain a hidden state corresponding to each word;
and coding the sequence of each hidden state to obtain text semantics.
9. A neural network model for bimodal union of text and images, the neural network model being suitable for the method for solving geometric problems based on deep learning and multimodal numerical reasoning according to any one of claims 1 to 8, the neural network model comprising: a text encoder, an image encoder, a joint reasoning module and a program decoder;
wherein the text encoder, the image encoder and the program decoder are respectively connected with the joint reasoning module;
the text encoder is used for acquiring text information related to the question content and encoding the text information into a corresponding text hidden state to obtain text semantics;
the image encoder is used for acquiring image information related to the question content and encoding the image information into a corresponding image hidden state to obtain visual semantics;
the joint reasoning module is used for fusing and aligning the text semantics and the visual semantics to obtain an answer program;
and the program decoder is used for calculating the solution result according to the operation mode of the solution program.
10. The text and image bimodal joint neural network model of claim 9, wherein the joint reasoning module includes 12 self-attention units and 6 guided-attention units;
6 of the self-attention units are used for respectively encoding the text semantics and the visual semantics and outputting coded text semantics and coded visual semantics representing a hidden state;
the remaining 6 self-attention units and the 6 guided-attention units are used for applying an attention mechanism in deep learning to make the coded text semantics correspond to the coded visual semantics to obtain aligned semantic data, inputting the aligned semantic data into two preset multilayer perceptrons to obtain an aggregated multimodal feature vector, and constructing a solution program by using the multimodal feature vector.
CN202110982368.2A 2021-08-25 2021-08-25 Geometric question answering method and model based on deep learning and multi-mode numerical reasoning Pending CN113672716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110982368.2A CN113672716A (en) 2021-08-25 2021-08-25 Geometric question answering method and model based on deep learning and multi-mode numerical reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110982368.2A CN113672716A (en) 2021-08-25 2021-08-25 Geometric question answering method and model based on deep learning and multi-mode numerical reasoning

Publications (1)

Publication Number Publication Date
CN113672716A true CN113672716A (en) 2021-11-19

Family

ID=78546252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110982368.2A Pending CN113672716A (en) 2021-08-25 2021-08-25 Geometric question answering method and model based on deep learning and multi-mode numerical reasoning

Country Status (1)

Country Link
CN (1) CN113672716A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780775A (en) * 2022-04-24 2022-07-22 西安交通大学 Image description text generation method based on content selection and guide mechanism
CN114861889A (en) * 2022-07-04 2022-08-05 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN116071835A (en) * 2023-04-07 2023-05-05 平安银行股份有限公司 Face recognition attack post screening method and device and electronic equipment
CN117633643A (en) * 2024-01-26 2024-03-01 江西师范大学 Automatic middle school geometric problem solving method based on contrast learning
CN117726721A (en) * 2024-02-08 2024-03-19 湖南君安科技有限公司 Image generation method, device and medium based on theme drive and multi-mode fusion
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data
CN113656570A (en) * 2021-08-25 2021-11-16 平安科技(深圳)有限公司 Visual question answering method and device based on deep learning model, medium and equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data
CN113656570A (en) * 2021-08-25 2021-11-16 平安科技(深圳)有限公司 Visual question answering method and device based on deep learning model, medium and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIAQI CHEN等: "GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning", 《HTTPS://ARXIV.ORG/PDF/2105.14517V1.PDF》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780775A (en) * 2022-04-24 2022-07-22 西安交通大学 Image description text generation method based on content selection and guide mechanism
CN114861889A (en) * 2022-07-04 2022-08-05 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN114861889B (en) * 2022-07-04 2022-09-27 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN116071835A (en) * 2023-04-07 2023-05-05 平安银行股份有限公司 Face recognition attack post screening method and device and electronic equipment
CN116071835B (en) * 2023-04-07 2023-06-20 平安银行股份有限公司 Face recognition attack post screening method and device and electronic equipment
CN117633643A (en) * 2024-01-26 2024-03-01 江西师范大学 Automatic middle school geometric problem solving method based on contrast learning
CN117633643B (en) * 2024-01-26 2024-05-14 江西师范大学 Automatic middle school geometric problem solving method based on contrast learning
CN117726721A (en) * 2024-02-08 2024-03-19 湖南君安科技有限公司 Image generation method, device and medium based on theme drive and multi-mode fusion
CN117726721B (en) * 2024-02-08 2024-04-30 湖南君安科技有限公司 Image generation method, device and medium based on theme drive and multi-mode fusion
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113672716A (en) Geometric question answering method and model based on deep learning and multi-mode numerical reasoning
CN112613303B (en) Knowledge distillation-based cross-modal image aesthetic quality evaluation method
Chen et al. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning
CN110825875B (en) Text entity type identification method and device, electronic equipment and storage medium
CN114511860B (en) Difference description statement generation method, device, equipment and medium
CN113656570A (en) Visual question answering method and device based on deep learning model, medium and equipment
CN110825867B (en) Similar text recommendation method and device, electronic equipment and storage medium
CN115829033B (en) Mathematic application question knowledge construction and solution method, system, equipment and storage medium
CN111160606B (en) Test question difficulty prediction method and related device
CN113282713A (en) Event trigger detection method based on difference neural representation model
CN111694935A (en) Multi-turn question and answer emotion determining method and device, computer equipment and storage medium
CN111126610A (en) Topic analysis method, topic analysis device, electronic device and storage medium
CN110765241B (en) Super-outline detection method and device for recommendation questions, electronic equipment and storage medium
CN114297399A (en) Knowledge graph generation method, knowledge graph generation system, storage medium and electronic equipment
CN111784048B (en) Test question difficulty prediction method and device, electronic equipment and storage medium
CN117421410A (en) Text matching method and device in question-answering system
US20240037336A1 (en) Methods, systems, and media for bi-modal understanding of natural languages and neural architectures
CN114358579A (en) Evaluation method, evaluation device, electronic device, and computer-readable storage medium
CN113010662B (en) Hierarchical conversational machine reading understanding system and method
CN114707518A (en) Semantic fragment-oriented target emotion analysis method, device, equipment and medium
CN112818688A (en) Text processing method, device, equipment and storage medium
CN114238587A (en) Reading understanding method and device, storage medium and computer equipment
CN115510199A (en) Data processing method, device and system
CN113505602A (en) Intelligent marking method and device suitable for judicial examination subjective questions and electronic equipment
CN117633643B (en) Automatic middle school geometric problem solving method based on contrast learning

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination
RJ01 - Rejection of invention patent application after publication
Application publication date: 20211119