CN117972070B - Large model form question-answering method - Google Patents
- Publication number
- CN117972070B CN117972070B CN202410382106.6A CN202410382106A CN117972070B CN 117972070 B CN117972070 B CN 117972070B CN 202410382106 A CN202410382106 A CN 202410382106A CN 117972070 B CN117972070 B CN 117972070B
- Authority
- CN
- China
- Prior art keywords
- question
- data
- model
- corpus
- answer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a large model form question-answering method in the technical field of machine learning, comprising the following steps: collecting table data and automatically generating a text corpus according to a preset corpus generation template and corpus generation rules; performing preference learning on a large language model based on prompt learning and fine-tuning techniques using the text corpus to obtain a question-answer model; acquiring a question-answering task table and a question, processing them with an improved RCI method to obtain table sequence data and a question type, and producing an answer with the question-answer model from the table sequence data and the question type. By automatically generating a corpus through template design and rule formulation, a better large model is trained, and the accuracy and effectiveness of table question-answering are improved on the basis of row-and-column table serialization semantic analysis combined with the large model.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a large model form question-answering method.
Background
Table data, a knowledge-dense form of structured data storage, is widely used in business, education, medicine, the military, and other fields. Large language models trained on massive unsupervised corpora have made significant progress on general tasks, but challenges remain in table question-answering applications, mainly inadequate reasoning over complex questions, biased semantic understanding of table structure, and the high cost of learning user preferences. Built on the rapid development of deep learning and improvements in model architecture, large language models acquire the ability to learn, understand, and process natural language through training on large-scale unsupervised corpora, achieving excellent performance in natural language processing tasks such as text generation and sentiment analysis. Table question-answering is one of the important applications of large language models: it helps users quickly obtain needed information, speeds up data management and processing, and provides more comprehensive support for decision analysis across industries. Prompt learning and fine-tuning are currently the most widely adopted techniques for improving the performance of large models on specific tasks with as few resources as possible.
At present, prompt learning techniques mainly include manual template design, automated discrete template generation, and continuous template generation, and the mainstream methods for fine-tuning a large model are the Freeze, P-Tuning, and LoRA methods. Table question-answering based on large language models faces the following difficulties in practice: 1. large models struggle to correctly understand table data, including the data relationships within a table and the accurate identification of entities and attributes; 2. when applied to table question-answering, a large model has difficulty generating answers that meet actual requirements and user preferences, and it must be considered how to improve the table question-answering effect as much as possible under limited data, hardware, and other resources.
Therefore, how to improve the accuracy and effect of form questions and answers under the condition of limited resources is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a large model form question-answering method. To address the problem of large models correctly understanding table semantics, it proposes an automated corpus generation scheme based on template design and rule formulation; the resulting corpus can be used to build a table question-answering knowledge base and to fine-tune the large model. Data-driven adaptation of the large model to a specific task under limited resources is achieved through frontier techniques such as prompt learning and LoRA fine-tuning, and the table question-answering capability of the large model is improved through row-and-column table serialization semantic analysis and question classification.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A large model form question-answering method comprises the following steps:
Step 1: collecting table data, and automatically generating text corpus according to a set corpus generation template and corpus generation rule;
Step 2: according to the text corpus, performing preference learning on the large language model based on prompt learning and fine-tuning techniques to obtain a question-answer model;
Step 3: acquiring a question and answer task form and a question, processing the question and answer task form and the question by adopting an improved RCI method to obtain form sequence data and a question type, and obtaining a question and answer by utilizing the question and answer model according to the form sequence data and the question type.
The technical effect of this scheme is that the automated corpus generation scheme based on template design and rule formulation can be used to build a table question-answering knowledge base and fine-tune the large model; data-driven adaptation of the large model to a specific task under limited resources is achieved through prompt learning and fine-tuning; and table serialization semantic analysis and row-and-column question classification are realized with an improved RCI method, improving the table question-answering capability of the large model.
Preferably, the implementation process of step 1 is as follows:
Step 11: collecting table data, analyzing and processing to obtain a list table;
Step 12: setting a plurality of corpus generating templates according to different question-answering tasks, and placing the corpus generating templates in a configuration file;
Step 13: setting corpus generation rules according to the characteristics and the structure of the form data, and placing the corpus generation rules in a configuration file;
Step 14: and filling the list table into the corpus generation template according to the corpus generation rule to generate text corpus.
Preferably, the corpus generation template determines the corpus structure and format and further contains placeholders, which are replaced according to the table data.
Preferably, the corpus generation rules include operations such as data type conversion, value mapping, and text processing; value mapping includes mapping numeric keys to corresponding string values using a dictionary mapping method.
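As an illustration of the rule-driven template filling described above — a minimal sketch, assuming a hypothetical gender-code table and field names not taken from the patent:

```python
# Minimal sketch of rule-driven template filling. The gender code table
# and field names are hypothetical illustrations.
GENDER_MAP = {1: "male", 2: "female"}  # value mapping: numeric key -> string

TEMPLATE = "{name} is a {age}-year-old {gender} living at {address}."

def generate_corpus(row: dict) -> str:
    values = dict(row)
    # value mapping: replace the numeric gender code with a readable string
    values["gender"] = GENDER_MAP.get(values["gender"], "unknown")
    # data type conversion: normalize the age to a string
    values["age"] = str(values["age"])
    return TEMPLATE.format(**values)

row = {"name": "Zhang San", "age": 18, "gender": 1,
       "address": "District A1, City A"}
print(generate_corpus(row))
```

In practice one template would be kept per question-answering task in the configuration file, and the rules would cover every mapped field rather than only gender.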
Preferably, the implementation process of the step 2 is as follows:
step 21: generating a vector knowledge base from the text corpus using a word2vec model, performing similarity matching against the question text vector of the question-answering task, and taking the corpus fragment with the maximum similarity as the fragment most relevant to the task;
step 22: generating a large model training data set according to the table data and the text corpus;
Step 23: generating an optimal Prompt vector from the corpus fragment by a manual design method or the automated Prompt v2 method of P-tuning v2; wherein,
the manual design method generates the optimal Prompt vector from the corpus fragment according to the requirements;
the automated Prompt v2 method takes the corpus fragment as a Prompt, performs preference learning training on the large language model according to the large model training dataset using the automated Prompt v2 method of P-tuning v2, and optimizes the trainable discrete and continuous Prompts to obtain the optimal Prompt vector;
Step 24: fine-tuning the parameters of the trained large language model with the LoRA fine-tuning method according to the continuous Prompt in the optimal Prompt vector to obtain a question-answer model.
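The retrieval in step 21 can be sketched as follows; a toy bag-of-words average stands in for a trained word2vec model, and the tiny vocabulary and fragments are hypothetical:

```python
# Sketch of corpus-fragment retrieval: embed question and fragments,
# pick the fragment with the highest cosine similarity to the question.
import math

VECS = {  # toy word vectors standing in for a trained word2vec model
    "age": [1.0, 0.0], "name": [0.0, 1.0], "address": [0.5, 0.5],
}

def embed(text: str) -> list[float]:
    words = [w for w in text.split() if w in VECS]
    if not words:
        return [0.0, 0.0]
    return [sum(VECS[w][k] for w in words) / len(words) for k in range(2)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_fragment(question: str, fragments: list[str]) -> str:
    q = embed(question)
    return max(fragments, key=lambda f: cosine(q, embed(f)))

frags = ["name of each person", "age of each person"]
print(best_fragment("what is the age", frags))
```

A real vector knowledge base would store pretrained word2vec embeddings of every generated corpus fragment; only the argmax-over-cosine-similarity step is shown here.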
The technical effect of this scheme is that, after adding the Prompt and applying efficient LoRA fine-tuning, the large model does not need full-parameter tuning for table question-answering in different task scenarios; instead, model performance under limited resources is quickly improved by adjusting the intrinsic dimension of the large model and adapting it to the task, while more intelligent and personalized table information can be provided in practical applications to assist user decisions.
Preferably, the large model training dataset includes the table data type, the table data, and the text corpus, used respectively as the three fields Type, Input, and Output in large model training.
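A minimal sketch of one training record with the three fields named above; the field contents are illustrative only:

```python
# One record of the large-model training dataset: table data type,
# serialized table data, and the generated text corpus fill the
# Type / Input / Output fields. Contents are illustrative.
import json

record = {
    "Type": "personnel",                     # table data type
    "Input": "Name: Zhang San | Age: 18",    # serialized table data
    "Output": "Zhang San is 18 years old.",  # generated text corpus
}
print(json.dumps(record))
```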
Preferably, the improved RCI method includes new sequence processing and classification prediction, and the specific implementation process of the step 3 is as follows:
Step 31: serializing the question-answering task table with the new sequence processing formula to obtain table sequence data, and appending it to the question with the standard [CLS] and [SEP] tokens to form a table data-question sequence pair;
Step 32: inputting the table data-question sequence pair into the Transformer encoder ALBERT; using the output of the [CLS] token at a linear layer of the encoder ALBERT as the vector representations of the question and of the j-th column sequence of the table sequence data respectively, concatenating the question vector, the table sequence column vector, their element-wise product, and the square of their element-wise difference, and calculating the probability that the table sequence column vector belongs to the question target with a softmax function;
Step 33: selecting answer candidate rows from the table sequence data according to a set confidence threshold and the probability;
step 34: processing the question vector with a Transformer-based classifier in the classification prediction stage to obtain the question type;
Step 35: combining the answer candidate rows and the question type into an input prompt and inputting it to the question-answer model to obtain the answer.
Preferably, the new sequence processing formula is expressed as:
T = B(T_1) ⊕ B(T_2) ⊕ … ⊕ B(T_m);
T_i = C(h_1^:, c_{i1}^|, h_2^:, c_{i2}^|, …, h_n^:, c_{in}^|);
wherein T represents the table sequence data; T_i represents the i-th row data sequence of the table sequence data; h^: denotes appending a colon symbol (':') after the header string; c^| denotes prepending a vertical bar ('|') before the cell value string; C denotes the concatenation operation; B denotes adding separators ('[' and ']') before and after each row data sequence of the table sequence data; ⊕ denotes joining the wrapped row sequences; m represents the number of rows and n the number of columns of the question-answering task table, [h_1, h_2, …, h_n] are the headers of the question-answering task table, and c_{ij} are the cells of the question-answering task table.
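The serialization formula can be sketched as follows; the exact delimiter spacing is one plausible reading of the operators:

```python
# Sketch of the improved serialization: append ':' to each header,
# prepend '|' to each cell value, concatenate header-cell pairs into a
# row sequence, and wrap every row in '[' ']' before joining.
def serialize_table(header: list[str], rows: list[list[str]]) -> str:
    wrapped = []
    for row in rows:
        pairs = " ".join(f"{h}:|{c}" for h, c in zip(header, row))
        wrapped.append(f"[{pairs}]")
    return "".join(wrapped)

header = ["Name", "Age"]
rows = [["Zhang San", "18"], ["Li Si", "24"]]
print(serialize_table(header, rows))
```

Unlike the original RCI scheme, each table only needs to be serialized once, row by row, with the headers repeated inside every bracketed row.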
The technical effect of this scheme is that the improved RCI method is used to compute probabilities, a confidence threshold yields the candidate rows containing the answer, the candidates are aggregated according to the question type predicted by the classifier, and the results are jointly injected into the input of the large language model, which is trained to generate accurate answers; the improved RCI method preserves the model's understanding of the table while simplifying table data sequence processing, and the intention recognition model classifies the user's question intent, helping the large model further understand the structural relationships of the data table.
Compared with the prior art, the invention discloses a large model form question-answering method. On the premise of accounting for table diversity, it provides a corpus generation scheme with unified steps of data processing, template design, and rule formulation, and, combined with frontier large-model prompt fine-tuning techniques, quickly realizes user preference requirements in different scenarios with few resources. By injecting trainable prompts and applying a low-rank fine-tuning method, the model can generate answers meeting specific user preferences under limited-resource scenarios, so that high-quality knowledge corpora can be generated in batches from table corpora through the strong learning potential of the large model; the resulting corpus can serve as a knowledge base assisting the model in question-answering, or be used to build training datasets for model preference learning. Against the table semantic understanding bias, a new table serialization scheme is proposed for retrieving the cells relevant to a question based on the RCI row-column semantic capture technique, avoiding serializing the table twice; combined with a Transformer-based classifier that predicts the operation type of the question, the candidate data row sequences among the cells are determined according to the predicted operation type, strengthening the model's understanding of the table task, and the candidate data row sequences and the predicted question type are jointly constructed into the input to assist the large language model, improving its table question-answering capability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a big model form-oriented question-answering flow provided by the invention;
FIG. 2 is a schematic diagram of a table-automated corpus generation process provided by the present invention;
FIG. 3 is a schematic diagram of a large model preference learning flow based on prompt learning and fine tuning provided by the invention;
FIG. 4 is a schematic diagram of the LoRA fine-tuning process provided by the present invention;
FIG. 5 is a schematic diagram of an RCI interaction model architecture provided by the present invention;
Fig. 6 is a flow chart of the large model assisted reasoning based on RCI and the question classifier provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a large model form question-answering method, the flow of which is shown in figure 1. For background, the following prior art techniques are used:
(1) The automated Prompt method of P-tuning v2:
Assume the language model is M, the i-th prompt token in the prompt template (abbreviated T) is denoted [P_i], the context input is x, and the target output is y. When mapping vectors in the token embedding process, a traditional discrete template (manually designed or automatically generated in discrete form) converts the prompt template T = {[P_{0:i}], x, [P_{i+1:m}], y} into:
{e([P_{0:i}]), e(x), e([P_{i+1:m}]), e(y)};
where e(·) represents the pretrained embedding layer;
P-tuning regards [P_i] as a pseudo token mapped to a trainable continuous embedding tensor h_i, converting the prompt template T into:
{h_0, …, h_i, e(x), h_{i+1}, …, h_m, e(y)};
The P-tuning method encodes the prompt with a prompt encoder, using an LSTM (Long Short-Term Memory network) and a two-layer multi-layer perceptron (MLP) with a ReLU activation function to overcome discreteness, so that the actual input vector of the language model is:
{h'_0, …, h'_i, e(x), h'_{i+1}, …, h'_m, e(y)};
where each h'_i is produced by the prompt encoder from the pseudo-token embeddings h_{0:m}.
The improvement of P-tuning v2 over P-tuning is that prompt tokens are added not only at the input layer but at every layer of the language model, bringing more trainable parameters and more direct and significant effect feedback to prompt-fine-tuning-based large model preference training.
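A minimal numpy sketch of the continuous-prompt idea, assuming illustrative dimensions; a real implementation would hook the prompt tensors into the model's attention layers rather than simple concatenation:

```python
# Sketch of P-tuning v2's continuous prompts: one trainable prompt
# tensor per layer is prepended to the hidden states, while the
# pretrained weights stay frozen. Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompt, n_layers, seq_len = 8, 4, 2, 5

# one trainable prompt tensor per layer (the only parameters updated)
prompts = [rng.normal(size=(n_prompt, d_model)) for _ in range(n_layers)]

def with_prompts(x: np.ndarray, layer: int) -> np.ndarray:
    """Prepend the layer's trainable prompt tokens to the hidden states."""
    return np.concatenate([prompts[layer], x], axis=0)

x = rng.normal(size=(seq_len, d_model))
h = with_prompts(x, 0)
print(h.shape)  # (9, 8), i.e. n_prompt + seq_len rows
```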
(2) LoRA (Low-Rank Adaptation) is a low-rank adaptive efficient fine-tuning method aimed at improving the fine-tuning efficiency of a pretrained language model; the flow is shown in fig. 4, where x is the large model input (table data and a natural language question) and y is the large model output (answer). The specific process is as follows:
First, a bypass is added beside the original pretrained language model, performing dimension-reduction and dimension-raising operations through matrix A and matrix B to simulate the intrinsic rank. The dimension-reduction matrix A is initialized with a random Gaussian distribution, and the dimension-raising matrix B is initialized as a zero matrix; the input dimension of A and the output dimension of B are the same as the input and output dimensions of the original pretrained language model. During training, the weight parameters of the pretrained language model are frozen and only the matrices A and B are trained. After training, the product of matrix B and matrix A is merged with the pretrained language model parameters to form the fine-tuned model parameters W = W_0 + BA.
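The bypass construction can be sketched in numpy as follows; dimensions are illustrative:

```python
# Sketch of the LoRA bypass: the frozen weight W0 is augmented by
# B @ A, where A (down-projection) is Gaussian-initialized and B
# (up-projection) starts at zero, so training begins exactly at the
# pretrained model. Only A and B would be trained.
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2                      # model dim and low rank, r << d

W0 = rng.normal(size=(d, d))     # frozen pretrained weight
A = rng.normal(size=(r, d))      # down-projection, random Gaussian init
B = np.zeros((d, r))             # up-projection, zero init

x = rng.normal(size=d)
y_before = W0 @ x + B @ (A @ x)  # identical to W0 @ x at initialization
assert np.allclose(y_before, W0 @ x)

# after training, merge the bypass into the weights
W = W0 + B @ A
print(W.shape)  # (6, 6)
```

Because B starts at zero, the bypass contributes nothing before training, which is what makes the fine-tuning start from the unchanged pretrained behavior.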
(3) The RCI (Row and Column Intersection) method proposes serializing the question together with table rows and columns; it uses a Transformer-based architecture to independently compute the probability that each row and each column of the table contains the answer to the question, and selects the answer closest to the question by intersecting the predicted probabilities of the cells in rows and columns. Verification shows the method obtains excellent results without using a pretraining technique. Specifically:
Let a table have m rows and n columns, with headers [h_1, h_2, …, h_n] and cells c_{ij}, 1 ≤ i ≤ m, 1 ≤ j ≤ n. Better performance can be achieved by incorporating the table structure into the sequence representation; the RCI method flattens the headers and represents row and column sequences as:
T_i^r = C(h_1^:, c_{i1}^|, h_2^:, c_{i2}^|, …, h_n^:, c_{in}^|);
T_j^c = C(h_j^:, c_{1j}^|, c_{2j}^|, …, c_{mj}^|);
where T_i^r represents the i-th row data of the table, T_j^c represents the j-th column data of the table, h^: denotes appending a colon symbol (':') after the header string, c^| denotes prepending a vertical bar ('|') before the cell value string, and C denotes the concatenation operation. For example, consider Table 1 below:
Table 1 Personnel basic information example
| Name | Age | ID number | Home address |
| Zhang San | 18 | 198002323490 | District A1, City A |
| Li Si | 24 | 903855086296 | City B1, Province B |
| Wang Wu | 22 | 472953057295 | City B2, Province C |
The first row of data can be expressed as:
Name: Zhang San | Age: 18 | ID number: 198002323490 | Home address: District A1, City A;
the second column of data can be expressed as:
Age: 18|24|22;
After serializing the table in this manner, it is appended to the question with the standard [CLS] and [SEP] tokens to form a table data-question sequence pair, and the sequence pair is input to the Transformer encoder ALBERT. The output of the [CLS] token, i.e. the vector the Transformer model produces for the [CLS] input token, is used at a linear layer as the question vector representation r_q and the j-th column sequence vector representation r_c respectively; these vectors are concatenated with their element-wise product r_q ∘ r_c and the square of their element-wise difference (r_q − r_c)^2, and a softmax function on a linear layer then gives the probability p(j ∈ T_c) that the column is the target of the question, as shown in fig. 5, expressed as:
r_q = ALBERT(Q);
r_c = ALBERT(T_j^c);
p(j ∈ T_c) = softmax(W · [r_q; r_c; r_q ∘ r_c; (r_q − r_c)^2] + b);
where (r_q − r_c) represents the element-wise difference between the question vector representation and the table sequence vector, W is the weight, and b is the bias.
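The scoring head can be sketched as follows; random vectors stand in for ALBERT outputs, and for simplicity the softmax here is taken over the candidate columns rather than as a per-column binary classifier:

```python
# Sketch of the RCI scoring head: concatenate the question vector r_q
# and a column vector r_c with their element-wise product and squared
# difference, apply a linear layer, then softmax over the columns.
import numpy as np

rng = np.random.default_rng(0)
d, n_cols = 4, 3

def features(r_q, r_c):
    return np.concatenate([r_q, r_c, r_q * r_c, (r_q - r_c) ** 2])

W = rng.normal(size=(4 * d,))    # linear layer weights (placeholder)
b = 0.1                          # bias (placeholder)

r_q = rng.normal(size=d)         # stands in for ALBERT(Q)
cols = [rng.normal(size=d) for _ in range(n_cols)]  # stand in for ALBERT(T_j^c)
logits = np.array([features(r_q, c) @ W + b for c in cols])
probs = np.exp(logits) / np.exp(logits).sum()       # softmax
print(round(float(probs.sum()), 6))  # prints 1.0
```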
In this embodiment, the method for question answering of the large model form includes the following steps:
S1: automated table corpus generation based on template design and rule formulation, as shown in fig. 2;
S11: table data parsing and processing: first, for the different representations of a given structured data table, such as database, HTML, or Markdown formats, different scripts are used for parsing and processing; the table data is uniformly converted into a list-of-lists format by a Python script, so that each field and value in the table data can be extracted later, ensuring the table data can be accurately applied to the templates for corpus generation;
S12: corpus generation template design: a series of corresponding templates is designed and placed in a configuration file for different question-answering tasks such as personnel histories, equipment retrieval, and logistics resource support; these templates define the structure and format of the generated corpus and can include placeholders to be filled with actual data in subsequent steps; a template is an optional fixed text structure designed for a specific task, containing specific fields or variables that can be replaced, and users can freely select and modify corpus generation templates according to their needs;
S13: corpus generation rule formulation: according to the characteristics and structure of the table, corresponding rules are formulated to process data and fill templates, including operations such as data type conversion, value mapping, and text processing; for example, dictionary mapping is used to map numeric keys to corresponding string values, e.g. for key fields such as personnel gender and ethnicity, mapping their numeric representations in the table data to corresponding readable corpus;
S14: corpus generation: the parsed and processed data is filled into the templates according to the designed templates and formulated rules to generate corresponding text descriptions; the corpus generation process maps specific fields to corresponding descriptive words or phrases, producing a natural, fluent text corpus that conforms to grammar rules;
Table-to-knowledge corpus generation can be realized through this automated table corpus generation process, and the resulting corpus can further be used for fine-tuning a large language model, assisting a question-answering knowledge base, and so on;
s2: large model preference learning based on prompt learning and fine tuning;
Prompt learning methods can be divided into two main types: manual template design and automated template generation. The flow of large model preference learning using the corpus obtained by the automated table corpus generation technique is shown in fig. 3;
S21: on the one hand, based on the corpus obtained by the table corpus generation technique, a vector knowledge base can be formed by word2vec embedding; similarity matching is computed against the question text vector, and the most relevant corpus fragments are obtained as auxiliary knowledge to build a Prompt template that helps the large model answer directly in the table question-answering task. On the other hand, the corpus can also be used for model preference learning: in the preference learning process, the original table data and the generated corpus are integrated into a unified large model training dataset, in which each record contains the table data type (personnel, equipment, environment, etc.), the table data, and the text corpus generated in the automated corpus generation step, used respectively as the three fields Type, Input, and Output; the organized data is then filled into a manually designed Prompt template, or used for supervised training of the large model with an automated template method, optimizing the trainable continuous Prompt and generating the optimal Prompt vector, to obtain a domain-specific large model that meets user requirements and can generate related corpora in batches;
S22: after the trainable continuous Prompt is optimized with the automated Prompt v2 method to obtain the optimal Prompt vector, the large language model is fine-tuned with the LoRA fine-tuning method, and the resulting parameters are merged into the large language model;
Assume the model parameters for the table question-answering task are W'; then W' can be expressed as:
W' = W_0 + ΔW;
ΔW = W_P + BA;
where W_0 are the original pretrained model weights, W_P are the continuous Prompt vector parameters injected with the Prompt v2 method, i.e. the optimal Prompt vector, and BA are the model parameters updated by LoRA fine-tuning; A and B are respectively the dimension-reduction and dimension-raising matrices of the bypass added beside the model;
First, the p-tuning method is used to obtain the optimal Prompt through fine-tuning, and then LoRA fine-tunes the model parameters. After adding the Prompt and applying efficient LoRA fine-tuning, the large model does not need full-parameter tuning for table question-answering in different task scenarios; instead, model performance is quickly improved under limited resources by adjusting the intrinsic dimension of the large model and adapting it to the task, while more intelligent and personalized table information can be provided in practical application scenarios to assist user decisions;
s3: a table question-answering auxiliary technology based on row-column semantic capturing;
Accurate answering in the table question-answering task hinges on finding the cells relevant to the question and determining the operation the question requires. Although the RCI method improves performance on table question-answering, its independent row-and-column semantic analysis requires serializing the rows and columns of the table separately, training one model for row sequences and one for column sequences, and computing the answer probabilities for the rows and columns of the table; it thus serializes the same table data twice and increases the complexity of data processing and of training and fine-tuning. Moreover, the method mainly targets the simplest header-based lookup questions, whereas complex table questions usually require operations on the table data, such as maximum or minimum (max/min), counting (count), averaging (average), and comparison (compare). This embodiment proposes a new serialization scheme that preserves the model's understanding of the table while simplifying table data sequence processing, and combines an intention recognition classifier that classifies the user's question intent, helping the large model further understand the structural relationships of the data table;
Let the table have m rows and n columns, the header be [h_1, h_2, …, h_n], and the cells be [v_{i,j}]; the table data T of Table 1 can then be expressed as:
T = [T_1, T_2, …, T_m];
T_i = H(f(h_1) ⊕ g(v_{i,1}) ⊕ f(h_2) ⊕ g(v_{i,2}) ⊕ … ⊕ f(h_n) ⊕ g(v_{i,n}));
wherein T_i represents the i-th row data sequence of the table, f(·) represents appending a colon symbol (':') after the header string, g(·) represents prepending a vertical bar ('|') to the cell value string, ⊕ represents the concatenation operation, and H represents adding separators ('[' and ']') before and after each row's data sequence; that is, according to the proposed approach, the table data can be serialized as:
[name: Zhang San | age: 18 | ID card number: 198002323490 | home address: A1 district, A city] [name: Li Si | age: 24 | ID card number: 903855086296 | home address: B1 city, B province] …;
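A minimal sketch of this row serialization in Python (the function and field names are illustrative, not fixed by the patent):

```python
def serialize_table(headers, rows):
    """Serialize a table row by row per the proposed scheme: each header
    gets a trailing colon (f), each cell value is set off with a vertical
    bar (g), and every row sequence is wrapped in '[' and ']' (H)."""
    sequences = []
    for row in rows:
        # pair each header with its cell value and join the pairs with '|'
        cells = [f"{h}: {v}" for h, v in zip(headers, row)]
        sequences.append("[" + " | ".join(cells) + "]")
    return " ".join(sequences)

headers = ["name", "age", "ID card number", "home address"]
rows = [["Zhang San", "18", "198002323490", "A1 district, A city"],
        ["Li Si", "24", "903855086296", "B1 city, B province"]]
print(serialize_table(headers, rows))
```

Unlike the original RCI serialization, a single pass produces one sequence per row, so the same table need not be serialized twice.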
Using the new serialization scheme, probabilities are computed according to the RCI principle; candidate rows containing the answer are selected via a confidence threshold, aggregated according to the question type predicted by the classifier, and the results are injected into the large language model's input together, training the large language model to generate accurate answers.
Example 2
Based on the above embodiments, in one specific embodiment the overall assisted reasoning process is shown in fig. 6. Taking Table 1 as an example, the table data can be serialized by the new scheme as follows:
[name: Zhang San | age: 18 | ID card number: 198002323490 | home address: A1 district, A city]
[name: Li Si | age: 24 | ID card number: 903855086296 | home address: B1 city, B province]
[name: Wang Wu | age: 22 | ID card number: 472953057295 | home address: C1 city, C province]
Question: how much older is Wang Wu than Zhang San? The confidence threshold is set to 0.6;
The specific process is as follows:
The question vector ALBERT(Q) corresponding to the question is passed through the question classifier, which predicts the operation type: comparison (compare); combining ALBERT(Q) with each row sequence vector of the table, probabilities are computed by the RCI method:
[name: Zhang San | age: 18 | ID card number: 198002323490 | home address: A1 district, A city] 0.85;
[name: Li Si | age: 24 | ID card number: 903855086296 | home address: B1 city, B province] 0.41;
[name: Wang Wu | age: 22 | ID card number: 472953057295 | home address: C1 city, C province] 0.83;
The first and third rows are determined as answer candidate rows from the probabilities and the confidence threshold, and are combined with the question operation type into a large model input prompt, yielding the answer 22-18=4.
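The aggregation the model is expected to perform in this worked example can be sketched deterministically as follows (a simplified stand-in for the large model step; the operation names follow the question types listed above):

```python
def aggregate(candidates, column, op):
    """Aggregate a column's values over the answer-candidate rows
    according to the predicted question type (sketch: 'compare'
    returns the absolute difference between the two candidates)."""
    values = [float(row[column]) for row in candidates]
    if op == "compare":
        return abs(values[0] - values[1])
    if op == "max":
        return max(values)
    if op == "min":
        return min(values)
    if op == "count":
        return len(values)
    if op == "average":
        return sum(values) / len(values)
    raise ValueError(f"unsupported operation: {op}")

# rows 1 and 3 passed the 0.6 confidence threshold (0.85 and 0.83)
candidates = [{"name": "Zhang San", "age": "18"},
              {"name": "Wang Wu", "age": "22"}]
print(aggregate(candidates, "age", "compare"))  # 22 - 18 = 4.0
```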
According to this embodiment, the assisted reasoning approach of table serialization, answer-candidate-row retrieval, and question-type prediction improves the large model's semantic understanding and complex reasoning capability in the table question-answering task, achieving better performance.
In the present specification, the embodiments are described in a progressive manner, each focusing mainly on its differences from the others; for identical and similar parts, the embodiments may refer to one another. Since the device disclosed in an embodiment corresponds to the method disclosed therein, its description is relatively brief, and relevant details may be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (1)
1. The method for asking and answering the large model form is characterized by comprising the following steps:
Step 1: collecting table data, and automatically generating text corpus according to a set corpus generation template and corpus generation rule;
Step 11: collecting table data, analyzing and processing to obtain a list table;
different parsing scripts are applied according to the different representation forms of the given structured data table; the table data are uniformly converted into a list-of-lists format by means of a python script;
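For illustration, such a conversion script might look as follows (CSV and JSON are assumed input forms here; the patent does not fix the formats):

```python
import csv
import json

def to_list_table(path):
    """Normalize differently formatted structured tables (CSV or JSON as
    assumed examples) into a uniform list-of-lists, header row first."""
    if path.endswith(".csv"):
        with open(path, newline="", encoding="utf-8") as f:
            return [row for row in csv.reader(f)]
    if path.endswith(".json"):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)  # assumed: a list of flat dicts
        headers = list(records[0])
        return [headers] + [[str(r[h]) for h in headers] for r in records]
    raise ValueError("unsupported table format")
```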
Step 12: setting a plurality of corpus generating templates according to different question-answering tasks;
the corpus generating template is provided with a corpus structure, a corpus format and placeholders, and the placeholders are replaced according to the form data;
step 13: setting corpus generation rules according to the characteristics and the structure of the table data;
the corpus generation rule comprises data type conversion, value mapping and text processing; the value mapping comprises mapping the numeric keys into corresponding character string values by adopting a dictionary mapping method;
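A minimal sketch of such corpus generation rules; the field names and the numeric-key dictionary below are hypothetical stand-ins:

```python
# Hypothetical dictionary mapping for numeric keys (value mapping rule).
GENDER_MAP = {0: "male", 1: "female"}

def apply_rules(record):
    """Apply the three rule kinds: value mapping, data type conversion,
    and text processing, to one table record."""
    out = dict(record)
    if "gender" in out:
        out["gender"] = GENDER_MAP.get(out["gender"], "unknown")  # value mapping
    if "age" in out:
        out["age"] = str(int(out["age"]))                         # type conversion
    return {k: str(v).strip() for k, v in out.items()}            # text processing

print(apply_rules({"name": " Zhang San ", "gender": 0, "age": 18.0}))
```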
Step 14: filling the list form into the corpus generation template according to the corpus generation rule to generate text corpus; mapping the fields to corresponding description words or phrases to generate text corpus;
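The template filling of step 14 can be sketched as follows (the template text and placeholder names are illustrative, not from the patent):

```python
# Hypothetical corpus generation template with placeholders.
TEMPLATE = "{name} is {age} years old and lives in {home_address}."

def fill_template(template, row):
    """Replace the template's placeholders with the values of one table row
    to produce a text corpus sentence."""
    return template.format(**row)

row = {"name": "Zhang San", "age": "18", "home_address": "A1 district, A city"}
print(fill_template(TEMPLATE, row))
```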
Step 2: according to the text corpus, carrying out preference learning on the language big model based on prompt learning and fine tuning technology to obtain a question-answer model;
Step 21: generating a vector knowledge base by adopting a word2vec model according to the text corpus, and performing similarity matching calculation with a question text vector of a question-answering task to obtain a corpus fragment corresponding to the maximum similarity as auxiliary knowledge;
Step 22: generating a large model training data set according to the table data and the text corpus; the large model training data set comprises table data types, table data and text corpus;
Step 23: taking the corpus fragment as a Prompt, adopting an automatic Promptv2 method of P-tuningv2, performing preference learning training on the language big model according to the big model training data set, and optimizing a trainable continuous Prompt to obtain an optimal Prompt vector;
Step 24: performing fine adjustment on the trained parameters of the language big model through a LoRA fine adjustment method according to the optimal prompt vector to obtain a question-answer model;
assuming that the model parameter facing the form question-answering task is W, W is expressed as:
W = W_pretrain + ΔW + T_prompt
ΔW = M_B M_A
wherein W_pretrain is the original weight of the pre-trained model, T_prompt is the optimal prompt vector, and ΔW is the parameter update obtained by LoRA fine-tuning; M_A and M_B are respectively the bypass dimension-reducing matrix and dimension-raising matrix added alongside the model;
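The weight composition can be illustrated numerically (dimensions are illustrative; note that in practice a prompt vector conditions the model input rather than being added to the weights, but this sketch follows the patent's formula literally):

```python
import numpy as np

d, k, r = 8, 8, 2                          # illustrative dimensions and LoRA rank
rng = np.random.default_rng(0)

W_pretrain = rng.normal(size=(d, k))       # frozen pre-trained weight W_pretrain
M_A = rng.normal(size=(r, k))              # bypass dimension-reducing matrix
M_B = rng.normal(size=(d, r))              # bypass dimension-raising matrix
T_prompt = 0.01 * rng.normal(size=(d, k))  # stand-in for the optimal prompt term

delta_W = M_B @ M_A                        # ΔW = M_B · M_A, rank at most r
W = W_pretrain + delta_W + T_prompt        # W = W_pretrain + ΔW + T_prompt
print(np.linalg.matrix_rank(delta_W))      # at most r = 2
```

The low-rank factorization means only (d + k)·r parameters are trained for ΔW instead of d·k, which is what makes the fine-tuning efficient under limited resources.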
Step 3: acquiring a question-answering task form and a question, and processing the question-answering task form and the question by adopting an improved RCI method to obtain form sequence data and a question type, and obtaining a question-answering answer by utilizing the question-answering model according to the form sequence data and the question type;
The improved RCI method includes new sequence processing and classification prediction; the new sequence processing preserves the model's understanding of the table while simplifying the serialization of table data, and a Transformer-based question classifier is added which, combined with the intention recognition model, classifies the user's question intent;
the specific implementation process of the step 3 is as follows:
Step 31: carrying out serialization processing on the question-answer task form by adopting a new sequence processing formula to obtain form sequence data, and attaching the form sequence data to a question with a CLS and an SEP token to form a form data-question sequence pair;
Step 32: inputting a table data-problem sequence pair into an encoder ALBERT, respectively using the output of a CLS token in a linear layer of the encoder ALBERT as a problem and a vector representation of column data in the table sequence data, connecting the problem vector, the table sequence column vector, the element product corresponding to the problem vector and the table sequence column vector and the square of element difference, and calculating the probability that the table sequence column vector belongs to a problem corresponding target by using a softmax function;
Step 33: selecting answer candidate rows from the table sequence data according to a set confidence threshold and the probability;
step 34: the question vector is processed by the classifier, which predicts the question type;
step 35: combining the answer candidate rows and the question type into an input prompt and inputting it into the question-answer model to obtain the answer;
the new sequence processing formula is expressed as:
T = [T_1, T_2, …, T_m];
T_i = H(f(h_1) ⊕ g(v_{i,1}) ⊕ f(h_2) ⊕ g(v_{i,2}) ⊕ … ⊕ f(h_n) ⊕ g(v_{i,n}));
wherein T represents the table sequence data; T_i represents the i-th data sequence of the table sequence data; m represents the number of rows of the question-answer task table, n represents the number of columns of the question-answer task table, [h_1, h_2, …, h_n] represents the header of the question-answer task table, and [v_{i,j}] represents the cells in the question-answer task table; f(·) represents adding a colon symbol after the header string; g(·) represents adding a vertical bar before the cell value string; ⊕ represents the concatenation operation; H represents adding a separator before and after each row's data sequence of the table sequence data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410382106.6A CN117972070B (en) | 2024-04-01 | 2024-04-01 | Large model form question-answering method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117972070A CN117972070A (en) | 2024-05-03 |
CN117972070B true CN117972070B (en) | 2024-06-18 |
Family
ID=90846469
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410382106.6A Active CN117972070B (en) | 2024-04-01 | 2024-04-01 | Large model form question-answering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117972070B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110054A (en) * | 2019-03-22 | 2019-08-09 | 北京中科汇联科技股份有限公司 | A method of obtaining question and answer pair in the slave non-structured text based on deep learning |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220044134A1 (en) * | 2020-08-10 | 2022-02-10 | Accenture Global Solutions Limited | Intelligent question answering on tabular content |
CN114020862B (en) * | 2021-11-04 | 2024-06-11 | 中国矿业大学 | Search type intelligent question-answering system and method for coal mine safety regulations |
CN115062070A (en) * | 2022-05-30 | 2022-09-16 | 中国电子科技集团公司第十研究所 | Question and answer based text table data query method |
CN115455935A (en) * | 2022-09-14 | 2022-12-09 | 华东师范大学 | Intelligent text information processing system |
CN115794871A (en) * | 2022-12-07 | 2023-03-14 | 电子科技大学 | Table question-answer processing method based on Tapas model and graph attention network |
CN116303971A (en) * | 2023-03-29 | 2023-06-23 | 重庆交通大学 | Few-sample form question-answering method oriented to bridge management and maintenance field |
CN117743526A (en) * | 2023-11-21 | 2024-03-22 | 浙江大学计算机创新技术研究院 | Table question-answering method based on large language model and natural language processing |
Non-Patent Citations (1)
Title |
---|
Research on question-answering systems based on semantic templates; Liang Zhengping et al.; Journal of Shenzhen University (Science and Engineering); 2007-07-31; Vol. 24, No. 03; pp. 281-285 *
Also Published As
Publication number | Publication date |
---|---|
CN117972070A (en) | 2024-05-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |