CN117972070B - Large model form question-answering method

Info

Publication number: CN117972070B
Application number: CN202410382106.6A
Authority: CN (China)
Prior art keywords: question, data, model, corpus, answer
Other languages: Chinese (zh)
Other versions: CN117972070A
Inventors: 郝韫宏, 唐海超, 李孟书, 王立才, 胡勋, 罗琪彬, 宋浩楠, 乔思龙, 曹杨
Current Assignee: CETC 15 Research Institute
Original Assignee: CETC 15 Research Institute
Priority date / Filing date: 2024-04-01
Application filed by CETC 15 Research Institute
Publication of CN117972070A: 2024-05-03
Application granted; publication of CN117972070B: 2024-06-18
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a large model form question-answering method, which relates to the technical field of machine learning and comprises the following steps: collecting table data and automatically generating a text corpus according to a set corpus-generation template and corpus-generation rules; performing preference learning on the large model based on prompt learning and fine-tuning techniques according to the text corpus to obtain a question-answering model; acquiring a question-answering task table and a question, processing them with an improved RCI method to obtain table sequence data and a question type, and obtaining the answer from the question-answering model according to the table sequence data and the question type. By generating corpus automatically through template design and rule formulation, a better large model is trained, and the accuracy and effectiveness of table question answering are improved based on row-and-column-combined table serialization semantic parsing and the large model.

Description

Large model form question-answering method
Technical Field
The invention relates to the technical field of machine learning, in particular to a large model form question-answering method.
Background
Table data, a structured storage format with high knowledge density, is widely used in fields such as business, education, medical care, and the military. Large language models trained on massive unsupervised corpora have made significant progress on general tasks, but challenges remain in table question-answering applications, mainly insufficient reasoning ability on complex questions, biased semantic understanding of table structure, and the high cost of learning user preferences. Building on the rapid development of deep learning and improvements in model architecture, large language models acquire the ability to learn, understand, and process natural language through training on large-scale unsupervised corpora, achieving excellent performance on various natural language processing tasks such as text generation and sentiment analysis. Table question answering is one of the important applications of large language models: it can help users quickly obtain the information they need, speed up data management and processing, and provide more comprehensive support for decision analysis across industries. Prompt learning and fine-tuning are currently the most widely adopted techniques for improving the performance of large models on specific tasks with as few resources as possible.
At present, prompt-learning techniques mainly include manual template design, automated discrete template generation, and continuous template generation, while the mainstream methods for fine-tuning a large model are three: the Freeze method, the P-Tuning method, and the LoRA method. Table question answering based on large language models faces the following difficulties in practice: 1. large models struggle to understand table data correctly, including understanding the data relationships in the table and accurately identifying entities and attributes; 2. in applications, it is difficult for large model table question answering to generate answers that meet actual requirements and user preferences, and one must consider how to improve the table question-answering performance of the large model as much as possible under constraints such as limited data and hardware resources.
Therefore, how to improve the accuracy and effect of form questions and answers under the condition of limited resources is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a large model form question-answering method. To address the problem of the large model correctly understanding table semantics, an automated corpus-generation scheme based on template design and rule formulation is provided; the resulting corpus can be used to build a table question-answering knowledge base and to fine-tune the large model. Data-driven adaptation of the large model to a specific task under limited resources is achieved through cutting-edge techniques such as prompt learning and LoRA fine-tuning, and the large model's table question-answering capability is improved based on row-and-column-combined table serialization semantic parsing and question classification.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A large model form question-answering method comprises the following steps:
Step 1: collecting table data and automatically generating a text corpus according to a set corpus-generation template and corpus-generation rules;
Step 2: performing preference learning on the large language model based on prompt learning and fine-tuning techniques according to the text corpus to obtain a question-answering model;
Step 3: acquiring a question-answering task table and a question, processing them with an improved RCI method to obtain table sequence data and a question type, and obtaining the answer from the question-answering model according to the table sequence data and the question type.
The technical effect of this scheme is as follows: the automated corpus-generation scheme based on template design and rule formulation can be used to build a table question-answering knowledge base and to fine-tune the large model; data-driven adaptation of the large model to a specific task under limited resources is achieved through prompt learning and fine-tuning techniques; and table serialization semantic parsing and question classification with combined rows and columns are realized by the improved RCI method, improving the large model's table question-answering capability.
Preferably, step 1 is implemented as follows:
Step 11: collecting table data, parsing and processing it to obtain a list-of-lists table;
Step 12: setting several corpus-generation templates according to different question-answering tasks and placing them in a configuration file;
Step 13: setting corpus-generation rules according to the characteristics and structure of the table data and placing them in a configuration file;
Step 14: filling the list-of-lists table into the corpus-generation templates according to the corpus-generation rules to generate the text corpus.
Preferably, the corpus-generation template defines the corpus structure and format and further comprises placeholders, which are replaced according to the table data.
Preferably, the corpus-generation rules comprise operations such as data-type conversion, value mapping, and text processing; the value mapping includes mapping numeric keys to corresponding string values using a dictionary mapping method.
Preferably, step 2 is implemented as follows:
Step 21: generating a vector knowledge base from the text corpus using a word2vec model, performing similarity-matching computation against the question text vector of the question-answering task, and obtaining the corpus fragment most relevant to the question-answering task according to the maximum similarity;
Step 22: generating a large-model training dataset from the table data and the text corpus;
Step 23: generating an optimal prompt vector from the corpus fragment by a manual design method or the automated prompt method of P-tuning v2; wherein
the manual design method generates the optimal prompt vector from the corpus fragment according to the requirements;
the automated P-tuning v2 method takes the corpus fragment as a prompt, performs preference-learning training on the large language model according to the large-model training dataset, and optimizes the trainable discrete and continuous prompts to obtain the optimal prompt vector;
Step 24: fine-tuning the parameters of the trained large language model with the LoRA fine-tuning method according to the continuous prompt in the optimal prompt vector to obtain the question-answering model.
The technical effect of this scheme is that, after the prompt is injected and the model undergoes efficient LoRA fine-tuning, the large model no longer needs full-parameter tuning for table question answering under different task scenarios; instead, model performance under limited resources is rapidly improved by adjusting the intrinsic dimension of the large model and adapting it to the task, while more intelligent and personalized table information can be provided in practical application scenarios to assist user decisions.
Preferably, the large-model training dataset includes the table data type, the table data, and the text corpus, used respectively as the three fields Type, Input, and Output in large-model training.
Preferably, the improved RCI method includes new sequence processing and classification prediction, and step 3 is implemented as follows:
Step 31: serializing the question-answering task table with the new sequence processing formula to obtain table sequence data, and appending the table sequence data to the question with the standard [CLS] and [SEP] tokens to form a table-data/question sequence pair;
Step 32: inputting the table-data/question sequence pair into the Transformer encoder ALBERT; using the outputs of the [CLS] token at the encoder's linear layer as the vector representations of the question and of the j-th column sequence of the table sequence data, respectively; concatenating the question vector, the table-sequence column vector, their element-wise product, and the square of their element-wise difference; and computing with a softmax function the probability that the table-sequence column vector belongs to the question's target;
Step 33: selecting answer candidate rows from the table sequence data according to a set confidence threshold and the probabilities;
Step 34: passing the question vector through a Transformer-based classifier in the classification-prediction step to obtain the question type;
Step 35: combining the answer candidate rows and the question type into an input prompt and feeding it to the question-answering model to obtain the answer.
Preferably, the new sequence processing formula is expressed as:

$T = H(t_1) \oplus H(t_2) \oplus \cdots \oplus H(t_m)$

$t_i = C(h_1) \oplus D(v_{i,1}) \oplus C(h_2) \oplus D(v_{i,2}) \oplus \cdots \oplus C(h_n) \oplus D(v_{i,n})$

where T denotes the table sequence data; $t_i$ denotes the i-th data sequence of the table sequence data; $C(\cdot)$ appends a colon symbol (':') after a header string; $D(\cdot)$ prepends a vertical bar ('|') before a cell-value string; $\oplus$ denotes the concatenation operation; $H(\cdot)$ adds delimiters ('[' and ']') before and after each row data sequence of the table sequence data; m denotes the number of rows and n the number of columns of the question-answering task table; $[h_1, h_2, \ldots, h_n]$ denotes the header and $v_{i,j}$ the cells of the question-answering task table.
The technical effect of this scheme is that probabilities are computed with the improved RCI method and a confidence threshold yields the candidate rows containing the answer; these are aggregated according to the question type predicted by the classifier, and the combined result is injected into the input of the large language model, which is trained to generate accurate answers. The improved RCI method preserves the model's understanding of the table while simplifying the table's data-sequence processing, and, combined with the intention-recognition model, classifies the user's questioning intention, helping the large model further understand the structural relations of the data table.
Compared with the prior art, the invention discloses a large model form question-answering method. On the premise of accounting for table diversity, a corpus-generation scheme with unified steps of data processing, template design, and rule formulation is provided. Combined with cutting-edge large-model prompt fine-tuning techniques, user preference requirements under different scenarios are met quickly and with few resources: by injecting trainable prompts and applying low-rank fine-tuning under limited-resource scenarios, the model can generate answers that satisfy specific user preferences, so that high-quality knowledge corpora can be generated in batches from table corpora through the strong learning potential of the large model; the resulting corpus can serve a knowledge base that assists the model in question answering, and can also be used to build training datasets for model preference learning. For table semantic-understanding bias, a new table serialization scheme is proposed for retrieving the cells relevant to a question based on the RCI row/column semantic-capture technique, avoiding serializing the table twice; combined with a Transformer-based classifier that predicts the operation type of the question, the candidate data-row sequences among the cells are determined according to the predicted operation type, strengthening the model's understanding of the table task; and the capability on table question-answering tasks is improved by constructing the candidate data-row sequences and the predicted question type as joint input that assists the large language model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the large-model table question-answering flow provided by the invention;
FIG. 2 is a schematic diagram of the automated table corpus generation process provided by the invention;
FIG. 3 is a schematic diagram of the prompt-learning and fine-tuning based large-model preference learning flow provided by the invention;
FIG. 4 is a schematic diagram of the LoRA fine-tuning process provided by the invention;
FIG. 5 is a schematic diagram of the RCI interaction model architecture provided by the invention;
FIG. 6 is a flow chart of the RCI and question-classifier based large-model assisted reasoning provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a large model form question-answering method, the flow of which is shown in FIG. 1. The following prior-art techniques are involved:
(1) The automated prompt method of P-tuning v2:
Assume the language model is $\mathcal{M}$, the i-th prompt token in the prompt template (abbreviated T) is denoted $[P_i]$, the context input is $x$, and the target output is $y$. When mapping vectors in the token-embedding process, a traditional discrete template (manually designed or automatically generated in discrete form) converts the prompt template T into:

$T = \{ e([P_{0:i}]),\ e(x),\ e([P_{i+1:m}]),\ e(y) \}$

where $e(\cdot)$ denotes the pre-trained embedding layer.

P-tuning regards $[P_i]$ as a pseudo token and, using a trainable continuous embedding tensor $h_i$, converts the prompt template T into:

$T = \{ h_0, \ldots, h_i,\ e(x),\ h_{i+1}, \ldots, h_m,\ e(y) \}$

The P-tuning method encodes the prompt with a prompt encoder, using an LSTM (Long Short-Term Memory network) and a two-layer multilayer perceptron (MLP) with ReLU activation to overcome discreteness, so that the actual continuous prompt vectors fed to the language model are:

$h_i = \mathrm{MLP}\big( [\, \mathrm{LSTM}(h_{0:i}) ;\ \mathrm{LSTM}(h_{i:m}) \,] \big), \quad 0 \le i \le m$
P-tuning v2 improves on P-tuning by adding prompt tokens not only at the input layer but at every layer of the language model, bringing more trainable parameters and more direct, significant effect feedback to prompt-fine-tuning-based large-model preference training.
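For illustration, a minimal PyTorch sketch of such a prompt encoder (a bi-LSTM followed by a ReLU MLP over trainable pseudo-token embeddings) is given below; the class name, sizes, and layer counts are assumptions for exposition, not part of the disclosed method:

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Encode trainable pseudo-token embeddings [P_i] into continuous
    prompt vectors h_i with a bi-LSTM + two-layer ReLU MLP (P-tuning style)."""

    def __init__(self, num_prompt_tokens: int, embed_dim: int):
        super().__init__()
        # Trainable continuous embedding tensor for the m pseudo tokens.
        self.pseudo_embeddings = nn.Parameter(
            torch.randn(num_prompt_tokens, embed_dim))
        # Bidirectional LSTM; embed_dim is assumed even so that the
        # concatenated forward/backward states match embed_dim again.
        self.lstm = nn.LSTM(embed_dim, embed_dim // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, embed_dim),
                                 nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self) -> torch.Tensor:
        h, _ = self.lstm(self.pseudo_embeddings.unsqueeze(0))  # (1, m, d)
        return self.mlp(h).squeeze(0)  # (m, d): prompt vectors h_0 .. h_m
```

The returned vectors replace $e([P_i])$ in the input sequence while the rest of the language model's embeddings stay untouched.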
(2) LoRA (Low-Rank Adaptation) is a low-rank adaptive efficient fine-tuning method that aims to improve the fine-tuning efficiency of a pre-trained language model; the flow is shown in FIG. 4, where x is the large-model input (table data and natural-language question) and y is the large-model output (answer). The specific process is as follows:
First, a bypass is added beside the original pre-trained language model, and the matrices $M_A$ and $M_B$ perform dimension-reduction and dimension-raising operations to simulate the intrinsic rank. $M_A$ is the dimension-reduction matrix, initialized with a random Gaussian distribution; $M_B$ is the dimension-raising matrix, initialized as a zero matrix. The input dimension of $M_A$ and the output dimension of $M_B$ are the same as the input and output dimensions of the original pre-trained language model. During training, the weight parameters of the pre-trained language model are frozen and only the matrices $M_A$ and $M_B$ are trained. After training, the matrix product $M_B M_A$ is merged with the pre-trained language model parameters to form the fine-tuned model parameters $W' = W_{\mathrm{pretrain}} + M_B M_A$.
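As an illustrative sketch only (an assumed minimal PyTorch rendering of the scheme, not the patent's implementation), a frozen linear layer with the trainable low-rank bypass $M_B M_A$ and a post-training merge step could look like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight plus trainable low-rank bypass M_B @ M_A."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze pre-trained weights
            p.requires_grad = False
        # M_A: dimension reduction, random Gaussian init;
        # M_B: dimension raising, zero init, so Delta W starts at zero.
        self.M_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.M_B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.M_A.T @ self.M_B.T

    def merge(self) -> None:
        # After training, fold Delta W = M_B @ M_A into the base weight
        # (once merged, the bypass must no longer be applied in forward).
        self.base.weight.data += self.M_B @ self.M_A
```

Only M_A and M_B receive gradients, which is what keeps the fine-tuning cost low.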
(3) The RCI (Row and Column Intersection) method proposes serializing the question together with the table's rows and columns. It uses a Transformer-based architecture to independently compute, for each row and each column of the table, the probability that it contains the answer to the question, and selects the answer closest to the question by intersecting the predicted cell probabilities of rows and columns. It has been verified that the method achieves excellent results without using any pre-training technique. The method is as follows:
Let a table have m rows and n columns, with header $[h_1, h_2, \ldots, h_n]$ and cells denoted $v_{i,j}$, $1 \le i \le m$, $1 \le j \le n$. Better performance can be achieved by merging the table structure into the sequence representation; the RCI method flattens the header and represents the row and column sequences as:

$t_i = C(h_1) \oplus D(v_{i,1}) \oplus C(h_2) \oplus D(v_{i,2}) \oplus \cdots \oplus C(h_n) \oplus D(v_{i,n})$

$c_j = C(h_j) \oplus D(v_{1,j}) \oplus D(v_{2,j}) \oplus \cdots \oplus D(v_{m,j})$

where $t_i$ denotes the i-th row data of the table, $c_j$ denotes the j-th column data of the table, $C(\cdot)$ appends a colon symbol (':') after a header string, $D(\cdot)$ prepends a vertical bar ('|') before a cell-value string, and $\oplus$ denotes the concatenation operation. For example, consider Table 1 below:
Table 1. Personnel basic information example

| Name      | Age | ID number    | Home address        |
| Zhang San | 18  | 198002323490 | District A1, City A |
| Li Si     | 24  | 903855086296 | City B1, Province B |
| Wang Wu   | 22  | 472953057295 | City B2, Province C |
The first row of data can be expressed as:

Name: Zhang San | Age: 18 | ID number: 198002323490 | Home address: District A1, City A

The second column of data can be expressed as:

Age: 18|24|22
After serializing the table in this manner, it is appended to the question with the standard [CLS] and [SEP] tokens, forming a table-data/question sequence pair that is input into the Transformer encoder ALBERT. The outputs of the [CLS] token at the encoder's linear layer, i.e., the vectors the Transformer model generates for the [CLS] input token, are used as the question's vector representation $r_q$ and the j-th column sequence's vector representation $r_c$, respectively; these vectors are concatenated with their element-wise product $r_q \circ r_c$ and the square of their element-wise difference $(r_q - r_c)^2$, and a softmax function over a linear layer then gives the probability $p(j \in T_c)$ that the column is the target of the question, as shown in FIG. 5:

$p(j \in T_c) = \mathrm{softmax}\big( W [\, r_q ;\ r_c ;\ r_q \circ r_c ;\ (r_q - r_c)^2 \,] + b \big)$

where $(r_q - r_c)^2$ denotes the square of the element-wise difference between the question vector representation and the table sequence vector, W is the weight, and b is the bias.
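A minimal sketch of this scoring head is given below, assuming PyTorch and that $r_q$ and $r_c$ are the ALBERT [CLS] vectors described above (illustrative only):

```python
import torch
import torch.nn as nn

class RCIScorer(nn.Module):
    """p(j in T_c) = softmax(W [r_q; r_c; r_q*r_c; (r_q-r_c)^2] + b)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(4 * hidden_dim, 2)  # W and b in the formula

    def forward(self, r_q: torch.Tensor, r_c: torch.Tensor) -> torch.Tensor:
        features = torch.cat([r_q, r_c, r_q * r_c, (r_q - r_c) ** 2], dim=-1)
        # Return the probability that this column/row contains the target.
        return torch.softmax(self.linear(features), dim=-1)[..., 1]
```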
In this embodiment, the large model form question-answering method includes the following steps:
S1: automated table corpus generation based on template design and rule formulation, as shown in FIG. 2;
S11: table data parsing and processing: first, different scripts are used to parse and process the given structured data table according to its representation, such as database format, HTML format, or Markdown format; the table data is uniformly converted into a list-of-lists format by a Python script, so that each field and value in the table data can be extracted later, ensuring that the table data is accurately applied to the templates for corpus generation (a parsing sketch is given below);
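As an illustrative sketch (the function name and parsing rules are hypothetical, and only the Markdown case is shown), such a script converts a table into the unified list-of-lists format:

```python
def parse_markdown_table(text: str) -> list[list[str]]:
    """Parse a Markdown table into a list of lists, header row first."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip().strip("|")
        # Skip the |---|---| separator row between header and body.
        if set(line.replace("|", "").strip()) <= {"-", ":", " "}:
            continue
        rows.append([cell.strip() for cell in line.split("|")])
    return rows

table = parse_markdown_table("""
| Name | Age |
| ---- | --- |
| Zhang San | 18 |
""")
# [['Name', 'Age'], ['Zhang San', '18']]
```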
S12: corpus-generation template design: a series of corresponding templates are designed for different question-answering tasks, such as personnel histories, equipment retrieval, and logistics resource support, and placed in a configuration file; these templates define the structure and format of the generated corpus, and placeholders can be included in the templates so that they can be filled with actual data in subsequent steps; a template is an optional fixed text structure designed for a specific task, containing specific fields or variables that can be replaced, and users can freely select and modify corpus-generation templates according to their needs;
S13: corpus-generation rule formulation: corresponding rules are formulated according to the characteristics and structure of the table to process the data and fill the templates; the rules include operations such as data-type conversion, value mapping, and text processing; for example, a dictionary mapping method maps numeric keys to corresponding string values, so that key fields such as personnel gender and ethnicity, represented numerically in the table data, are mapped to the corresponding readable corpus;
S14: corpus generation: the parsed and processed data is filled into the templates according to the designed templates and formulated rules to generate the corresponding text descriptions; the corpus-generation process maps specific fields to corresponding descriptive words or phrases, producing a natural, fluent text corpus that conforms to grammar rules, as sketched below;
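For illustration, a minimal sketch of S12-S14 combined (the template text, field names, and gender mapping are hypothetical examples, not values prescribed by the method):

```python
# Corpus-generation rule: dictionary mapping of numeric keys (S13).
GENDER_MAP = {1: "male", 2: "female"}

# Corpus-generation template with placeholders (S12).
TEMPLATE = ("{name} is a {age}-year-old {gender} person; the ID number is "
            "{id_number} and the home address is {address}.")

def generate_corpus(rows: list[dict]) -> list[str]:
    """Apply the rules to each parsed row and fill the template (S14)."""
    corpus = []
    for row in rows:
        record = dict(row)
        record["age"] = int(record["age"])            # data-type conversion
        record["gender"] = GENDER_MAP.get(record.get("gender"), "unknown")
        corpus.append(TEMPLATE.format(**record))      # placeholder filling
    return corpus

print(generate_corpus([{"name": "Zhang San", "age": "18", "gender": 1,
                        "id_number": "198002323490",
                        "address": "District A1, City A"}])[0])
```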
Table-data-to-knowledge-corpus generation can thus be realized through this automated table corpus generation process, and the resulting corpus can further be used for fine-tuning a large language model, assisting a question-answering knowledge base, and the like;
s2: large model preference learning based on prompt learning and fine tuning;
Prompt-learning methods fall into two main categories, manual template design and automated template generation; the flow of large-model preference learning using the corpus obtained by the automated table corpus generation technique is shown in FIG. 3;
S21: on the one hand, based on the corpus obtained by the table corpus generation technique, a vector knowledge base can be built through word2vec embedding; similarity-matching computation against the question text vector then yields the most relevant corpus fragments, which serve as auxiliary knowledge for building a prompt template that assists the large model in answering table questions directly (a retrieval sketch is given below). On the other hand, the corpus can also be used for model preference learning: in the preference-learning process, the original table data and the generated corpus are integrated into a unified large-model training dataset, where each record contains the table data type (personnel, equipment, environment, etc.), the table data, and the text corpus generated in the automated table corpus generation step, used respectively as the three fields Type, Input, and Output; the organized data is then either filled into a manually designed prompt template, or used for supervised training of the large model with an automated prompt method, optimizing the trainable continuous prompt and generating the optimal prompt vector, which yields a domain-specific large model that meets user requirements and can generate related corpora in batches;
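A sketch of the retrieval side of S21 is given below, assuming gensim's word2vec with mean-pooled text vectors and cosine similarity (the tokenization and the fragment contents are illustrative):

```python
import numpy as np
from gensim.models import Word2Vec

# Pre-tokenized corpus fragments from the automated generation step.
fragments = [["Zhang", "San", "is", "18", "years", "old"],
             ["Li", "Si", "lives", "in", "City", "B1"]]
model = Word2Vec(sentences=fragments, vector_size=64, min_count=1)

def embed(tokens: list[str]) -> np.ndarray:
    """Mean-pool word vectors into a single text vector (zeros if OOV)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

def most_relevant(question_tokens: list[str]) -> list[str]:
    """Return the corpus fragment with maximum cosine similarity."""
    q = embed(question_tokens)
    def cos(fragment):
        v = embed(fragment)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    return max(fragments, key=cos)

print(most_relevant(["How", "old", "is", "Zhang", "San"]))
```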
S22: after optimizing the trainable continuous prompt with the automated prompt method of P-tuning v2 to obtain the optimal prompt vector, the large language model is fine-tuned with the LoRA fine-tuning method and the resulting parameters are added to the large language model;
Assume the model parameters for the table question-answering task are W; then W can be expressed as:

$W = W_{\mathrm{pretrain}} + \Delta W + T_{\mathrm{prompt}}$

$\Delta W = M_B M_A$

where $W_{\mathrm{pretrain}}$ is the original weight of the pre-trained model, $T_{\mathrm{prompt}}$ is the continuous prompt vector parameter injected by the P-tuning v2 prompt method, i.e., the optimal prompt vector, and $\Delta W$ is the model's own parameter update obtained by LoRA fine-tuning; $M_A$ and $M_B$ are, respectively, the dimension-reduction and dimension-raising matrices of the bypass added beside the model;
First, the P-tuning method is used to obtain the optimal prompt, and then LoRA fine-tunes the model parameters. After adding the prompt and performing efficient LoRA fine-tuning, the large model no longer needs full-parameter tuning for table question answering under different task scenarios; instead, model performance under limited resources is rapidly improved by adjusting the intrinsic dimension of the large model and adapting it to the task, while more intelligent and personalized table information can be provided in practical application scenarios to assist user decisions;
s3: a table question-answering auxiliary technology based on row-column semantic capturing;
Accurate answering in the table question-answering task hinges on finding the cells related to the question and determining the operation the question requires. Although the RCI method improves performance on table question answering, its independent row/column semantic analysis requires serializing the rows and columns of the table separately and training one model for row sequences and another for column sequences to compute the probabilities of the answer lying in a row or a column; the same table data is therefore serialized twice, increasing the complexity of data processing and of training and fine-tuning. Moreover, the method mainly targets the simplest header-based lookup questions, while complex table questions usually require operations on the table data such as maximum/minimum (max/min), counting (count), averaging (average), and comparison (compare). This embodiment therefore proposes a new serialization scheme that preserves the model's understanding of the table while simplifying table-data sequence processing, and adds a Transformer-based question classifier that, combined with an intention-recognition model, classifies the user's questioning intention, helping the large model further understand the structural relations of the data table and improving its capability on new query-type table question-answering tasks;
Let the table have m rows and n columns, with header $[h_1, h_2, \ldots, h_n]$ and cells denoted $v_{i,j}$; the table data T of Table 1 can then be expressed as:

$T = H(t_1) \oplus H(t_2) \oplus \cdots \oplus H(t_m)$

$t_i = C(h_1) \oplus D(v_{i,1}) \oplus C(h_2) \oplus D(v_{i,2}) \oplus \cdots \oplus C(h_n) \oplus D(v_{i,n})$

where $t_i$ denotes the i-th row data of the table, $C(\cdot)$ appends a colon symbol (':') after a header string, $D(\cdot)$ prepends a vertical bar ('|') before a cell-value string, $\oplus$ denotes the concatenation operation, and $H(\cdot)$ adds delimiters ('[' and ']') before and after each row data sequence; that is, according to the proposed scheme, the table data can be serialized as:
[Name: Zhang San | Age: 18 | ID number: 198002323490 | Home address: District A1, City A][Name: Li Si | Age: 24 | ID number: 903855086296 | Home address: City B1, Province B]…;
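This serialization can be sketched in a few lines of Python (illustrative; the spacing around ':' and '|' follows the examples above):

```python
def serialize_table(header: list[str], rows: list[list[str]]) -> str:
    """Serialize a table once, row-wise: 'header: value' cells joined by
    '|', each row wrapped in '[' and ']' per the new sequence formula."""
    return "".join(
        "[" + " | ".join(f"{h}: {v}" for h, v in zip(header, row)) + "]"
        for row in rows)

header = ["Name", "Age", "ID number", "Home address"]
rows = [["Zhang San", "18", "198002323490", "District A1, City A"],
        ["Li Si", "24", "903855086296", "City B1, Province B"]]
print(serialize_table(header, rows))
```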
With the new serialization, probabilities are computed following the RCI principle, candidate rows containing the answer are obtained with a confidence threshold, the results are aggregated according to the question type predicted by the classifier, and the combined result is injected into the large language model input so that the model is trained to generate accurate answers.
Example 2
Based on the above embodiments, in one specific embodiment the overall assisted-reasoning flow is shown in FIG. 6. Taking Table 1 as an example, the table data can be serialized by the new scheme as:

[Name: Zhang San | Age: 18 | ID number: 198002323490 | Home address: District A1, City A]

[Name: Li Si | Age: 24 | ID number: 903855086296 | Home address: City B1, Province B]

[Name: Wang Wu | Age: 22 | ID number: 472953057295 | Home address: City B2, Province C]
Question: how much older is Wang Wu than Zhang San? The confidence threshold is set to 0.6;
The specific process is as follows:
The question vector ALBERT(Q) corresponding to the question is passed through the question classifier, which predicts the operation type: comparison (compare); combining ALBERT(Q) with each row sequence vector of the table, the RCI method computes the probabilities:

[Name: Zhang San | Age: 18 | ID number: 198002323490 | Home address: District A1, City A] 0.85;

[Name: Li Si | Age: 24 | ID number: 903855086296 | Home address: City B1, Province B] 0.41;

[Name: Wang Wu | Age: 22 | ID number: 472953057295 | Home address: City B2, Province C] 0.83;
According to the probabilities and the confidence threshold, the first and third rows are determined as answer candidate rows and are assembled together with the question operation type into the large-model input prompt, yielding the answer 22 - 18 = 4 (a sketch of this assembly is given below).
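A sketch of this final assembly step, assuming the row probabilities from the RCI scorer and the classifier's predicted operation type (the prompt wording is hypothetical):

```python
def build_prompt(scored_rows: list[tuple[str, float]], question: str,
                 op_type: str, threshold: float = 0.6) -> str:
    """Keep rows whose probability clears the confidence threshold and
    assemble them with the predicted operation type into the LLM input."""
    candidates = [row for row, p in scored_rows if p >= threshold]
    return (f"Question: {question}\nOperation type: {op_type}\n"
            "Candidate rows:\n" + "\n".join(candidates))

scored = [("[Name: Zhang San | Age: 18 | ...]", 0.85),
          ("[Name: Li Si | Age: 24 | ...]", 0.41),
          ("[Name: Wang Wu | Age: 22 | ...]", 0.83)]
print(build_prompt(scored, "How much older is Wang Wu than Zhang San?",
                   "compare"))
# The question-answering model then reasons over the two retained rows:
# 22 - 18 = 4.
```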
According to this embodiment, the assisted-reasoning scheme of retrieving answer candidate rows via the new table serialization and predicting the question type improves the semantic understanding and complex reasoning capability of the large model in table question-answering tasks, achieving better performance.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (1)

1. A large model form question-answering method, characterized by comprising the following steps:
Step 1: collecting table data, and automatically generating text corpus according to a set corpus generation template and corpus generation rule;
Step 11: collecting table data, parsing and processing it to obtain a list-of-lists table;
parsing and processing with different scripts according to the different representations of the given structured data table; uniformly converting the table data into a list-of-lists format through a Python script;
Step 12: setting a plurality of corpus generating templates according to different question-answering tasks;
the corpus generating template is provided with a corpus structure, a corpus format and placeholders, and the placeholders are replaced according to the form data;
step 13: setting corpus generation rules according to the characteristics and the structure of the table data;
the corpus generation rule comprises data type conversion, value mapping and text processing; the value mapping comprises mapping the numeric keys into corresponding character string values by adopting a dictionary mapping method;
Step 14: filling the list-of-lists table into the corpus-generation template according to the corpus-generation rules to generate the text corpus; mapping the fields to corresponding descriptive words or phrases to generate the text corpus;
Step 2: according to the text corpus, performing preference learning on the large language model based on prompt learning and fine-tuning techniques to obtain a question-answering model;
Step 21: generating a vector knowledge base by adopting a word2vec model according to the text corpus, and performing similarity matching calculation with a question text vector of a question-answering task to obtain a corpus fragment corresponding to the maximum similarity as auxiliary knowledge;
Step 22: generating a large model training data set according to the table data and the text corpus; the large model training data set comprises table data types, table data and text corpus;
Step 23: taking the corpus fragment as a prompt, adopting the automated prompt method of P-tuning v2, performing preference-learning training on the large language model according to the large-model training dataset, and optimizing the trainable continuous prompt to obtain an optimal prompt vector;
Step 24: fine-tuning the parameters of the trained large language model with the LoRA fine-tuning method according to the optimal prompt vector to obtain the question-answering model;
assuming the model parameter facing the table question-answering task is W, W is expressed as:

$W = W_{\mathrm{pretrain}} + \Delta W + T_{\mathrm{prompt}}$

$\Delta W = M_B M_A$

where $W_{\mathrm{pretrain}}$ is the original weight of the pre-trained model, $T_{\mathrm{prompt}}$ is the optimal prompt vector, and $\Delta W$ is the model's own parameter update obtained by LoRA fine-tuning; $M_A$ and $M_B$ are, respectively, the dimension-reduction and dimension-raising matrices of the bypass added beside the model;
Step 3: acquiring a question-answering task table and a question, processing them with an improved RCI method to obtain table sequence data and a question type, and obtaining the answer from the question-answering model according to the table sequence data and the question type;
The improved RCI method includes new sequence processing and classification prediction; the new sequence processing preserves the model's understanding of the table while simplifying the table's data-sequence processing, and a Transformer-based question classifier is added, classifying the user's questioning intention in combination with the intention-recognition model;
the specific implementation process of step 3 is as follows:
Step 31: serializing the question-answering task table with the new sequence processing formula to obtain table sequence data, and appending the table sequence data to the question with the [CLS] and [SEP] tokens to form a table-data/question sequence pair;
Step 32: inputting the table-data/question sequence pair into the encoder ALBERT; using the outputs of the [CLS] token at the encoder's linear layer as the vector representations of the question and of the column data in the table sequence data, respectively; concatenating the question vector, the table-sequence column vector, their element-wise product, and the square of their element-wise difference; and computing with a softmax function the probability that the table-sequence column vector belongs to the question's corresponding target;
Step 33: selecting answer candidate rows from the table sequence data according to a set confidence threshold and the probabilities;
Step 34: passing the question vector through the classifier to predict the question type;
Step 35: combining the answer candidate rows and the question type into an input prompt and feeding it to the question-answering model to obtain the answer;
the new sequence processing formula is expressed as:

$T = H(t_1) \oplus H(t_2) \oplus \cdots \oplus H(t_m)$

$t_i = C(h_1) \oplus D(v_{i,1}) \oplus C(h_2) \oplus D(v_{i,2}) \oplus \cdots \oplus C(h_n) \oplus D(v_{i,n})$

where T denotes the table sequence data; $t_i$ denotes the i-th data sequence of the table sequence data; m denotes the number of rows and n the number of columns of the question-answering task table; $[h_1, h_2, \ldots, h_n]$ denotes the header and $v_{i,j}$ the cells of the question-answering task table; $C(\cdot)$ appends a colon after a header string; $D(\cdot)$ prepends a vertical bar before a cell-value string; $\oplus$ denotes the concatenation operation; $H(\cdot)$ adds delimiters before and after each row data sequence of the table sequence data.
CN202410382106.6A 2024-04-01 2024-04-01 Large model form question-answering method Active CN117972070B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410382106.6A CN117972070B (en) 2024-04-01 2024-04-01 Large model form question-answering method


Publications (2)

Publication Number Publication Date
CN117972070A (en) 2024-05-03
CN117972070B true CN117972070B (en) 2024-06-18

Family

ID=90846469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410382106.6A Active CN117972070B (en) 2024-04-01 2024-04-01 Large model form question-answering method

Country Status (1)

Country Link
CN (1) CN117972070B (en)


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220044134A1 (en) * 2020-08-10 2022-02-10 Accenture Global Solutions Limited Intelligent question answering on tabular content
CN114020862B (en) * 2021-11-04 2024-06-11 中国矿业大学 Search type intelligent question-answering system and method for coal mine safety regulations
CN115062070A (en) * 2022-05-30 2022-09-16 中国电子科技集团公司第十研究所 Question and answer based text table data query method
CN115455935A (en) * 2022-09-14 2022-12-09 华东师范大学 Intelligent text information processing system
CN115794871A (en) * 2022-12-07 2023-03-14 电子科技大学 Table question-answer processing method based on Tapas model and graph attention network
CN116303971A (en) * 2023-03-29 2023-06-23 重庆交通大学 Few-sample form question-answering method oriented to bridge management and maintenance field
CN117743526A (en) * 2023-11-21 2024-03-22 浙江大学计算机创新技术研究院 Table question-answering method based on large language model and natural language processing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110054A (en) * 2019-03-22 2019-08-09 北京中科汇联科技股份有限公司 A method of obtaining question and answer pair in the slave non-structured text based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a question-answering system based on semantic templates; Liang Zhengping et al.; Journal of Shenzhen University (Science and Engineering); 2007-07-31; Vol. 24, No. 03; pp. 281-285 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant