CN116821297A - Stylized legal consultation question-answering method, system, storage medium and equipment - Google Patents

Info

Publication number
CN116821297A
Authority
CN
China
Prior art keywords
data
legal
stylized
question
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310768858.1A
Other languages
Chinese (zh)
Inventor
林凯文
姚昱材
于祥雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayuan Computing Technology Shanghai Co ltd
Original Assignee
Huayuan Computing Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huayuan Computing Technology Shanghai Co ltd
Priority to CN202310768858.1A
Publication of CN116821297A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a stylized legal consultation question-answering method, system, storage medium and equipment, relating to the technical field of natural language processing. The method comprises: collecting text data and audio-video data in the legal consultation field and converting them into text; generating a structured text data set from the text using a base model; dividing the structured text data set into a basic legal knowledge base and stylized knowledge bases according to the style types of seed instructions; fine-tuning the base model on labeled vertical-domain training sets of different styles using a low-rank approximation fine-tuning method to obtain a pre-trained fine-tuned model; and generating, by the pre-trained fine-tuned model, an answer in the corresponding style to a legal consultation question received from a user. With this technical solution, the stylized model can be iterated more quickly, low-latency question answering is achieved, legal consultation answers are provided in the style the user prefers, and the efficiency of legal consultation work is improved.

Description

Stylized legal consultation question-answering method, system, storage medium and equipment
Technical Field
The invention relates to the technical field of natural language processing, and in particular to an iterable stylized legal consultation question-answering method, an iterable stylized legal consultation question-answering system, a computer-readable storage medium, and an iterable stylized legal consultation question-answering terminal device.
Background
With the rapid development of science and technology, more and more industries have greatly improved their production efficiency through AI enablement, with significant results in practice. In recent years, the new generation of artificial intelligence technology known as AIGC (AI-Generated Content, the use of artificial intelligence to generate content) has been built mainly on deep learning theory, with pre-trained foundation models as the base and large-scale data as support. The national artificial intelligence strategy, the "New Generation Artificial Intelligence Development Plan", states that the country advocates applying artificial intelligence technology to the judicial field, exploring in depth its use in legal collaboration tools and the like, and promoting its application to hot issues of social governance such as administrative management, judicial management, urban management and environmental protection, so as to modernize social governance. In 2022, the white paper on AI-generated content jointly issued by the China Academy of Information and Communications Technology and the JD Explore Academy stated that AI-generated content is leading a profound transformation that is reshaping or even overturning the way digital content is produced, and discussed new forms of business represented by virtual digital humans. Against this background, digital transformation has become a general trend, not only in traditional fields such as media and e-commerce but also in government.
Digitally enabling grassroots petition (letters-and-visits) consultation means handling the public's complaint cases efficiently and quickly, in a way that is easy to understand and clearly organized. This maximizes working efficiency while bringing personalized service to the public, greatly improving their sense of well-being and satisfaction.
In recent years, intelligent consultation systems designed with natural language algorithms have answered users' questions by searching a legal consultation database and returning the most semantically similar results. The limitation of this approach is that it depends too heavily on the data in the database. In practical settings such as in-person petition consultation, consultants have different backgrounds and face different events, so the consultation they expect differs, and a customized consultation tailored to the user's personal preferences cannot be achieved. With the increasingly mature pre-trained language models of recent years, AI-Generated Content (AIGC) can be used to provide personalized, fine-grained consultation services to different users.
In traditional deep learning, gradients are computed for the parameters of every layer of the neural network. For a pre-trained model with a very large number of parameters, this full-parameter mode of gradient updating consumes a large amount of computing resources and takes a long time.
A pre-trained language base model is a natural language processing technology based on deep learning. Through unsupervised learning on large-scale text data, it acquires language capabilities beyond those of traditional networks: it goes past the limits of basic natural language models, such as text semantic classification and text translation, and extends to higher-order capabilities such as dialogue, contextual semantic memory, question answering and logical reasoning.
The development of pre-trained language base models can be traced back to 2015, when Google proposed a language model based on a recurrent neural network, namely the Recurrent Neural Network Language Model (RNNLM). Trained on large-scale text data, this model predicts the probability of the next word and thereby functions as a language model. In 2018, OpenAI proposed a Transformer-based pre-trained language base model. Through the self-attention mechanism, this model handles long text effectively and generalizes better. The success of the Transformer model inspired later researchers, who went on to develop a series of pre-trained language base models such as BERT and GPT-2.
Django is an open-source web application framework developed in Python. Using the MTV (Model-Template-View) pattern, it allows a secure and maintainable system platform to be developed rapidly. At present, artificial intelligence algorithm models are mainly written in Python, and both image preprocessing and deep neural network algorithms can be implemented with Python modules: image preprocessing can be done with modules such as OpenCV, Pillow and scikit-image, and the required neural network can be built quickly with modules such as TensorFlow, PyTorch and Keras.
At present, large generative pre-trained models of the ChatGPT class deployed on servers suffer from slow inference, stuttering responses, high video memory usage and insufficient robustness.
Disclosure of Invention
In view of the above problems, the invention provides a stylized legal consultation question-answering method, system, storage medium and equipment. Structured training text is generated from unstructured data, which facilitates parameter fine-tuning of the base model. Using a parameter-efficient fine-tuning method based on low-rank approximation, the stylized model can be iterated more quickly and low-latency question answering is achieved. While guaranteeing low-latency inference, the local stylized knowledge bases and the basic legal knowledge base are used to generate, from the user's question, answer sentences related to the context and the prompt words of the question, which ensures the stability and reliability of the language model's output. Laws and regulations, together with related consultation guidance in the tone and logic of the corresponding style, can be provided according to the legal consultation service style the user prefers, improving the efficiency of legal consultation work.
To achieve the above object, the invention provides an iterable stylized legal consultation question-answering method, comprising:
collecting user question-answer usage data, audio-video data and legal knowledge data in the legal consultation field;
converting the audio data in the audio-video data into text using speech-to-text technology, and recognizing the subtitle region of the video data in the audio-video data using OCR (optical character recognition) technology and converting it into text;
generating a structured text data set from the text obtained by converting the user question-answer usage data, the legal knowledge data and the audio-video data, using the semantic understanding and context-based text generation capabilities of a base model;
constructing a seed instruction database from the user question-answer usage data, the base model dividing the structured text data set into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions in the seed instruction database;
fine-tuning the base model on labeled vertical-domain training sets of different styles using the low-rank approximation fine-tuning method LoRA, to obtain a pre-trained fine-tuned model;
and generating, by the pre-trained fine-tuned model of the corresponding style and according to the style type of the received user legal consultation question, an answer in the corresponding style based on the basic legal knowledge base or the stylized knowledge base.
In the above technical solution, preferably, converting the audio data in the audio-video data into text using speech-to-text technology, and recognizing the subtitle region of the video data and converting it into text using OCR technology, specifically comprises:
reading the audio data using AI-based speech-to-text technology and converting the corresponding speech into text;
and reading the video data, detecting the box position according to a user-defined preset input box, performing text recognition on the video subtitles at the box position using deep-learning-based OCR technology, and converting the subtitles into text in the corresponding language.
In the above technical solution, preferably, generating the structured text data set from the text obtained by converting the user question-answer usage data, the legal knowledge data and the audio-video data, using the semantic understanding and context-based text generation capabilities of the base model, specifically comprises:
extracting key text summaries from the converted text using the semantic understanding capability of the base model;
turning each extracted key text summary into structured text in "instruction-prompt-answer" form using the context-based text generation capability of the base model;
and processing the converted text in batches, the resulting collection of structured texts forming the structured text data set.
In the above technical solution, preferably, constructing the seed instruction database from the user question-answer usage data, and dividing the structured text data set by the base model into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions, specifically comprises:
analyzing the user question-answer usage data, determining seed instructions for different vertical domains, and constructing the seed instruction database from these seed instructions;
and regenerating new instructions for the structured texts in the structured text data set according to the style types of the seed instructions in the seed instruction database, forming data sets in "instruction-answer" form, so as to obtain an enhanced basic legal knowledge base and enhanced stylized knowledge bases.
In the above technical solution, preferably, fine-tuning the base model on labeled vertical-domain training sets of different styles using the low-rank approximation fine-tuning method LoRA to obtain the pre-trained fine-tuned model specifically comprises:
labeling the data sets of the different vertical domains to obtain labeled training sets of different style types;
training the base model with the labeled training sets;
and freezing the parameters of the base model and applying the LoRA mechanism, so that the trainable attention weight matrices in the Transformer architecture of the base model are no longer updated in full but have their parameter changes recorded through low-rank matrix decomposition, to obtain the pre-trained fine-tuned model.
In the above technical solution, preferably, generating the answer in the corresponding style by the pre-trained fine-tuned model of the corresponding style, according to the style type of the received user legal consultation question and based on the basic legal knowledge base or the stylized knowledge base, specifically comprises:
when a user legal consultation question is received, determining its style type using the semantic recognition capability of the base model;
inputting the user legal consultation question into the pre-trained fine-tuned model of the style corresponding to that style type;
screening, by the pre-trained fine-tuned model, the basic legal knowledge base or the stylized knowledge base according to the user legal consultation question and its style type, to obtain the "instruction-answer" data whose similarity to the question reaches a preset threshold, or a preset number of the most similar "instruction-answer" entries;
and generating and outputting an answer sentence in the corresponding style from the "instruction-answer" data.
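The screening step above can be illustrated with a minimal sketch. The source does not say how similarity is computed, so the bag-of-words cosine similarity below is only a stand-in, and the names `tokenize`, `cosine_similarity` and `retrieve` are hypothetical. The ASCII word tokenizer would need to be replaced for Chinese text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase ASCII word tokens; a stand-in for a real tokenizer or embedding model."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, knowledge_base, threshold=None, top_k=None):
    """Return (score, entry) pairs sorted by score, filtered by threshold and/or cut to top_k."""
    q_vec = Counter(tokenize(question))
    scored = [
        (cosine_similarity(q_vec, Counter(tokenize(entry["instruction"]))), entry)
        for entry in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[0] >= threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return scored


# Placeholder entries for illustration only; answers are intentionally left abstract.
kb = [
    {"instruction": "How do I terminate a labor contract?", "answer": "(answer text)"},
    {"instruction": "What is the limitation period for a loan dispute?", "answer": "(answer text)"},
]
hits = retrieve("terminate labor contract", kb, top_k=1)
```

Either `threshold` or `top_k` can be supplied, mirroring the "preset threshold or preset quantity" alternatives in the text; the retrieved entries would then be passed to the fine-tuned model as context.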
In the above technical solution, preferably, the iterable stylized legal consultation question-answering method further comprises:
extracting a new instruction from the received user legal consultation question and feeding it back into the seed instruction database;
and generating new "instruction-answer" data of the corresponding style from the generated answer sentence, and storing it in the basic legal knowledge base or the stylized knowledge base of the corresponding style.
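As a rough illustration of this feedback loop, the sketch below keeps the seed instruction database and the knowledge bases in memory. The class name `KnowledgeStore`, its fields, and the style label `"basic"` are hypothetical; a real deployment would persist these to a database.

```python
from collections import defaultdict


class KnowledgeStore:
    """In-memory stand-in for the seed instruction database and the knowledge bases."""

    def __init__(self):
        self.seed_instructions = []       # seed instruction database
        self.bases = defaultdict(list)    # style name -> list of instruction-answer pairs

    def record_interaction(self, instruction, answer, style="basic"):
        """Feed a new instruction and its generated answer back into storage."""
        if instruction not in self.seed_instructions:
            self.seed_instructions.append(instruction)
        self.bases[style].append({"instruction": instruction, "answer": answer})


store = KnowledgeStore()
store.record_interaction("Explain notice periods", "(generated answer)", style="plain")
store.record_interaction("Explain notice periods", "(another generated answer)", style="plain")
```

Duplicate instructions are stored once in the seed list, while every generated answer is kept in the knowledge base of its style.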
The invention also provides an iterable stylized legal consultation question-answering system, which applies the iterable stylized legal consultation question-answering method disclosed in any one of the above technical solutions and comprises:
a data collection module, configured to collect user question-answer usage data, audio-video data and legal knowledge data in the legal consultation field;
a data conversion module, configured to convert the audio data in the audio-video data into text using speech-to-text technology, and to recognize the subtitle region of the video data in the audio-video data using OCR (optical character recognition) technology and convert it into text;
a data processing module, configured to generate a structured text data set from the text obtained by converting the user question-answer usage data, the legal knowledge data and the audio-video data, using the semantic understanding and context-based text generation capabilities of the base model;
a knowledge classification module, configured to construct a seed instruction database from the user question-answer usage data, the base model dividing the structured text data set into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions in the seed instruction database;
a model fine-tuning module, configured to fine-tune the base model on labeled vertical-domain training sets of different styles using the low-rank approximation fine-tuning method LoRA, to obtain a pre-trained fine-tuned model;
and a question answering module, configured to generate, by the pre-trained fine-tuned model of the corresponding style and according to the style type of the received user legal consultation question, an answer in the corresponding style based on the basic legal knowledge base or the stylized knowledge base.
The invention also provides a computer-readable storage medium storing at least one instruction that can be executed by a processor to implement the iterable stylized legal consultation question-answering method disclosed in any one of the above technical solutions.
The invention also provides an iterable stylized legal consultation question-answering terminal device, comprising a memory and a processor, wherein the memory is used to store at least one instruction and the processor is used to execute the at least one instruction so as to implement the iterable stylized legal consultation question-answering method disclosed in any one of the above technical solutions.
Compared with the prior art, the invention has the following beneficial effects. Structured training text is generated from unstructured data, which facilitates parameter fine-tuning of the base model. Using a parameter-efficient fine-tuning method based on low-rank approximation, the stylized model can be iterated more quickly and low-latency question answering is achieved. While guaranteeing low-latency inference, the local stylized knowledge bases and the basic legal knowledge base are used to generate, from the user's question and its style category, answer sentences related to the context and the prompt words of the question, which ensures the stability and reliability of the language model's output. Laws and regulations, together with related consultation guidance in the tone and logic of the corresponding style, can be provided according to the legal consultation service style the user prefers, improving the efficiency of legal consultation work.
Drawings
FIG. 1 is a flow chart of an iterable stylized legal consultation question-answering method disclosed in one embodiment of the present invention;
FIG. 2 is a schematic diagram of data conversion and data processing according to an embodiment of the present invention;
FIG. 3 is a flow chart of question answering using the pre-trained and fine-tuned base model according to one embodiment of the present invention;
FIG. 4 is a flow chart of the pre-training fine-tuning process of the base model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the parameter update scheme of the LoRA method according to an embodiment of the present invention;
FIG. 6 is a block diagram of an iterable stylized legal consultation question-answering system in accordance with one embodiment of the present invention.
In the figure, the correspondence between each component and its reference numeral is:
1. data collection module; 2. data conversion module; 3. data processing module; 4. knowledge classification module; 5. model fine-tuning module; 6. question answering module.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention is described in further detail below with reference to the attached drawing figures:
As shown in FIG. 1, the iterable stylized legal consultation question-answering method provided by the invention comprises:
collecting user question-answer usage data, audio-video data and legal knowledge data in the legal consultation field;
converting the audio data in the audio-video data into text using speech-to-text technology, and recognizing the subtitle region of the video data in the audio-video data using OCR (optical character recognition) technology and converting it into text;
generating a structured text data set from the text obtained by converting the user question-answer usage data, the legal knowledge data and the audio-video data, using the semantic understanding and context-based text generation capabilities of the base model;
constructing a seed instruction database from the user question-answer usage data, the base model dividing the structured text data set into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions in the seed instruction database;
fine-tuning the base model on labeled vertical-domain training sets of different styles using the low-rank approximation fine-tuning method LoRA, to obtain a pre-trained fine-tuned model;
and generating, by the pre-trained fine-tuned model of the corresponding style and according to the style type of the received user legal consultation question, an answer in the corresponding style based on the basic legal knowledge base or the stylized knowledge base.
In this embodiment, structured training text is generated from unstructured data, which facilitates parameter fine-tuning of the base model. Using a parameter-efficient fine-tuning method based on low-rank approximation, the stylized model can be iterated more quickly and low-latency question answering is achieved. While guaranteeing low-latency inference, the local stylized knowledge bases and the basic legal knowledge base are used to generate, from the user's question and its style category, answer sentences related to the context and the prompt words of the question, which ensures the stability and reliability of the language model's output. Laws and regulations, together with related consultation guidance in the tone and logic of the corresponding style, can be provided according to the legal consultation service style the user prefers, improving the efficiency of legal consultation work.
Specifically, the invention aims to provide a sustainably iterable stylized language model for consultation and question answering on legal issues, which provides fast, easy-to-understand professional answers for real-world grassroots petition consultation and for consultation on civil and criminal cases, and which can give users friendly professional answers after extracting the elements of a case. The method completes an end-to-end language model training process. In implementation, the Django application framework can be used for management and maintenance, enabling automatic data screening, enhancement and storage, scheduled model training, and intelligent deployment of the trained models.
In implementation, the method can be realized entirely in the Python programming language, which avoids unnecessary compatibility problems: the information system uses the Django framework, the algorithm models are built mainly on the PyTorch framework, and structured data is stored uniformly in MySQL databases of version 5.7 or above.
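For orientation, a Django project that stores its structured data in MySQL is typically configured through the `DATABASES` setting. The fragment below is a sketch only: the database name, user, password and host are placeholders and do not come from the source.

```python
# settings.py fragment (sketch): Django's built-in MySQL backend.
# NAME, USER, PASSWORD and HOST are placeholders, not values from the source.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "legal_qa",
        "USER": "legal_qa_user",
        "PASSWORD": "change-me",
        "HOST": "127.0.0.1",
        "PORT": "3306",
        # utf8mb4 lets MySQL store the full Unicode range, which matters for Chinese text.
        "OPTIONS": {"charset": "utf8mb4"},
    }
}
```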
As shown in FIG. 2, in the above embodiment, preferably, converting the audio data in the audio-video data into text using speech-to-text technology, and recognizing the subtitle region of the video data and converting it into text using OCR technology, specifically comprises:
reading the audio data using AI-based speech-to-text technology and converting the corresponding speech into text;
and reading the video data, detecting the box position according to a user-defined preset input box, performing text recognition on the video subtitles at the box position using deep-learning-based OCR technology, and converting the subtitles into text in the corresponding language.
Specifically, in implementation, the data processing end provides system decision logic that classifies and structures the sample data generated by multiple terminals. An automatic program determines the type of each piece of data and processes the local corpus automatically, so that massive stored data such as local user usage records and open-source audio and video material is converted into usable text. Compared with traditional approaches, whose Chinese training data sets do not include the spoken text found in audio and video, this method draws on a wider range of data sources.
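The type-judging logic described above might look like the following sketch, which routes each sample by file extension. The extension sets, route names, and the functions `classify_sample` and `group_by_route` are illustrative assumptions; the source does not say how the data type is detected.

```python
from pathlib import Path

# Extension sets are illustrative assumptions; the source does not list file types.
AUDIO_EXTS = {".wav", ".mp3", ".flac"}
VIDEO_EXTS = {".mp4", ".avi", ".mkv"}
TEXT_EXTS = {".txt", ".json", ".csv"}


def classify_sample(path):
    """Map a file path to the processing route it should take."""
    ext = Path(path).suffix.lower()
    if ext in AUDIO_EXTS:
        return "speech_to_text"
    if ext in VIDEO_EXTS:
        return "subtitle_ocr"
    if ext in TEXT_EXTS:
        return "text_structuring"
    return "unsupported"


def group_by_route(paths):
    """Group a batch of paths by processing route so each route can run in bulk."""
    routes = {}
    for p in paths:
        routes.setdefault(classify_sample(p), []).append(p)
    return routes


batch = group_by_route(["corpus/hearing.MP3", "corpus/lecture.mp4", "corpus/qa_log.json"])
```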
The specific process comprises the following steps:
(1) For complete question-answer style user usage data, the data is identified automatically by a Python program and made into a matched data set of the "instruction-prompt-question-answer" type;
(2) For a complete speech data set, AI-based speech-to-text conversion is applied, and the resulting unstructured text is stored;
(3) For video data, OpenCV-based deep learning OCR is used: the box position is detected from the user-defined input box, and the selected text boxes are recognized in batches to obtain txt text data of the Chinese and English subtitles in the lower box region of the video, aligned with the video time. After the subtitles at the bottom of the video have been recognized by OCR, sensitive information is deleted from the subtitle txt file.
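The sensitive-information deletion mentioned in step (3) is not specified further in the source. One possible sketch, assuming that "sensitive information" includes 11-digit mobile numbers and 18-character ID numbers, uses regular expressions; the patterns, placeholders and the function name `redact` are all hypothetical.

```python
import re

# Patterns are illustrative assumptions: an 18-character ID number (17 digits plus a
# digit or X) and an 11-digit mobile number. A real deployment needs a reviewed list.
PATTERNS = [
    (re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"), "[ID]"),
    (re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"), "[PHONE]"),
]


def redact(text):
    """Replace every match of each sensitive pattern with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Obviously fake values used only to exercise the patterns.
cleaned = redact("Call 13800000000 or cite ID 12345678901234567X in the form.")
```

The lookarounds keep the phone pattern from matching a run of digits inside a longer number.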
As shown in FIG. 2, in the foregoing embodiment, preferably, generating the structured text data set from the text obtained by converting the user question-answer usage data, the legal knowledge data and the audio-video data, using the semantic understanding and context-based text generation capabilities of the base model, specifically comprises:
extracting key text summaries from the converted text using the semantic understanding capability of the base model;
turning each extracted key text summary into structured text in "instruction-prompt-answer" form using the context-based text generation capability of the base model;
and processing the converted text in batches, the resulting collection of structured texts forming the structured text data set.
In implementation, the text obtained by converting the user question-answer usage data, the legal knowledge data and the audio-video data provides massive unstructured text data, which mainly comprises knowledge data related to legal knowledge (legal provisions, legal information, judgment documents and other text) and audio-video text (spoken text carrying the speaker's style). This data needs to be made into structured data in "instruction-prompt-answer" form and stored in a storage device.
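One way to picture an "instruction-prompt-answer" record is as a small typed structure serialized one record per line. The sketch below assumes JSON Lines as the storage format, which the source does not specify; `StructuredRecord`, `to_jsonl` and `from_jsonl` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StructuredRecord:
    instruction: str  # what the model is asked to do
    prompt: str       # supporting context, e.g. the extracted key summary
    answer: str       # the target response


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines back into records, skipping blank lines."""
    return [StructuredRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


records = [
    StructuredRecord(
        instruction="Summarize the tenant's obligations.",
        prompt="(key summary extracted from the source text)",
        answer="(structured answer text)",
    )
]
restored = from_jsonl(to_jsonl(records))
```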
Based on the semantic understanding and context-based text generation capabilities of the base model, the self-instruct capability of the base model is used to extract summaries and legal knowledge question-answer pairs from the video text. The base model extracts key text summaries from the text, the generated content is screened by an overlap metric, and new stylized structured question-answer pairs are generated, achieving data enhancement. The method can generate, in batches, "instruction-question-answer" data sets that are highly relevant to the original text, and this data enhancement method yields better instruction-following ability.
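The source does not define the metric used to screen generated content. As a sketch under that caveat, the code below uses Jaccard token overlap for two checks: a candidate must share enough tokens with the source text to count as relevant, and must not nearly duplicate a candidate already kept. The thresholds and the names `overlap` and `screen` are assumptions, and the ASCII tokenizer would need replacing for Chinese text.

```python
import re


def tokens(text):
    """Set of lowercase ASCII word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Jaccard overlap between the token sets of two strings (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def screen(candidates, source_text, min_relevance=0.1, max_duplicate=0.8):
    """Keep candidates relevant to the source and not near-duplicates of kept ones."""
    kept = []
    for cand in candidates:
        if overlap(cand, source_text) < min_relevance:
            continue  # too little in common with the original text
        if any(overlap(cand, k) > max_duplicate for k in kept):
            continue  # nearly identical to something already kept
        kept.append(cand)
    return kept


source = "the tenant must pay rent on time under the lease contract"
generated = [
    "When must the tenant pay rent?",
    "When must the tenant pay rent",
    "What is the weather today?",
]
kept = screen(generated, source)
```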
According to the stylized legal consultation question-answering method disclosed in the above embodiment, in a specific implementation process, a data collection process and a data conversion process thereof are further described by way of examples in the following table.
In the above embodiment, preferably, constructing the seed instruction database from the user question-answer usage data, and dividing the structured text data set by the base model into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions, specifically comprises:
analyzing the user question-answer usage data, determining seed instructions for different vertical domains, and constructing the seed instruction database from these seed instructions;
and regenerating new instructions for the structured texts in the structured text data set according to the style types of the seed instructions in the seed instruction database, forming data sets in "instruction-answer" form, so as to obtain an enhanced basic legal knowledge base and enhanced stylized knowledge bases.
In this embodiment, the seed instruction database is a database designed according to the user usage record and input data, and contains most of the user instructions in the legal consultation scenario in the vertical field. After the seed instruction is input, the base model firstly judges the category of the seed instruction, because the instruction of the basic legal knowledge base is different from the instruction in the stylized knowledge base. According to the instructions of the stylized knowledge base and the instructions of the basic legal knowledge base, new instructions are respectively generated, and corresponding instruction-answer question-answer pairs are generated by taking each unstructured text in the database as an input text.
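The generation of "instruction-answer" pairs from seed instructions can be sketched as below. All names (`SeedInstruction`, `QAPair`, `build_knowledge_bases`) and the prompt wording are hypothetical, the `generate` callable stands in for the base model, and the style of each seed instruction is assumed to be already known, whereas in the embodiment the base model judges it.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class SeedInstruction:
    text: str
    style: str  # "basic" for the basic legal knowledge base, else a style name


@dataclass
class QAPair:
    instruction: str
    answer: str
    style: str


def build_knowledge_bases(seeds: list[SeedInstruction], texts: list[str],
                          generate: Callable[[str], str]) -> dict[str, list[QAPair]]:
    """For every seed instruction and source text, ask the model for a new
    instruction, then for an answer, and file the pair under the seed's style."""
    bases: dict[str, list[QAPair]] = {}
    for seed in seeds:
        for text in texts:
            instruction = generate(
                f"Write a new instruction in the manner of: {seed.text}\n"
                f"Source text: {text}"
            )
            answer = generate(f"{instruction}\nSource text: {text}")
            bases.setdefault(seed.style, []).append(
                QAPair(instruction, answer, seed.style)
            )
    return bases
```

Passing the model as a callable keeps the sketch independent of any particular base model or serving interface.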
The following are examples:
As shown in FIG. 3, in the foregoing embodiment, preferably, labeled vertical-field training sets of different styles are adopted, and the base model is fine-tuned based on the low-rank approximation fine-tuning method LoRA to obtain a pre-training fine-tuning model. The specific process includes:
labeling the data sets of different vertical fields to obtain labeled training sets of different style types;
training the base model by using the labeled training sets;
freezing the parameters of the base model and, by adopting the LoRA mechanism, changing the update of the trainable attention weight matrices in the Transformer architecture of the base model from full-parameter updating to representing the parameter variation by a low-rank matrix decomposition, so as to obtain the pre-training fine-tuning model.
In this embodiment, fine-tuning gives the base model a stronger ability to follow user instructions, and in particular addresses the original base model's insufficient understanding of instructions in the legal consultation vertical field.
Specifically, fine-tuning is a common technique in the field of natural language processing. It is essentially a supervised learning method, i.e., a pre-trained model is trained under supervision on relatively small-scale, task-specific text. For a pre-trained model containing large-scale parameters, the method can reduce the consumption of computing resources, greatly reduce computing time and improve computing efficiency, and in some cases it can even improve accuracy and generalization capability in the vertical field.
In this embodiment, low-rank approximation fine-tuning (LoRA), a parameter-efficient fine-tuning technique, is used in the step of updating the model parameters, as shown in FIG. 4 and FIG. 5. Because the weight matrices of the pre-trained model have a low intrinsic dimension, all parameters of the original pre-trained model are frozen during the update and a bypass is introduced. The LoRA mechanism makes full use of the low intrinsic rank of the large base model and uses the bypass to simulate full-parameter fine-tuning of the whole model, so that the update of the trainable attention weight matrices in the Transformer architecture changes from full-parameter updating to representing the parameter variation by a low-rank matrix decomposition. The advantage of LoRA is that it reduces the number of model parameters updated in each layer of the Transformer architecture, which improves training throughput without increasing inference latency.
Specifically, assume $W_0$ is a parameter of the pre-trained model and the updated parameter is $W_1$:

$$W_1 = W_0 + \Delta W = W_0 + BA$$

where $W_0$ is any one of the four attention weight matrices ($W_q$, $W_k$, $W_v$, $W_o$) in the Transformer architecture of the pre-trained model, or a weight matrix of the multi-layer perceptron layer.
The LoRA method can select different numbers of weight matrices for parameter-efficient fine-tuning according to the user's computing-power and accuracy requirements.
The number of parameters that need to be updated by the LoRA method used in this embodiment can be expressed quantitatively as:

$$|\Theta| = 2 \times L_{LoRA} \times r \times d$$

where $r$ is the rank, $d$ is the input dimension, and $L_{LoRA}$ is the number of weight matrices fine-tuned by LoRA.
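The update rule and the parameter count above can be checked with a small pure-Python sketch. The shapes B (d x r) and A (r x d) and the initialization (B zero, A small Gaussian) follow the usual LoRA convention and are assumptions here, since the embodiment does not state them; the helper names are hypothetical.

```python
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def lora_param_count(num_weights: int, rank: int, dim: int) -> int:
    """|Theta| = 2 * L_LoRA * r * d: each adapted matrix adds B (d x r) and A (r x d)."""
    return 2 * num_weights * rank * dim


def init_lora(dim: int, rank: int, seed: int = 0):
    """Return (B, A) with B zero and A small Gaussian, so that BA starts at zero."""
    rng = random.Random(seed)
    b = [[0.0] * rank for _ in range(dim)]
    a = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(rank)]
    return b, a


def merged_weight(w0, b, a):
    """W1 = W0 + BA; W0 itself stays frozen and only B and A would be trained."""
    return add(w0, matmul(b, a))
```

For example, with r = 8, d = 4096 and three adapted matrices, `lora_param_count(3, 8, 4096)` gives 196608 trainable parameters, against 3 x 4096 x 4096 = 50331648 for full updating of the same three matrices.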
An embodiment of simulating the parameter update by low-rank approximation is as follows. The parameters are set as:
LoRA_R (rank) | LoRA_Alpha | LoRA_Dropout | Target_Modules
8 | 16 | 0.05 | "query_key_value"
Wherein:
Target_Modules: sets the weight matrices selected by the LoRA method for parameter updating;
query_key_value: W_q, W_k, W_v;
LoRA_R: sets the rank of the update matrices in the bypass of the pre-trained weight matrix;
LoRA_Alpha: the scaling hyperparameter of LoRA; the low-rank update BA is scaled by LoRA_Alpha / LoRA_R before it is added to the output of the frozen weights, which controls how strongly the bypass influences the model;
LoRA_Dropout: dropout is a technique used in neural networks to prevent overfitting, i.e., the outputs of some neurons are randomly set to 0 during training.
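The settings above can be collected in a small configuration object, sketched below. The class and function names are hypothetical; the scaling factor alpha / r is the usual LoRA convention, and the dropout helper uses the common "inverted dropout" rescaling, neither of which is spelled out in the embodiment.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRASettings:
    r: int = 8                      # LoRA_R: rank of the bypass matrices
    alpha: int = 16                 # LoRA_Alpha: scaling hyperparameter
    dropout: float = 0.05           # LoRA_Dropout: drop probability in training
    target_modules: tuple = ("query_key_value",)

    @property
    def scaling(self) -> float:
        """Factor applied to the low-rank update BA (assumed alpha / r)."""
        return self.alpha / self.r


def apply_dropout(values, p, rng, training=True):
    """Zero each value with probability p during training and rescale the rest
    by 1 / (1 - p) so the expected value is unchanged; identity otherwise."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

With the values of this embodiment the scaling factor is 16 / 8 = 2.0.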
In the foregoing embodiment, preferably, according to the style type of the received user legal consultation question, the pre-training fine-tuning model of the corresponding style generates an answer of the corresponding style based on the basic legal knowledge base or the stylized knowledge base. The specific process includes:
when a user legal consultation question sent by a user is received, determining the style type of the user legal consultation question by means of the semantic recognition capability of the base model;
inputting the user legal consultation question into the pre-training fine-tuning model of the style corresponding to that style type;
screening, by the pre-training fine-tuning model, the basic legal knowledge base or the stylized knowledge base according to the user legal consultation question and the corresponding style type, to obtain the "instruction-answer" data whose similarity to the user legal consultation question reaches a preset threshold, or a preset number of the most similar "instruction-answer" data;
and generating and outputting an answer sentence of the corresponding style according to the "instruction-answer" data.
Specifically, the legal understanding of the pre-training fine-tuning model is used to address the reliability of the answers. On the basis of the dictionary of the base model, the model is connected to a local, extensible legal-corpus knowledge base. For an instruction input at the user side, the semantic recognition capability of the base model is first used to screen, by sentence similarity, the Top-K most relevant pieces of legal knowledge in the legal-corpus knowledge base as the prompt instruction and the context. These form the "instruction-prompt" input information, which guides the pre-training fine-tuning model to output more reliable legal consultation results while retaining the stylized answer.
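The Top-K screening and the assembly of the "instruction-prompt" input can be sketched as follows. This is a simplified stand-in: the embodiment relies on the semantic recognition capability of the base model, whereas the sketch uses bag-of-words cosine similarity over whitespace tokens (which would need a proper tokenizer or sentence embeddings for Chinese text). The function names and the prompt layout are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy sentence representation: lower-cased whitespace token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k(question: str, knowledge_base: list, k: int = 3, min_score: float = 0.0) -> list:
    """Return up to k entries whose instruction is most similar to the question,
    keeping only entries that score above min_score."""
    q = embed(question)
    scored = [(cosine(q, embed(item["instruction"])), item) for item in knowledge_base]
    scored = [pair for pair in scored if pair[0] > min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


def build_prompt(question: str, retrieved: list) -> str:
    """Put the retrieved legal knowledge in front of the user question."""
    context = "\n".join(f"- {item['answer']}" for item in retrieved)
    return f"Reference legal knowledge:\n{context}\n\nQuestion: {question}"
```

The `k` and `min_score` arguments correspond to the "preset quantity" and "preset threshold" alternatives described in the steps above.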
In the implementation process, the pre-trained model is fine-tuned according to the style of the law-teaching corpus, and the effect of the pre-training fine-tuning model is demonstrated by the examples in the following table:
For the example in the table above, it can be seen that:
In the response of the base model, the model states, as required by the user's input, which laws the criminal suspect in the case has violated, including an explanation of intentional injury and opponent crime. The explanation mainly shows the professional tone of legal documents, and its audience is legal practitioners.
The output of the pre-training fine-tuning model gives a spoken-style overview of the case, extracts and describes key details such as Wu Mou grabbing the dumbbell from the ground, and, by relating the criminal suspect to the facts of the case, explains the severity of the behavior and the corresponding conviction in plain terms, so that the violated law is more easily understood and accepted by the general public.
In the above embodiment, preferably, the iteratable stylized legal consultation question-answering method further includes:
extracting a new instruction from the received user legal consultation question, and feeding the new instruction back to the seed instruction database;
and generating new "instruction-answer" data of the corresponding style type according to the generated answer sentence, and storing the new "instruction-answer" data in the basic legal knowledge base or the stylized knowledge base of the corresponding style type.
By continuously enriching the seed instruction database, the basic legal knowledge base and the stylized knowledge base, more sufficient training data can be provided for the model, and the databases are kept continuously updated.
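The feedback step can be sketched as below, under simplifying assumptions: the databases are plain in-memory Python containers, and "extracting a new instruction" is reduced to whitespace normalization of the question, which is only a placeholder for the extraction performed by the model. The function name is hypothetical.

```python
def record_interaction(seed_db: list, knowledge_bases: dict,
                       question: str, answer: str, style: str) -> str:
    """Feed one answered question back into the seed instruction database
    and into the knowledge base of the matching style; return the instruction."""
    instruction = " ".join(question.split())  # placeholder for instruction extraction
    if instruction not in seed_db:            # avoid storing the same seed twice
        seed_db.append(instruction)
    knowledge_bases.setdefault(style, []).append(
        {"instruction": instruction, "answer": answer}
    )
    return instruction
```

Each call grows the data available for the next round of fine-tuning, which is what makes the method iteratable.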
As shown in FIG. 6, the present invention further provides an iteratable stylized legal consultation question-answering system, which applies the iteratable stylized legal consultation question-answering method disclosed in any one of the above embodiments and includes:
the data collection module 1, used for collecting user question-answer data, audio-video data and legal knowledge data in the legal consultation field;
the data conversion module 2, used for converting the audio data of the audio-video data into text by using a voice-to-text technology, and for identifying the subtitle region of the video data in the audio-video data by using an OCR (optical character recognition) technology and converting it into text;
the data processing module 3, used for generating a structured text data set, by using the semantic understanding and context-based text generation capability of the base model, from the text obtained by converting the user question-answer data, the legal knowledge data and the audio-video data;
the knowledge classification module 4, used for constructing a seed instruction database according to the user question-answer data, the base model dividing the structured text data set into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions in the seed instruction database;
the model fine-tuning module 5, used for fine-tuning the base model with labeled vertical-field training sets of different styles, based on the low-rank approximation fine-tuning method LoRA, to obtain a pre-training fine-tuning model;
and the question-answering module 6, used for generating, by the pre-training fine-tuning model of the corresponding style, an answer of the corresponding style based on the basic legal knowledge base or the stylized knowledge base according to the style type of the received user legal consultation question.
In this embodiment, the functions implemented by the modules of the iteratable stylized legal consultation question-answering system correspond to the steps of the legal consultation question-answering method disclosed in the foregoing embodiments. In a specific implementation, reference may be made to the method in the foregoing embodiments, which is not described here again.
The present invention also proposes a computer-readable storage medium storing at least one instruction executable by a processor to implement an iteratable stylized legal counseling question-answering method as disclosed in any one of the above embodiments.
The invention also provides an iteratable stylized legal consultation question and answer terminal device, which comprises a memory and a processor, wherein the memory is used for storing at least one instruction, and the processor is used for executing the at least one instruction so as to realize the iteratable stylized legal consultation question and answer method disclosed in any one of the above embodiments.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An iteratable stylized legal consultation question-answering method, comprising:
collecting user question-answer data, audio-video data and legal knowledge data in the legal consultation field;
converting the audio data of the audio-video data into text by using a voice-to-text technology, and identifying the subtitle region of the video data in the audio-video data by using an OCR (optical character recognition) technology and converting it into text;
generating a structured text data set, by using the semantic understanding and context-based text generation capability of a base model, from the text obtained by converting the user question-answer data, the legal knowledge data and the audio-video data;
constructing a seed instruction database according to the user question-answer data, the base model dividing the structured text data set into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions in the seed instruction database;
fine-tuning the base model with labeled vertical-field training sets of different styles, based on the low-rank approximation fine-tuning method LoRA, to obtain a pre-training fine-tuning model;
and generating, by the pre-training fine-tuning model of the corresponding style, an answer of the corresponding style based on the basic legal knowledge base or the stylized knowledge base according to the style type of the received user legal consultation question.
2. The iterative stylized legal counseling question-answering method according to claim 1, wherein the voice-to-text technology is used to convert the audio data of the audio-video data into text, and the OCR recognition technology is used to recognize the subtitle region of the video data in the audio-video data and convert the subtitle region into text, and the specific process includes:
reading the audio data by utilizing a voice-to-text technology based on artificial intelligence, and converting voice corresponding to the audio data into text;
and reading the video data, detecting the frame position according to a user-defined preset input frame, performing character recognition on the video subtitle at the frame position by utilizing an OCR recognition technology based on deep learning, and converting the video subtitle into text characters of a corresponding language.
3. The iterative stylized legal consultation question-answering method according to claim 2, wherein generating a structured text data set, by using the semantic understanding and context-based text generation capability of the base model, from the text obtained by converting the user question-answer data, the legal knowledge data and the audio-video data specifically includes:
extracting key text abstracts from the text obtained by converting the user question-answer data, the legal knowledge data and the audio-video data by using the semantic understanding capability of the base model;
making the extracted key text abstracts into structured text in the "instruction-prompt-answer" form by using the context-based text generation capability of the base model;
and processing in batches the text obtained by converting the user question-answer data, the legal knowledge data and the audio-video data, the resulting set of structured texts forming the structured text data set.
4. The iterative stylized legal consultation question-answering method according to claim 3, wherein constructing a seed instruction database according to the user question-answer data, and the base model dividing the structured text data set into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions in the seed instruction database, specifically includes:
analyzing the user question-answer data, determining seed instructions in different vertical fields, and constructing the seed instruction database from the seed instructions;
and regenerating new instructions for the structured texts in the structured text data set according to the style types of the seed instructions in the seed instruction database to form a data set in the "instruction-answer" form, so as to obtain an enhanced basic legal knowledge base and an enhanced stylized knowledge base.
5. The iterative stylized legal consultation question-answering method according to claim 4, wherein fine-tuning the base model with labeled vertical-field training sets of different styles, based on the low-rank approximation fine-tuning method LoRA, to obtain a pre-training fine-tuning model specifically includes:
labeling the data sets of different vertical fields to obtain labeled training sets of different style types;
training the base model by using the labeled training sets;
freezing the parameters of the base model and, by adopting the LoRA mechanism, changing the update of the trainable attention weight matrices in the Transformer architecture of the base model from full-parameter updating to representing the parameter variation by a low-rank matrix decomposition, so as to obtain the pre-training fine-tuning model.
6. The iterative stylized legal consultation question-answering method according to claim 5, wherein generating, by the pre-training fine-tuning model of the corresponding style, an answer of the corresponding style based on the basic legal knowledge base or the stylized knowledge base according to the style type of the received user legal consultation question specifically includes:
when a user legal consultation question sent by a user is received, determining the style type of the user legal consultation question by means of the semantic recognition capability of the base model;
inputting the user legal consultation question into the pre-training fine-tuning model of the style corresponding to that style type;
screening, by the pre-training fine-tuning model, the basic legal knowledge base or the stylized knowledge base according to the user legal consultation question and the corresponding style type, to obtain the "instruction-answer" data whose similarity to the user legal consultation question reaches a preset threshold, or a preset number of the most similar "instruction-answer" data;
and generating and outputting an answer sentence of the corresponding style according to the "instruction-answer" data.
7. The iterative stylized legal consultation question-answering method of claim 6, further comprising:
extracting a new instruction according to the received legal consultation problem of the user, and feeding back the new instruction to the seed instruction database;
and generating new instruction-answer data of the corresponding style according to the generated answer sentence, and storing the new instruction-answer data in the basic legal knowledge base or the stylized knowledge base of the corresponding style.
8. An iteratable stylized legal counseling question-answering system, characterized in that an iteratable stylized legal counseling question-answering method according to any one of claims 1 to 7 is applied, comprising:
the data collection module, used for collecting user question-answer data, audio-video data and legal knowledge data in the legal consultation field;
the data conversion module, used for converting the audio data of the audio-video data into text by using a voice-to-text technology, and for identifying the subtitle region of the video data in the audio-video data by using an OCR (optical character recognition) technology and converting it into text;
the data processing module, used for generating a structured text data set, by using the semantic understanding and context-based text generation capability of the base model, from the text obtained by converting the user question-answer data, the legal knowledge data and the audio-video data;
the knowledge classification module, used for constructing a seed instruction database according to the user question-answer data, the base model dividing the structured text data set into a basic legal knowledge base and stylized knowledge bases of different styles according to the style types of the seed instructions in the seed instruction database;
the model fine-tuning module, used for fine-tuning the base model with labeled vertical-field training sets of different styles, based on the low-rank approximation fine-tuning method LoRA, to obtain a pre-training fine-tuning model;
and the question-answering module, used for generating, by the pre-training fine-tuning model of the corresponding style, an answer of the corresponding style based on the basic legal knowledge base or the stylized knowledge base according to the style type of the received user legal consultation question.
9. A computer-readable storage medium storing at least one instruction executable by a processor to implement the iterative stylized legal counseling question-answering method of any one of claims 1 to 7.
10. An iteratable stylized legal counseling question and answer terminal device, characterized in that the terminal device comprises a memory for storing at least one instruction and a processor for executing the at least one instruction to implement the iteratable stylized legal counseling question and answer method of any one of claims 1 to 7.
CN202310768858.1A 2023-06-27 2023-06-27 Stylized legal consultation question-answering method, system, storage medium and equipment Pending CN116821297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310768858.1A CN116821297A (en) 2023-06-27 2023-06-27 Stylized legal consultation question-answering method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310768858.1A CN116821297A (en) 2023-06-27 2023-06-27 Stylized legal consultation question-answering method, system, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN116821297A true CN116821297A (en) 2023-09-29

Family

ID=88125307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310768858.1A Pending CN116821297A (en) 2023-06-27 2023-06-27 Stylized legal consultation question-answering method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN116821297A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117669737A (en) * 2023-12-20 2024-03-08 中科星图数字地球合肥有限公司 Method for constructing and using large language model in end-to-end geographic industry
CN117669737B (en) * 2023-12-20 2024-04-26 中科星图数字地球合肥有限公司 Method for constructing and using large language model in end-to-end geographic industry


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination