CN113220853B - Automatic generation method and system for legal questions - Google Patents

Automatic generation method and system for legal questions

Info

Publication number
CN113220853B
Authority
CN
China
Prior art keywords
legal
questions
classes
question
fused
Prior art date
Legal status
Active
Application number
CN202110514787.3A
Other languages
Chinese (zh)
Other versions
CN113220853A (en)
Inventor
冯建周
龙景
韩春龙
邵文彪
Current Assignee
Shengming Jizhi (Beijing) Technology Co., Ltd.
Original Assignee
Yanshan University
Priority date
Filing date
Publication date
Application filed by Yanshan University
Priority to CN202110514787.3A
Publication of CN113220853A
Application granted
Publication of CN113220853B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and a system for automatically generating legal questions. First, a legal question set for a specific scenario is constructed, comprising n questions. Second, based on the principle of semantic similarity, text clustering is performed on the legal question set to obtain m classes. Then, the importance values of the questions in each class are summed to obtain a per-class question importance sum. Finally, using a text summarization algorithm, a set number of questions to be fused is selected from each of the K classes with the highest importance sums and input into a pre-trained language fine-tuning model for fusion; the fused results are automatically converted into interrogative form, and the K questions are put to the user. During legal consultation, several semantically similar questions are fused and condensed into one highly condensed fused question, and several such questions are asked at once rather than thrown out one by one, so that more information is obtained in each round of dialogue, the dialogue is shortened, working efficiency is improved, and the user experience is better.

Description

Automatic generation method and system for legal questions
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a system for automatically generating legal questions.
Background
Most existing legal consultation systems design a large number of questions for a specific scenario, pose them to the user in sequence, and branch according to the user's answers to finally reach a consultation result. However, the consultation process of such systems is tedious and rigid: only single questions can be thrown out one by one, which harms working efficiency and user experience.
Disclosure of Invention
The invention aims to provide a method and a system for automatically generating legal questions that present K questions at once, improving working efficiency and user experience.
In order to achieve the above object, the present invention provides an automatic generation method of legal questions, comprising:
constructing a legal question set for a specific scenario; the legal question set comprises n questions, wherein n is a positive integer greater than 2;
performing text clustering on the legal question set based on a semantic similarity principle to obtain m classes; each class includes at least one question;
summing the importance values of the questions in each class to obtain a per-class question importance sum;
using a text summarization algorithm, selecting a set number of questions to be fused from each of the K classes with the highest question importance sums, inputting them into a pre-trained language fine-tuning model for fusion, automatically converting the results into interrogative form, and putting the K questions to the user; the questions in a class containing more than one question are called questions to be fused.
Optionally, performing text clustering on the legal question set based on the semantic similarity principle to obtain m classes specifically includes:
converting each legal question in the legal question set into a corresponding vector to construct a vector set;
calculating the distance between any two vectors in the vector set using cosine similarity;
treating each vector as one class, giving n classes initially;
merging the two vectors with the minimum distance into one class using an agglomerative hierarchical clustering algorithm;
judging whether the total number of classes is less than or equal to the target number m; if so, outputting each class to form a class set; otherwise, returning to the step of merging the two vectors with the minimum distance into one class using the agglomerative hierarchical clustering algorithm. The class set comprises m classes.
Optionally, using the text summarization algorithm to select a set number of questions to be fused from the K classes with the highest question importance sums, inputting them into the pre-trained language fine-tuning model for fusion, automatically converting them into interrogative form, and putting the K questions to the user specifically includes:
inserting a mark at the beginning of each question to be fused, and distinguishing the multiple input questions to be fused with segment embeddings to obtain a question sequence vector;
inputting the question sequence vector into the encoder of the pre-trained language fine-tuning model for encoding to obtain an encoded sequence;
extracting features from the encoded sequence through a deep neural network and a multi-head attention mechanism to obtain a feature-extraction sequence;
inputting the feature-extraction sequence into the decoder of the pre-trained language fine-tuning model to restore characters, obtaining an initial question;
automatically converting the initial questions into interrogative form and putting the K questions to the user.
Optionally, before inserting a mark at the beginning of each question to be fused and distinguishing the multiple input questions to be fused with segment embeddings to obtain the question sequence vector, the method further includes:
acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
optimizing the encoder and decoder of a pre-trained language model;
fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the pre-trained language fine-tuning model.
Optionally, the method further comprises: marking and storing questions according to the user's replies to the K questions.
The invention also provides an automatic generation system of legal questions, which comprises:
a legal question set building module, used for constructing a legal question set for a specific scenario; the legal question set comprises n questions, wherein n is a positive integer greater than 2;
a text clustering module, used for performing text clustering on the legal question set based on a semantic similarity principle to obtain m classes; each class includes at least one question;
a summing module, used for summing the importance values of the questions in each class to obtain a per-class question importance sum;
a question fusion module, used for selecting, with a text summarization algorithm, a set number of questions to be fused from each of the K classes with the highest question importance sums, inputting them into a pre-trained language fine-tuning model for fusion, automatically converting the results into interrogative form, and putting the K questions to the user; the questions in a class containing more than one question are called questions to be fused.
Optionally, the text clustering module specifically includes:
a vector set construction unit, used for converting each legal question in the legal question set into a corresponding vector and constructing a vector set;
a distance determining unit, used for calculating the distance between any two vectors in the vector set using cosine similarity;
an initialization unit, used for treating each vector as one class, giving n classes initially;
a clustering unit, used for merging the two vectors with the minimum distance into one class using an agglomerative hierarchical clustering algorithm;
a judging unit, used for judging whether the total number of classes is less than or equal to the target number m; if so, outputting each class to form a class set; otherwise, returning to the clustering unit. The class set comprises m classes.
Optionally, the question fusion module specifically includes:
a question sequence vector determining unit, used for inserting a mark at the beginning of each question to be fused and distinguishing the multiple input questions to be fused with segment embeddings to obtain a question sequence vector;
an encoding unit, used for inputting the question sequence vector into the encoder of the pre-trained language fine-tuning model for encoding to obtain an encoded sequence;
a feature extraction unit, used for extracting features from the encoded sequence through a deep neural network and a multi-head attention mechanism to obtain a feature-extraction sequence;
a character restoration unit, used for inputting the feature-extraction sequence into the decoder of the pre-trained language fine-tuning model to restore characters and obtain an initial question;
a question conversion unit, used for automatically converting the initial questions into interrogative form and putting the K questions to the user.
Optionally, the system further comprises:
an acquisition module, used for acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
an optimization module, used for optimizing the encoder and decoder of a pre-trained language model;
a fine-tuning module, used for fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the pre-trained language fine-tuning model.
Optionally, the system further comprises:
a marking and storing module, used for marking and storing questions according to the user's replies to the K questions.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
during legal consultation, several semantically similar questions are fused and condensed into one highly condensed fused question, and several such questions are asked at once, so that more information is obtained in each round of dialogue, the dialogue is shortened, working efficiency is improved, and the user experience is better.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of an automatic legal question generation method of the present invention;
FIG. 2 is a diagram of a system for automatically generating legal questions according to the present invention;
FIG. 3 is a schematic diagram of a structure of a text clustering module according to the present invention;
FIG. 4 is a schematic diagram of a problem fusion module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a method and a system for automatically generating legal questions, which are used for simultaneously presenting K questions and improving the working efficiency and the user experience.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
Example 1
As shown in fig. 1, the present invention discloses an automatic generation method of legal questions, which comprises:
step S1: constructing a legal problem set under a specific scene; the legal question set comprises n questions, wherein n is a positive integer larger than 2.
Step S2: based on a semantic similarity principle, performing text clustering on the legal problem set to obtain m types; each class includes at least one question.
And step S3: and adding and summing the importance degrees of a plurality of problems in each category to obtain a problem importance degree sum.
And step S4: inputting a set number of problems to be fused selected from K types with the highest problem importance sum into a pre-training language fine tuning model for fusion respectively by adopting a text abstract algorithm, automatically converting the problems into question forms, and proposing K problems to a user; the problem with the number of problems in each category being greater than 1 is called the problem to be fused.
The individual steps are discussed in detail below:
step S1: constructing a legal problem set under a specific scene; the legal question set Q = { Q = 1 ,Q 2 ,Q 3 ,...,Q n }; wherein Q is n Represents the nth legal question; the specific scenes comprise traffic cases, marital cases, property cases and the like. E.g. Q n Can be "when the case occurred? ".
Step S2: performing text clustering on the legal question set based on the semantic similarity principle to obtain m classes, which specifically includes:
Step S21: converting each legal question in the legal question set into a corresponding vector to construct a vector set; the vector set P = {P_1, P_2, P_3, ..., P_n}, where P_n represents the vector corresponding to the nth legal question.
Step S22: calculating the distance between any two vectors in the vector set using cosine similarity; the distance value in turn represents the semantic similarity between any two questions.
Step S23: treating each vector as one class, giving n classes initially.
Step S24: merging the two vectors with the minimum distance into one class using an agglomerative hierarchical clustering algorithm. In this embodiment the minimum distance represents the closest semantic similarity.
Step S25: judging whether the total number of classes is less than or equal to the target number m; if so, outputting each class to form a class set; if the total number of classes is greater than m, returning to step S24. The class set G = {G_1, G_2, G_3, ..., G_m}, where G_m represents the mth class; each class includes at least one question; the questions in a class containing more than one question are called questions to be fused, where 1 ≤ m ≤ n.
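The loop of steps S23 to S25 can be sketched directly. Below is a minimal illustration in Python, assuming the question vectors from step S21 are already available; the function names are illustrative, and the single-link merge rule follows Example 3, not a reference implementation from the patent.

```python
import numpy as np

def cosine_distance(a, b):
    # Step S22: distance = 1 - cosine similarity between two question vectors
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def agglomerative_cluster(vectors, m):
    """Steps S23-S25: single-link agglomerative clustering down to m classes.
    vectors: list of 1-D numpy arrays; returns lists of question indices."""
    clusters = [[i] for i in range(len(vectors))]      # S23: one class per vector
    while len(clusters) > m:                           # S25: stop at <= m classes
        best_i, best_j, best_d = 0, 1, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between the closest members of two classes
                d = min(cosine_distance(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        clusters[best_i].extend(clusters.pop(best_j))  # S24: merge the closest pair
    return clusters
```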
Step S3: summing the importance values of the questions in each class to obtain a per-class question importance sum.
Step S4: using a text summarization algorithm, selecting a set number of questions to be fused from each of the K classes with the highest question importance sums, inputting them into the pre-trained language fine-tuning model for fusion, automatically converting the results into interrogative form, and putting the K questions to the user. In this embodiment the set number is 3-4, where 1 ≤ K ≤ m.
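A minimal sketch of this selection step, assuming each question's importance value is already stored (for example, in the question situation library of Example 2); the helper name and the per-question ranking inside a class are illustrative assumptions.

```python
def select_fusion_candidates(clusters, importance, k, set_number=4):
    # Step S3: rank classes by their per-class question importance sum
    by_importance = sorted(
        clusters,
        key=lambda cls: sum(importance[q] for q in cls),
        reverse=True)
    # Step S4: from each of the K highest-scoring classes, take the set
    # number (3-4) of questions to be fused
    return [sorted(cls, key=lambda q: importance[q], reverse=True)[:set_number]
            for cls in by_importance[:k]]

# usage: clusters as produced above (lists of question indices),
# importance as a list or dict of per-question importance values
# groups = select_fusion_candidates(clusters, importance, k=3)
```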
Step S4 specifically includes:
step S41: inserting a mark into the beginning of each problem to be fused, and distinguishing a plurality of input problems to be fused by using interval segments to obtain a problem sequence vector; the label in this embodiment is a [ CLS ] label.
Step S42: and inputting the problem sequence vector into an encoder in a pre-training language fine tuning model for encoding to obtain an encoding sequence.
Step S43: and performing feature extraction on the coding sequence through a deep neural network and a multi-head attention mechanism to obtain a feature extraction sequence.
Step S44: and inputting the characteristic extraction sequence into a decoder in a pre-training language fine tuning model for character reduction to obtain an initial problem.
Step S45: and automatically converting the initial questions into question forms, and providing K questions for the user.
Step S5: constructing the pre-trained language fine-tuning model, which specifically includes:
Step S51: acquiring a training data set; the training data set includes a plurality of questions processed into the required data format.
The data format processing is as follows:
To distinguish the multiple questions input into the pre-trained language fine-tuning model, an external [CLS] special mark is first inserted at the beginning of each question, and each [CLS] mark gathers the features of the question it precedes. At the same time, segment embeddings are used to distinguish the multiple questions to be fused in the input: each sent_i is assigned a segment embedding E_A or E_B depending on whether i is odd or even. For example, for the sentence sequence [sent_1, sent_2, sent_3], the invention assigns the embeddings [E_A, E_B, E_A]. In this way the invention completes hierarchical learning of the question sequence representation: the lower Transformer layers represent adjacent sentences, and the upper layers, combined with the multi-head attention mechanism, represent the multiple original question sentences.
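A minimal sketch of this input formatting, assuming a Hugging Face BERT tokenizer; the bert-base-chinese checkpoint is an assumption, as the patent does not name one.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def build_fusion_input(questions):
    """Insert a [CLS] mark before each question to be fused and assign
    alternating segment embeddings E_A / E_B (ids 0 / 1) by parity."""
    token_ids, segment_ids = [], []
    for i, question in enumerate(questions):
        ids = [tokenizer.cls_token_id]
        ids += tokenizer.encode(question, add_special_tokens=False)
        ids += [tokenizer.sep_token_id]    # [SEP] marks the sentence boundary
        token_ids += ids
        segment_ids += [i % 2] * len(ids)  # E_A for sent_1, E_B for sent_2, ...
    return token_ids, segment_ids
```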
Step S52: an encoder and a decoder in a pre-trained language model are optimized.
The invention uses a standard encoder-decoder framework to design a generative model. The encoder is a pre-trained BERT model and the decoder is a randomly initialized 6-layer Transformer model.
Step S53: fine-tuning the parameters of the optimized pre-trained language model to obtain the pre-trained language fine-tuning model.
The present invention designs a fine-tuning task that separates the optimizers of the encoder and the decoder. The encoder and decoder use two separate Adam optimizers with β1 = 0.8 and β2 = 0.9, each with its own warm-up steps and learning rate:

lr_E = lr̃_E · min(step^(-0.5), step · warmup_E^(-1.5))
lr_D = lr̃_D · min(step^(-0.5), step · warmup_D^(-1.5))

where warmup_E = 12000 for the encoder, warmup_D = 8000 for the decoder, and lr̃_E, lr̃_D are the respective peak learning rates. This is based on the assumption that the pre-trained encoder should be fine-tuned with a smaller learning rate and smoother decay, so that the encoder can still be trained with accurate gradients while the decoder is becoming stable.
The task of the generative model is conceptualized as a sequence-to-sequence problem: a small data set is created to fine-tune the pre-trained model with a supervised learning algorithm. The encoder maps the source document to x = [x_1, ..., x_n], and the decoder then generates a target summary y = [y_1, ..., y_m], modeling the conditional probability p(y_1, ..., y_m | x_1, ..., x_n) in an autoregressive manner; the purpose of the fine-tuning stage is therefore to optimize this conditional probability. The model architecture designed by the invention uses the BERT pre-trained language model, which reduces training time and cost but also causes a mismatch between the encoder and the decoder; the invention therefore optimizes the encoder and the decoder independently with different optimization functions to resolve this.
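A minimal PyTorch sketch of the split-optimizer schedule above; the peak learning rates (2e-3 and 0.1) are illustrative assumptions, since only the beta values and warm-up steps are stated here.

```python
import torch

def lr_at(step, peak_lr, warmup):
    # lr = peak_lr * min(step^(-0.5), step * warmup^(-1.5))
    return peak_lr * min(step ** -0.5, step * warmup ** -1.5)

def make_split_optimizers(encoder, decoder):
    # two separate Adam optimizers, both with beta1 = 0.8, beta2 = 0.9
    enc_opt = torch.optim.Adam(encoder.parameters(), betas=(0.8, 0.9))
    dec_opt = torch.optim.Adam(decoder.parameters(), betas=(0.8, 0.9))
    return enc_opt, dec_opt

def set_learning_rates(step, enc_opt, dec_opt, peak_enc=2e-3, peak_dec=0.1):
    # warmup_E = 12000 for the pre-trained encoder, warmup_D = 8000 for the
    # from-scratch decoder: the encoder warms up more slowly and decays
    # more smoothly, as described above
    for opt, peak, warmup in ((enc_opt, peak_enc, 12000), (dec_opt, peak_dec, 8000)):
        for group in opt.param_groups:
            group["lr"] = lr_at(step, peak, warmup)
```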
Example 2
As shown in fig. 2, the present invention also provides an automatic legal question generation system, which includes:
a legal problem set constructing module 201, configured to construct a legal problem set in a specific scene; the legal question set comprises n questions, wherein n is a positive integer larger than 2.
The text clustering module 202 is configured to perform text clustering on the legal problem sets based on a semantic similarity principle to obtain m types; each class includes at least one question.
And the summing module 203 is used for summing the importance degrees of a plurality of problems in each category to obtain the sum of the importance degrees of the problems.
The question fusion module 204 is used for inputting a set number of questions to be fused selected from K classes with the highest sum of importance of the questions into the pre-training language fine tuning model for fusion respectively by adopting a text summarization algorithm, automatically converting the questions into question forms and proposing K questions to the user; the problem with the number of problems in each category being greater than 1 is called the problem to be fused.
As one implementation, the text clustering module 202 of the present invention specifically includes:
a vector set construction unit, used for converting each legal question in the legal question set into a corresponding vector and constructing a vector set;
a distance determining unit, used for calculating the distance between any two vectors in the vector set using cosine similarity;
an initialization unit, used for treating each vector as one class, giving n classes initially;
a clustering unit, used for merging the two vectors with the minimum distance into one class using an agglomerative hierarchical clustering algorithm;
a judging unit, used for judging whether the total number of classes is less than or equal to the target number m; if so, outputting each class to form a class set; otherwise, returning to the clustering unit. The class set comprises m classes.
As one implementation, the question fusion module 204 specifically includes:
a question sequence vector determining unit, used for inserting a mark at the beginning of each question to be fused and distinguishing the multiple input questions to be fused with segment embeddings to obtain a question sequence vector;
an encoding unit, used for inputting the question sequence vector into the encoder of the pre-trained language fine-tuning model for encoding to obtain an encoded sequence;
a feature extraction unit, used for extracting features from the encoded sequence through a deep neural network and a multi-head attention mechanism to obtain a feature-extraction sequence;
a character restoration unit, used for inputting the feature-extraction sequence into the decoder of the pre-trained language fine-tuning model to restore characters and obtain an initial question;
a question conversion unit, used for automatically converting the initial questions into interrogative form and putting the K questions to the user.
As one implementation, the system of the present invention further comprises:
an acquisition module, used for acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
an optimization module, used for optimizing the encoder and decoder of the pre-trained language model;
a fine-tuning module, used for fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the pre-trained language fine-tuning model.
As one implementation, the system of the present invention further comprises:
a marking and storing module, used for marking and storing questions according to the user's replies to the K questions. In this embodiment the marking and storing module is in effect a question situation library that stores the questions to be asked in different scenarios, the answers to those questions, and the importance of each question within its scenario. The questions in the question situation library are continuously updated according to the users' answers, and questions that already have answers are marked so that they do not appear in subsequent text clustering, preventing the model from asking the same question repeatedly.
Example 3
As shown in fig. 3, the agglomerative hierarchical clustering model mainly includes the following points:
1. The main idea of the agglomerative hierarchical clustering algorithm is to treat each sample point as a class and then repeatedly merge the two closest classes (this is the "agglomerative" part) until the iteration termination condition is met.
2. First, the question set for a given legal scenario is input as character strings. After preprocessing steps such as stop-word removal and encoding checks, each question in the set is converted into a word vector with a word-embedding tool and stored in a list. Agglomerative hierarchical clustering is then performed on all vectors in the list: the cosine similarity between the vectors corresponding to each pair of questions is taken as their similarity, and the two closest sample units are merged according to the single-link principle. The result is a list of cluster labels, one for each question.
As shown in fig. 4, the question fusion module includes the following points:
1. The invention designs a BERT-based pre-trained language fine-tuning model that can encode short documents and obtain representations of their sentences; in the BERT diagram, each Trm in the upper row is connected to each Trm in the lower row. The pre-trained language fine-tuning model adopts an encoder-decoder framework, combining a BERT encoder that has completed pre-training with a randomly initialized Transformer decoder. The optimization of the encoder and the decoder is likewise separated, since the former is pre-trained while the latter must be trained from scratch.
2. The input questions are first preprocessed by inserting two kinds of special marks. [CLS] is attached at the beginning of the text to indicate the start of the whole document's information, and [SEP] is inserted after each sentence as an indicator of sentence boundaries. The modified text is then represented as a series of tokens X = [w_1, w_2, ..., w_n]. Each token is given three embeddings: a token embedding representing its meaning, a segment embedding distinguishing the sentences, and a position embedding for its position in the text sequence. These three embeddings are summed into one input vector before being input into the BERT model.
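A minimal sketch of how the three embeddings are summed into the input vector; the hidden size of 768 is the usual BERT-base value and is assumed here.

```python
import torch
import torch.nn as nn

class BertStyleInput(nn.Module):
    """Token + segment + position embeddings, summed into one input vector."""
    def __init__(self, vocab_size, max_len=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)    # meaning of each mark
        self.segment = nn.Embedding(2, hidden)           # distinguishes sentences
        self.position = nn.Embedding(max_len, hidden)    # position in the sequence

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))
```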
3. The present invention conceptualizes the task of the generative model as a sequence-to-sequence problem: the encoder maps the source document to x = [x_1, ..., x_n], and the decoder then generates the target summary y = [y_1, ..., y_m], modeling the conditional probability p(y_1, ..., y_m | x_1, ..., x_n) in an autoregressive manner, thereby optimizing the question fusion model.
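Maximizing this autoregressive conditional probability corresponds to the usual token-level cross-entropy objective; a sketch under teacher forcing, with shapes assumed:

```python
import torch.nn.functional as F

def fusion_loss(logits, target_ids):
    # logits: (batch, tgt_len, vocab) from the decoder
    # target_ids: (batch, tgt_len) gold summary tokens
    # maximizing p(y_1..y_m | x_1..x_n) autoregressively is equivalent to
    # minimizing the cross-entropy of each y_t given y_<t and the source
    return F.cross_entropy(logits.transpose(1, 2), target_ids)
```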
4. The invention uses a standard encoder-decoder framework to design the generative model. The encoder is a pre-trained BERT model, and the decoder is a randomly initialized 6-layer Transformer model. The invention designs a new fine-tuning task that separates the optimizers of the encoder and the decoder. The encoder and decoder use two separate Adam optimizers with β1 = 0.8 and β2 = 0.9, each with its own warm-up steps and learning rate:

lr_E = lr̃_E · min(step^(-0.5), step · warmup_E^(-1.5))
lr_D = lr̃_D · min(step^(-0.5), step · warmup_D^(-1.5))

where warmup_E = 12000 and warmup_D = 8000. This is based on the assumption that the pre-trained encoder should be fine-tuned with a smaller learning rate and smoother decay. Thus, the encoder can still be trained with accurate gradients while the decoder is becoming stable.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.

Claims (6)

1. An automatic generation method of legal questions, characterized by comprising:
constructing a legal question set for a specific scenario; the legal question set comprises n questions, wherein n is a positive integer greater than 2;
performing text clustering on the legal question set based on a semantic similarity principle to obtain m classes; each class includes at least one question;
summing the importance values of the questions in each class to obtain a per-class question importance sum;
using a text summarization algorithm, selecting a set number of questions to be fused from each of the K classes with the highest question importance sums, inputting them into a pre-trained language fine-tuning model for fusion, automatically converting the results into interrogative form, and putting the K questions to the user; the questions in a class containing more than one question are called questions to be fused;
acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
optimizing the encoder and decoder of a pre-trained language model;
fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the pre-trained language fine-tuning model;
wherein using the text summarization algorithm to select the set number of questions to be fused from the K classes with the highest question importance sums, inputting them into the pre-trained language fine-tuning model for fusion, automatically converting them into interrogative form, and putting the K questions to the user specifically comprises:
inserting a mark at the beginning of each question to be fused, and distinguishing the multiple input questions to be fused with segment embeddings to obtain a question sequence vector;
inputting the question sequence vector into the encoder of the pre-trained language fine-tuning model for encoding to obtain an encoded sequence;
extracting features from the encoded sequence through a deep neural network and a multi-head attention mechanism to obtain a feature-extraction sequence;
inputting the feature-extraction sequence into the decoder of the pre-trained language fine-tuning model to restore characters, obtaining an initial question;
automatically converting the initial questions into interrogative form and putting the K questions to the user.
2. The automatic generation method of legal questions according to claim 1, characterized in that performing text clustering on the legal question set based on the semantic similarity principle to obtain m classes specifically comprises:
converting each legal question in the legal question set into a corresponding vector to construct a vector set;
calculating the distance between any two vectors in the vector set using cosine similarity;
treating each vector as one class, giving n classes initially;
merging the two vectors with the minimum distance into one class using an agglomerative hierarchical clustering algorithm;
judging whether the total number of classes is less than or equal to the target number m; if so, outputting each class to form a class set; otherwise, returning to the step of merging the two vectors with the minimum distance into one class using the agglomerative hierarchical clustering algorithm. The class set comprises m classes.
3. The automatic generation method of legal questions according to claim 1, characterized by further comprising: marking and storing questions according to the user's replies to the K questions.
4. An automatic generation system for legal questions, characterized by comprising:
a legal question set building module, used for constructing a legal question set for a specific scenario; the legal question set comprises n questions, wherein n is a positive integer greater than 2;
a text clustering module, used for performing text clustering on the legal question set based on a semantic similarity principle to obtain m classes; each class includes at least one question;
a summing module, used for summing the importance values of the questions in each class to obtain a per-class question importance sum;
a question fusion module, used for selecting, with a text summarization algorithm, a set number of questions to be fused from each of the K classes with the highest question importance sums, inputting them into a pre-trained language fine-tuning model for fusion, automatically converting the results into interrogative form, and putting the K questions to the user; the questions in a class containing more than one question are called questions to be fused;
an acquisition module, used for acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
an optimization module, used for optimizing the encoder and decoder of a pre-trained language model;
a fine-tuning module, used for fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the pre-trained language fine-tuning model;
wherein the question fusion module specifically comprises:
a question sequence vector determining unit, used for inserting a mark at the beginning of each question to be fused and distinguishing the multiple input questions to be fused with segment embeddings to obtain a question sequence vector;
an encoding unit, used for inputting the question sequence vector into the encoder of the pre-trained language fine-tuning model for encoding to obtain an encoded sequence;
a feature extraction unit, used for extracting features from the encoded sequence through a deep neural network and a multi-head attention mechanism to obtain a feature-extraction sequence;
a character restoration unit, used for inputting the feature-extraction sequence into the decoder of the pre-trained language fine-tuning model to restore characters and obtain an initial question;
a question conversion unit, used for automatically converting the initial questions into interrogative form and putting the K questions to the user.
5. The automatic generation system for legal questions according to claim 4, characterized in that the text clustering module specifically comprises:
a vector set construction unit, used for converting each legal question in the legal question set into a corresponding vector and constructing a vector set;
a distance determining unit, used for calculating the distance between any two vectors in the vector set using cosine similarity;
an initialization unit, used for treating each vector as one class, giving n classes initially;
a clustering unit, used for merging the two vectors with the minimum distance into one class using an agglomerative hierarchical clustering algorithm;
a judging unit, used for judging whether the total number of classes is less than or equal to the target number m; if so, outputting each class to form a class set; otherwise, returning to the clustering unit. The class set comprises m classes.
6. The automatic generation system for legal questions according to claim 4, characterized by further comprising:
a marking and storing module, used for marking and storing questions according to the user's replies to the K questions.
CN202110514787.3A 2021-05-12 2021-05-12 Automatic generation method and system for legal questions Active CN113220853B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110514787.3A CN113220853B (en) 2021-05-12 2021-05-12 Automatic generation method and system for legal questions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110514787.3A CN113220853B (en) 2021-05-12 2021-05-12 Automatic generation method and system for legal questions

Publications (2)

Publication Number Publication Date
CN113220853A CN113220853A (en) 2021-08-06
CN113220853B true CN113220853B (en) 2022-10-04

Family

ID=77094843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110514787.3A Active CN113220853B (en) 2021-05-12 2021-05-12 Automatic generation method and system for legal questions

Country Status (1)

Country Link
CN (1) CN113220853B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018086401A1 (en) * 2016-11-14 2018-05-17 Ping An Technology (Shenzhen) Co., Ltd. Cluster processing method and device for questions in automatic question and answering system
CN108090049A (en) * 2018-01-17 2018-05-29 Shandong Technology and Business University Multi-document summary extraction method and system based on sentence vector
CN110134771A (en) * 2019-04-09 2019-08-16 Guangdong University of Technology Implementation method of a question answering system based on a multi-attention-mechanism fusion network
CN112749262A (en) * 2020-07-24 2021-05-04 Tencent Technology (Shenzhen) Co., Ltd. Question and answer processing method and device based on artificial intelligence, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10262062B2 (en) * 2015-12-21 2019-04-16 Adobe Inc. Natural language system question classifier, semantic representations, and logical form templates
US11157536B2 (en) * 2016-05-03 2021-10-26 International Business Machines Corporation Text simplification for a question and answer system
DK179049B1 (en) * 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
CN111339303B (en) * 2020-03-06 2023-08-22 Chengdu Xiaoduo Technology Co., Ltd. Text intention induction method and device based on clustering and automatic abstracting
CN112765315B (en) * 2021-01-18 2022-09-30 Yanshan University Intelligent classification system and method for legal scenes
CN112765345A (en) * 2021-01-22 2021-05-07 Chongqing University of Posts and Telecommunications Text abstract automatic generation method and system fusing pre-training model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Large scale question answering using tourism data; Contractor, D.; arXiv; 2019-09-08; full text *
Research on question classification based on hierarchical attention and multi-channel convolutional bidirectional GRU; Yu Bengong et al.; Data Analysis and Knowledge Discovery; 2020-08-31 (No. 08); pp. 50-62 *
Automatic summary generation method based on multi-dimensional text features; Wang Qingsong et al.; Computer Engineering; 2020-09-30 (No. 09); pp. 110-116 *

Also Published As

Publication number Publication date
CN113220853A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN110795556B (en) Abstract generation method based on fine-grained plug-in decoding
CN110489555B (en) Language model pre-training method combined with similar word information
CN110413746B (en) Method and device for identifying intention of user problem
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114022882B (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
CN108416058A (en) A kind of Relation extraction method based on the enhancing of Bi-LSTM input informations
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN110750630A (en) Generating type machine reading understanding method, device, equipment and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN111814479B (en) Method and device for generating enterprise abbreviations and training model thereof
CN114817467A (en) Intention recognition response method, device, equipment and storage medium
CN112036122A (en) Text recognition method, electronic device and computer readable medium
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN114372140A (en) Layered conference abstract generation model training method, generation method and device
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN113220853B (en) Automatic generation method and system for legal questions
KR20220046771A (en) System and method for providing sentence punctuation
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN112131879A (en) Relationship extraction system, method and device
CN114491004A (en) Title generation method and device, electronic equipment and storage medium
CN110955768B (en) Question-answering system answer generation method based on syntactic analysis
CN114386480A (en) Training method, application method, device and medium of video content description model
CN118468822B (en) Target field text generation method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20221219

Address after: No. J24, Floor 17, No. 1, Zhongguancun Street, Haidian District, Beijing 100084

Patentee after: Shengming Jizhi (Beijing) Technology Co.,Ltd.

Address before: No.438, west section of Hebei Street, Haigang District, Qinhuangdao City, Hebei Province

Patentee before: Yanshan University