CN113220853B - Automatic generation method and system for legal questions - Google Patents

Automatic generation method and system for legal questions

Info

Publication number
CN113220853B
CN113220853B (application CN202110514787.3A)
Authority
CN
China
Prior art keywords
legal
questions
classes
question
fused
Prior art date
Legal status
Active
Application number
CN202110514787.3A
Other languages
Chinese (zh)
Other versions
CN113220853A (en)
Inventor
冯建周
龙景
韩春龙
邵文彪
Current Assignee
Shengming Jizhi Beijing Technology Co ltd
Original Assignee
Yanshan University
Priority date
Filing date
Publication date
Application filed by Yanshan University
Priority to CN202110514787.3A
Publication of CN113220853A
Application granted
Publication of CN113220853B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and a system for automatically generating legal questions. The method first constructs a set of legal questions for a specific scenario, the set containing n questions. Second, it performs text clustering on the question set based on semantic similarity, yielding m classes. It then sums the importance scores of the questions in each class to obtain a per-class importance sum. Finally, from each of the K classes with the highest importance sums, a set number of questions to be fused is selected and input, using a text summarization algorithm, into a fine-tuned pre-trained language model for fusion; the fused results are automatically converted into interrogative form, and the resulting K questions are posed to the user. During legal consultation, several semantically similar questions are thus fused and condensed into one highly condensed question rather than being thrown out one at a time, and several such questions are asked simultaneously, so more information is obtained in each round of dialogue, the dialogue is shortened, working efficiency rises, and the user experience improves.

Description

Automatic generation method and system for legal questions
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a system for automatically generating legal questions.
Background
Most existing legal consultation systems design a large number of questions for a specific scenario, pose them to the user one by one, and branch on the user's answers to finally reach a consultation result. This consultation process is tedious and rigid: the system can only throw out single questions one at a time, which harms both working efficiency and the user experience.
Disclosure of Invention
The invention aims to provide a method and a system for automatically generating legal questions that pose K questions simultaneously, improving working efficiency and the user experience.
To achieve this object, the present invention provides a method for automatically generating legal questions, comprising:
constructing a set of legal questions for a specific scenario; the set comprises n questions, where n is a positive integer greater than 2;
performing text clustering on the legal question set based on semantic similarity to obtain m classes; each class includes at least one question;
summing the importance scores of the questions in each class to obtain a per-class importance sum;
using a text summarization algorithm, selecting a set number of questions to be fused from each of the K classes with the highest importance sums, inputting each selection into a fine-tuned pre-trained language model for fusion, automatically converting the results into interrogative form, and posing the K questions to the user; the questions in any class containing more than one question are called questions to be fused.
Optionally, performing text clustering on the legal question set based on semantic similarity to obtain m classes specifically includes:
converting each legal question in the set into a corresponding vector to construct a vector set;
calculating the distance between any two vectors in the vector set using cosine similarity;
treating each vector as its own class, giving n classes initially;
merging the two closest vectors into one class using an agglomerative hierarchical clustering algorithm;
judging whether the total number of classes is less than or equal to the target number m; if so, outputting the classes to form a class set containing m classes; otherwise, returning to the step of merging the two closest vectors into one class using the agglomerative hierarchical clustering algorithm.
Optionally, selecting a set number of questions to be fused from the K classes with the highest importance sums, inputting them into the fine-tuned pre-trained language model for fusion via the text summarization algorithm, automatically converting the results into interrogative form, and posing the K questions to the user specifically includes:
inserting a token at the beginning of each question to be fused and distinguishing the multiple input questions with interval segments to obtain a question sequence vector;
inputting the question sequence vector into the encoder of the fine-tuned pre-trained language model to obtain an encoded sequence;
extracting features from the encoded sequence with a deep neural network and a multi-head attention mechanism to obtain a feature sequence;
inputting the feature sequence into the decoder of the fine-tuned pre-trained language model for character restoration to obtain an initial question;
and automatically converting the initial questions into interrogative form and posing the K questions to the user.
Optionally, before the step of inserting a token at the beginning of each question to be fused and distinguishing the multiple input questions with interval segments to obtain a question sequence vector, the method further includes:
acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
optimizing the encoder and decoder of a pre-trained language model;
and fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the fine-tuned pre-trained language model.
Optionally, the method further comprises: marking and storing questions according to the user's replies to the K questions.
The invention also provides a system for automatically generating legal questions, comprising:
a legal question set construction module for constructing a set of legal questions for a specific scenario; the set comprises n questions, where n is a positive integer greater than 2;
a text clustering module for performing text clustering on the legal question set based on semantic similarity to obtain m classes; each class includes at least one question;
a summing module for summing the importance scores of the questions in each class to obtain a per-class importance sum;
and a question fusion module for selecting, via a text summarization algorithm, a set number of questions to be fused from each of the K classes with the highest importance sums, inputting them into a fine-tuned pre-trained language model for fusion, automatically converting the results into interrogative form, and posing the K questions to the user; the questions in any class containing more than one question are called questions to be fused.
Optionally, the text clustering module specifically includes:
a vector set construction unit for converting each legal question in the set into a corresponding vector and constructing a vector set;
a distance determination unit for calculating the distance between any two vectors in the vector set using cosine similarity;
an initialization unit for treating each vector as its own class, giving n classes initially;
a clustering unit for merging the two closest vectors into one class using an agglomerative hierarchical clustering algorithm;
and a judging unit for judging whether the total number of classes is less than or equal to the target number m; if so, outputting the classes to form a class set containing m classes; otherwise, returning to the clustering unit.
Optionally, the question fusion module specifically includes:
a question sequence vector determination unit for inserting a token at the beginning of each question to be fused and distinguishing the multiple input questions with interval segments to obtain a question sequence vector;
an encoding unit for inputting the question sequence vector into the encoder of the fine-tuned pre-trained language model to obtain an encoded sequence;
a feature extraction unit for extracting features from the encoded sequence with a deep neural network and a multi-head attention mechanism to obtain a feature sequence;
a character restoration unit for inputting the feature sequence into the decoder of the fine-tuned pre-trained language model to restore characters and obtain an initial question;
and a question conversion unit for automatically converting the initial questions into interrogative form and posing the K questions to the user.
Optionally, the system further comprises:
an acquisition module for acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
an optimization module for optimizing the encoder and decoder of the pre-trained language model;
and a fine-tuning module for fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the fine-tuned pre-trained language model.
Optionally, the system further comprises:
a marking and storing module for marking and storing questions according to the user's replies to the K questions.
According to the specific embodiments provided by the invention, the invention achieves the following technical effects:
during legal consultation, several semantically similar questions are fused and condensed into one highly condensed fused question, and several such questions are asked simultaneously, so more information is obtained in each round of dialogue, the dialogue is shortened, working efficiency rises, and the user experience improves.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are obviously only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of an automatic legal question generation method of the present invention;
FIG. 2 is a diagram of a system for automatically generating legal questions according to the present invention;
FIG. 3 is a schematic diagram of a structure of a text clustering module according to the present invention;
FIG. 4 is a schematic diagram of a problem fusion module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The invention aims to provide a method and a system for automatically generating legal questions that pose K questions simultaneously, improving working efficiency and the user experience.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
Example 1
As shown in fig. 1, the present invention discloses an automatic generation method of legal questions, which comprises:
step S1: constructing a legal problem set under a specific scene; the legal question set comprises n questions, wherein n is a positive integer larger than 2.
Step S2: based on a semantic similarity principle, performing text clustering on the legal problem set to obtain m types; each class includes at least one question.
And step S3: and adding and summing the importance degrees of a plurality of problems in each category to obtain a problem importance degree sum.
And step S4: inputting a set number of problems to be fused selected from K types with the highest problem importance sum into a pre-training language fine tuning model for fusion respectively by adopting a text abstract algorithm, automatically converting the problems into question forms, and proposing K problems to a user; the problem with the number of problems in each category being greater than 1 is called the problem to be fused.
The individual steps are discussed in detail below:
step S1: constructing a legal problem set under a specific scene; the legal question set Q = { Q = 1 ,Q 2 ,Q 3 ,...,Q n }; wherein Q is n Represents the nth legal question; the specific scenes comprise traffic cases, marital cases, property cases and the like. E.g. Q n Can be "when the case occurred? ".
Step S2: perform text clustering on the legal question set based on semantic similarity to obtain m classes, which specifically includes:
Step S21: convert each legal question in the set into a corresponding vector to construct the vector set P = {P_1, P_2, P_3, ..., P_n}, where P_n denotes the vector corresponding to the nth legal question.
Step S22: calculate the distance between any two vectors in the vector set using cosine similarity; the distance value represents the semantic similarity between the two corresponding questions.
Step S23: treat each vector as its own class, giving n classes initially.
Step S24: merge the two closest vectors into one class using the agglomerative hierarchical clustering algorithm; in this embodiment the minimum distance corresponds to the highest semantic similarity.
Step S25: judge whether the total number of classes is less than or equal to the target number m; if so, output the classes to form the class set G = {G_1, G_2, G_3, ..., G_m}, where G_m denotes the mth class, each class includes at least one question, and 1 ≤ m ≤ n; otherwise, return to step S24. The questions in any class containing more than one question are called questions to be fused.
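As a concrete illustration of steps S21 to S25, the sketch below runs single-link agglomerative clustering over pre-computed question vectors. It is a minimal reading of the algorithm as described, not the patented implementation; the function names are ours, and we assume the sentence vectors have already been produced by some embedding tool.

```python
import numpy as np

def cosine_distance(a, b):
    # distance = 1 - cosine similarity; smaller distance = closer semantics
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def agglomerative_cluster(vectors, m):
    """Repeatedly merge the two closest classes until only m classes remain."""
    clusters = [[i] for i in range(len(vectors))]  # step S23: n singleton classes
    while len(clusters) > m:                       # step S25: termination test
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link principle: distance between the closest members
                d = min(cosine_distance(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                             # step S24: merge closest pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters  # m lists of question indices, the class set G
```

The O(n³) pairwise search is acceptable at consultation scale (tens of questions per scenario); caching a distance matrix would be the usual optimization for larger sets.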
Step S3: sum the importance scores of the questions in each class to obtain a per-class importance sum.
Step S4: using the text summarization algorithm, select a set number of questions to be fused from each of the K classes with the highest importance sums, input each selection into the fine-tuned pre-trained language model for fusion, automatically convert the results into interrogative form, and pose the K questions to the user. In this embodiment the set number is 3 to 4, and 1 ≤ K ≤ m.
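The selection in steps S3 and S4 can be as simple as the following sketch; the assumption that each question carries a scalar importance score (here a list indexed like the question set) follows the question situation library described later, and the function name is illustrative.

```python
def select_fusion_candidates(clusters, importance, k, max_per_class=4):
    """Rank classes by summed question importance and keep the top K;
    within each kept class, keep at most the set number (3-4) of questions."""
    ranked = sorted(clusters,
                    key=lambda c: sum(importance[i] for i in c),
                    reverse=True)
    return [sorted(c, key=lambda i: importance[i], reverse=True)[:max_per_class]
            for c in ranked[:k]]
```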
Step S4 specifically includes:
Step S41: insert a token at the beginning of each question to be fused and distinguish the multiple input questions with interval segments to obtain a question sequence vector; in this embodiment the token is the [CLS] token.
Step S42: input the question sequence vector into the encoder of the fine-tuned pre-trained language model to obtain an encoded sequence.
Step S43: extract features from the encoded sequence with a deep neural network and a multi-head attention mechanism to obtain a feature sequence.
Step S44: input the feature sequence into the decoder of the fine-tuned pre-trained language model for character restoration to obtain an initial question.
Step S45: automatically convert the initial questions into interrogative form and pose the K questions to the user.
Step S5: construct the fine-tuned pre-trained language model, which specifically includes:
Step S51: acquire a training data set; the training data set includes a plurality of questions processed into the required data format.
The data format processing is as follows:
to distinguish the multiple questions input into the fine-tuned pre-trained language model, an external [CLS] token is first inserted at the beginning of each question, and each [CLS] token aggregates the features of the question it marks. At the same time, interval segment embeddings distinguish the multiple questions to be fused within the input: each sentence sent_i is assigned the segment embedding E_A or E_B depending on whether i is odd or even. For example, the sentence sequence [sent_1, sent_2, sent_3] is assigned the embeddings [E_A, E_B, E_A]. In this way the model learns a hierarchical representation of the question sequence, with the lower Transformer layers representing adjacent sentences and the upper layers, combined with the multi-head attention mechanism, representing the multiple original questions.
Step S52: an encoder and a decoder in a pre-trained language model are optimized.
The invention uses a standard encoder-decoder framework to design a generative model. The encoder is a pre-trained BERT model and the decoder is a randomly initialized 6-layer Transformer model.
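Wired up in code, this pairing might look like the sketch below. It is a plausible reconstruction under common defaults, not the patented implementation: the 6 decoder layers follow the text, while the head count, hidden size, and checkpoint name are ordinary BERT-base assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class FusionModel(nn.Module):
    """Pre-trained BERT encoder + randomly initialized 6-layer Transformer decoder."""
    def __init__(self, vocab_size, d_model=768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)      # trained from scratch
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, src_segments, tgt_ids):
        # encode the fused-question input (token ids + interval segment ids)
        memory = self.encoder(input_ids=src_ids,
                              token_type_ids=src_segments).last_hidden_state
        tgt = self.embed(tgt_ids)
        # causal mask: each target position attends only to earlier positions
        t = tgt_ids.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf"),
                                     device=tgt_ids.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # logits over the vocabulary at each position
```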
Step S53: fine-tune the parameters of the optimized pre-trained language model to obtain the fine-tuned pre-trained language model.
The present invention devises a fine-tuning task that separates the optimizers of the encoder and the decoder. The encoder and decoder each use an Adam optimizer with β₁ = 0.8 and β₂ = 0.9, but with different warm-up steps and learning rates:

$$lr_E = \tilde{lr}_E \cdot \min(\mathrm{step}^{-0.5},\ \mathrm{step} \cdot \mathrm{warmup}_E^{-1.5})$$

$$lr_D = \tilde{lr}_D \cdot \min(\mathrm{step}^{-0.5},\ \mathrm{step} \cdot \mathrm{warmup}_D^{-1.5})$$

where lr̃_E and lr̃_D are the base learning rates of the encoder and the decoder, warmup_E = 12000 for the encoder, and warmup_D = 8000 for the decoder. This is based on the assumption that the pre-trained encoder should be fine-tuned with a smaller learning rate and smoother decay, so that the encoder can still be trained with accurate gradients while the decoder is becoming stable.
The task of the generative model is conceptualized as a sequence-to-sequence problem: a small data set is created to fine-tune the pre-trained model with a supervised learning algorithm, in which the encoder maps the source document to x = [x_1, ..., x_n] and the decoder generates a target summary y = [y_1, ..., y_m], modeling the conditional probability p(y_1, ..., y_m | x_1, ..., x_n) in an autoregressive manner; the purpose of the fine-tuning phase is therefore to optimize this conditional probability. The architecture uses the BERT pre-trained language model, which reduces training time and cost but also causes a mismatch between the encoder and the decoder, so the invention optimizes the two independently with different optimization functions to resolve this.
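The separate schedules can be realized with two Adam instances whose learning rates are set per step. The warm-up steps and betas below are the ones stated above; the base learning rates are placeholders for values that are not legible in the text, and the model fields follow the earlier sketch.

```python
import torch

def noam_lr(step, base_lr, warmup):
    # lr = base_lr * min(step^-0.5, step * warmup^-1.5), as in the schedule above
    step = max(step, 1)
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

def make_optimizers(model, betas=(0.8, 0.9)):
    enc_opt = torch.optim.Adam(model.encoder.parameters(), betas=betas)
    dec_opt = torch.optim.Adam(
        list(model.decoder.parameters()) + list(model.out.parameters()),
        betas=betas)
    return enc_opt, dec_opt

def apply_schedules(step, enc_opt, dec_opt, base_lr_enc=2e-3, base_lr_dec=0.1):
    # base_lr_enc / base_lr_dec are placeholder values, not the patent's
    for g in enc_opt.param_groups:
        g["lr"] = noam_lr(step, base_lr_enc, warmup=12000)  # encoder: warmup_E = 12000
    for g in dec_opt.param_groups:
        g["lr"] = noam_lr(step, base_lr_dec, warmup=8000)   # decoder: warmup_D = 8000
```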
Example 2
As shown in fig. 2, the present invention also provides an automatic legal question generation system, which includes:
a legal problem set constructing module 201, configured to construct a legal problem set in a specific scene; the legal question set comprises n questions, wherein n is a positive integer larger than 2.
The text clustering module 202 is configured to perform text clustering on the legal problem sets based on a semantic similarity principle to obtain m types; each class includes at least one question.
And the summing module 203 is used for summing the importance degrees of a plurality of problems in each category to obtain the sum of the importance degrees of the problems.
The question fusion module 204 is used for inputting a set number of questions to be fused selected from K classes with the highest sum of importance of the questions into the pre-training language fine tuning model for fusion respectively by adopting a text summarization algorithm, automatically converting the questions into question forms and proposing K questions to the user; the problem with the number of problems in each category being greater than 1 is called the problem to be fused.
As one implementation, the text clustering module 202 specifically includes:
a vector set construction unit for converting each legal question in the set into a corresponding vector and constructing a vector set;
a distance determination unit for calculating the distance between any two vectors in the vector set using cosine similarity;
an initialization unit for treating each vector as its own class, giving n classes initially;
a clustering unit for merging the two closest vectors into one class using the agglomerative hierarchical clustering algorithm;
and a judging unit for judging whether the total number of classes is less than or equal to the target number m; if so, outputting the classes to form a class set containing m classes; otherwise, returning to the clustering unit.
As one embodiment, the question fusion module 204 specifically includes:
a question sequence vector determination unit for inserting a token at the beginning of each question to be fused and distinguishing the multiple input questions with interval segments to obtain a question sequence vector;
an encoding unit for inputting the question sequence vector into the encoder of the fine-tuned pre-trained language model to obtain an encoded sequence;
a feature extraction unit for extracting features from the encoded sequence with a deep neural network and a multi-head attention mechanism to obtain a feature sequence;
a character restoration unit for inputting the feature sequence into the decoder of the fine-tuned pre-trained language model to restore characters and obtain an initial question;
and a question conversion unit for automatically converting the initial questions into interrogative form and posing the K questions to the user.
As an embodiment, the system of the present invention further comprises:
the acquisition module is used for acquiring a training data set; the training data set includes a plurality of questions that are processed by the data format.
And the optimization module is used for optimizing the encoder and the decoder in the pre-training language model.
And the fine tuning module is used for carrying out parameter fine tuning on the optimized pre-training language model according to the training data set to obtain the pre-training language fine tuning model.
As an embodiment, the system of the present invention further comprises:
and the marking and storing module is used for marking and storing the questions according to the replies of the K questions from the user. In this embodiment, the marking and storing module is actually a question situation library, and stores questions to be asked in different scenarios, answers of the questions, and importance of each question in the scenario. And the questions in the question situation library are continuously updated according to the answers of the users, and the questions with the answers are marked, so that the questions do not appear in the subsequent text clustering process, and the model is prevented from repeatedly asking questions.
Example 3
As shown in fig. 3, the agglomerative hierarchical clustering model has the following main points:
1. The main idea of the agglomerative hierarchical clustering algorithm is to treat each sample point as its own class and then repeatedly merge the two closest classes (hence "agglomerative") until an iteration termination condition is met.
2. First, the question set for a given legal scenario is input as character strings. After preprocessing steps such as stop-word removal and encoding checks, each question in the set is converted into a word vector with a word-embedding tool and stored in a list. Agglomerative hierarchical clustering is then performed on all vectors in the list: the cosine similarity between the vectors of each pair of questions is computed as their similarity, the two closest sample units are merged according to the single-link principle, and finally a list of cluster labels, one per question, is obtained.
As shown in fig. 4, the question fusion module has the following main points:
1. The invention designs a BERT-based fine-tuned pre-trained language model that can encode short documents and obtain representations of their sentences; each Trm unit in the upper row of the BERT diagram is connected to every Trm unit in the row below. The model adopts an encoder-decoder framework, combining a BERT encoder that has completed pre-training with a randomly initialized Transformer decoder. The optimization of the encoder and the decoder is likewise decoupled: the former is pre-trained, while the latter must be trained from scratch.
2. The input questions are first preprocessed by inserting two kinds of special tokens. [CLS] is attached at the beginning of the text and marks the start of the overall document information, and [SEP] is inserted after each sentence as an indicator of sentence boundaries. The modified text is then represented as a sequence of tokens X = [w_1, w_2, ..., w_n]. Each token receives three embeddings: a token embedding representing its meaning, a segment embedding distinguishing sentences, and a position embedding encoding the token's position in the sequence. These three embeddings are summed into a single input vector before being fed into the BERT model.
3. The invention conceptualizes the task of the generative model as a sequence-to-sequence problem: the encoder maps the source document to x = [x_1, ..., x_n], and the decoder generates a target summary y = [y_1, ..., y_m], modeling the conditional probability p(y_1, ..., y_m | x_1, ..., x_n) in an autoregressive manner, thereby optimizing the question fusion model.
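Written out, the autoregressive factorization that the fine-tuning stage maximizes over the supervised training pairs is:

$$p(y_1,\dots,y_m \mid x_1,\dots,x_n) \,=\, \prod_{t=1}^{m} p\left(y_t \mid y_1,\dots,y_{t-1},\ x_1,\dots,x_n\right)$$

Each factor is the decoder's next-token distribution given the summary generated so far and the encoded source questions.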
4. The invention uses a standard encoder-decoder framework to design the generative model. The encoder is a pre-trained BERT model and the decoder is a randomly initialized 6-layer Transformer. A new fine-tuning task separates the optimizers of the encoder and the decoder: each uses an Adam optimizer with β₁ = 0.8 and β₂ = 0.9, but with different warm-up steps and learning rates:

$$lr_E = \tilde{lr}_E \cdot \min(\mathrm{step}^{-0.5},\ \mathrm{step} \cdot \mathrm{warmup}_E^{-1.5})$$

$$lr_D = \tilde{lr}_D \cdot \min(\mathrm{step}^{-0.5},\ \mathrm{step} \cdot \mathrm{warmup}_D^{-1.5})$$

where warmup_E = 12000 for the encoder and warmup_D = 8000 for the decoder. This is based on the assumption that the pre-trained encoder should be fine-tuned with a smaller learning rate and smoother decay; the encoder can thus still be trained with accurate gradients while the decoder is becoming stable.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.

Claims (6)

1. A method for automatically generating legal questions, characterized by comprising:
constructing a set of legal questions for a specific scenario; the set comprises n questions, where n is a positive integer greater than 2;
performing text clustering on the legal question set based on semantic similarity to obtain m classes; each class includes at least one question;
summing the importance scores of the questions in each class to obtain a per-class importance sum;
using a text summarization algorithm, selecting a set number of questions to be fused from each of the K classes with the highest importance sums, inputting each selection into a fine-tuned pre-trained language model for fusion, automatically converting the results into interrogative form, and posing the K questions to the user; the questions in any class containing more than one question are called questions to be fused;
acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
optimizing the encoder and decoder of a pre-trained language model;
fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the fine-tuned pre-trained language model;
wherein selecting a set number of questions to be fused from the K classes with the highest importance sums, inputting them into the fine-tuned pre-trained language model for fusion via the text summarization algorithm, automatically converting the results into interrogative form, and posing the K questions to the user specifically comprises:
inserting a token at the beginning of each question to be fused and distinguishing the multiple input questions with interval segments to obtain a question sequence vector;
inputting the question sequence vector into the encoder of the fine-tuned pre-trained language model to obtain an encoded sequence;
extracting features from the encoded sequence with a deep neural network and a multi-head attention mechanism to obtain a feature sequence;
inputting the feature sequence into the decoder of the fine-tuned pre-trained language model for character restoration to obtain an initial question;
and automatically converting the initial questions into interrogative form and posing the K questions to the user.
2. The method for automatically generating legal questions according to claim 1, characterized in that performing text clustering on the legal question set based on semantic similarity to obtain m classes specifically comprises:
converting each legal question in the set into a corresponding vector to construct a vector set;
calculating the distance between any two vectors in the vector set using cosine similarity;
treating each vector as its own class, giving n classes initially;
merging the two closest vectors into one class using an agglomerative hierarchical clustering algorithm;
judging whether the total number of classes is less than or equal to the target number m; if so, outputting the classes to form a class set containing m classes; otherwise, returning to the step of merging the two closest vectors into one class using the agglomerative hierarchical clustering algorithm.
3. The method for automatically generating legal questions according to claim 1, further comprising: marking and storing questions according to the user's replies to the K questions.
4. A system for automatically generating legal questions, characterized by comprising:
a legal question set construction module for constructing a set of legal questions for a specific scenario; the set comprises n questions, where n is a positive integer greater than 2;
a text clustering module for performing text clustering on the legal question set based on semantic similarity to obtain m classes; each class includes at least one question;
a summing module for summing the importance scores of the questions in each class to obtain a per-class importance sum;
a question fusion module for selecting, via a text summarization algorithm, a set number of questions to be fused from each of the K classes with the highest importance sums, inputting them into a fine-tuned pre-trained language model for fusion, automatically converting the results into interrogative form, and posing the K questions to the user; the questions in any class containing more than one question are called questions to be fused;
an acquisition module for acquiring a training data set; the training data set includes a plurality of questions processed into the required data format;
an optimization module for optimizing the encoder and decoder of the pre-trained language model;
a fine-tuning module for fine-tuning the parameters of the optimized pre-trained language model on the training data set to obtain the fine-tuned pre-trained language model;
wherein the question fusion module specifically comprises:
a question sequence vector determination unit for inserting a token at the beginning of each question to be fused and distinguishing the multiple input questions with interval segments to obtain a question sequence vector;
an encoding unit for inputting the question sequence vector into the encoder of the fine-tuned pre-trained language model to obtain an encoded sequence;
a feature extraction unit for extracting features from the encoded sequence with a deep neural network and a multi-head attention mechanism to obtain a feature sequence;
a character restoration unit for inputting the feature sequence into the decoder of the fine-tuned pre-trained language model to restore characters and obtain an initial question;
and a question conversion unit for automatically converting the initial questions into interrogative form and posing the K questions to the user.
5. The system according to claim 4, characterized in that the text clustering module specifically comprises:
a vector set construction unit for converting each legal question in the set into a corresponding vector and constructing a vector set;
a distance determination unit for calculating the distance between any two vectors in the vector set using cosine similarity;
an initialization unit for treating each vector as its own class, giving n classes initially;
a clustering unit for merging the two closest vectors into one class using an agglomerative hierarchical clustering algorithm;
and a judging unit for judging whether the total number of classes is less than or equal to the target number m; if so, outputting the classes to form a class set containing m classes; otherwise, returning to the clustering unit.
6. The system for automatically generating legal questions according to claim 4, further comprising:
a marking and storing module for marking and storing questions according to the user's replies to the K questions.
CN202110514787.3A 2021-05-12 2021-05-12 Automatic generation method and system for legal questions Active CN113220853B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110514787.3A | 2021-05-12 | 2021-05-12 | Automatic generation method and system for legal questions

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110514787.3A | 2021-05-12 | 2021-05-12 | Automatic generation method and system for legal questions

Publications (2)

Publication Number | Publication Date
CN113220853A (en) | 2021-08-06
CN113220853B (en) | 2022-10-04

Family

ID=77094843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110514787.3A Active CN113220853B (en) 2021-05-12 2021-05-12 Automatic generation method and system for legal questions

Country Status (1)

Country Link
CN (1) CN113220853B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018086401A1 (en) * 2016-11-14 2018-05-17 平安科技(深圳)有限公司 Cluster processing method and device for questions in automatic question and answering system
CN108090049A (en) * 2018-01-17 2018-05-29 山东工商学院 Multi-document summary extraction method and system based on sentence vector
CN110134771A (en) * 2019-04-09 2019-08-16 广东工业大学 A kind of implementation method based on more attention mechanism converged network question answering systems
CN112749262A (en) * 2020-07-24 2021-05-04 腾讯科技(深圳)有限公司 Question and answer processing method and device based on artificial intelligence, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10262062B2 (en) * 2015-12-21 2019-04-16 Adobe Inc. Natural language system question classifier, semantic representations, and logical form templates
US11157536B2 (en) * 2016-05-03 2021-10-26 International Business Machines Corporation Text simplification for a question and answer system
DK179049B1 (en) * 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
CN111339303B (en) * 2020-03-06 2023-08-22 成都晓多科技有限公司 Text intention induction method and device based on clustering and automatic abstracting
CN112765315B (en) * 2021-01-18 2022-09-30 燕山大学 Intelligent classification system and method for legal scenes
CN112765345A (en) * 2021-01-22 2021-05-07 重庆邮电大学 Text abstract automatic generation method and system fusing pre-training model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018086401A1 (en) * 2016-11-14 2018-05-17 平安科技(深圳)有限公司 Cluster processing method and device for questions in automatic question and answering system
CN108090049A (en) * 2018-01-17 2018-05-29 山东工商学院 Multi-document summary extraction method and system based on sentence vector
CN110134771A (en) * 2019-04-09 2019-08-16 广东工业大学 A kind of implementation method based on more attention mechanism converged network question answering systems
CN112749262A (en) * 2020-07-24 2021-05-04 腾讯科技(深圳)有限公司 Question and answer processing method and device based on artificial intelligence, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Large scale question answering using tourism data; Contractor, D.; arXiv; 2019-09-08; entire document *
Question classification based on hierarchical attention and multi-channel convolutional bidirectional GRU; Yu Bengong et al.; Data Analysis and Knowledge Discovery; 2020-08-31 (No. 08); pp. 50-62 *
Automatic summary generation method based on multi-dimensional text features; Wang Qingsong et al.; Computer Engineering; 2020-09-30 (No. 09); pp. 110-116 *

Also Published As

Publication number Publication date
CN113220853A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN110795556B (en) Abstract generation method based on fine-grained plug-in decoding
CN110413746B (en) Method and device for identifying intention of user problem
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
CN107844481B (en) Text recognition error detection method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN114022882B (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN110750630A (en) Generating type machine reading understanding method, device, equipment and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN111814479B (en) Method and device for generating enterprise abbreviations and training model thereof
CN113220853B (en) Automatic generation method and system for legal questions
CN112036122A (en) Text recognition method, electronic device and computer readable medium
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN112131879A (en) Relationship extraction system, method and device
CN114491004A (en) Title generation method and device, electronic equipment and storage medium
CN110955768B (en) Question-answering system answer generation method based on syntactic analysis
CN114372140A (en) Layered conference abstract generation model training method, generation method and device
CN114330350A (en) Named entity identification method and device, electronic equipment and storage medium
KR20220046771A (en) System and method for providing sentence punctuation
CN115081459B (en) Spoken language text generation method, device, equipment and storage medium
CN116913278B (en) Voice processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221219

Address after: No. J24, Floor 17, No. 1, Zhongguancun Street, Haidian District, Beijing 100084

Patentee after: Shengming Jizhi (Beijing) Technology Co.,Ltd.

Address before: No.438, west section of Hebei Street, Haigang District, Qinhuangdao City, Hebei Province

Patentee before: Yanshan University